optimization of robust loss functions for weakly-labeled

Int J Comput VisDOI 10.1007/s11263-012-0561-4

Optimization of Robust Loss Functions for Weakly-LabeledImage Taxonomies

Julian J. McAuley · Arnau Ramisa · Tibério S. Caetano

Received: 30 November 2011 / Accepted: 13 August 2012© Springer Science+Business Media, LLC 2012

Abstract The recently proposed ImageNet dataset consistsof several million images, each annotated with a single ob-ject category. These annotations may be imperfect, in thesense that many images contain multiple objects belongingto the label vocabulary. In other words, we have a multi-label problem but the annotations include only a single label(which is not necessarily the most prominent). Such a set-ting motivates the use of a robust evaluation measure, whichallows for a limited number of labels to be predicted and,so long as one of the predicted labels is correct, the over-all prediction should be considered correct. This is indeedthe type of evaluation measure used to assess algorithm per-formance in a recent competition on ImageNet data. Opti-mizing such types of performance measures presents severalhurdles even with existing structured output learning meth-ods. Indeed, many of the current state-of-the-art methods op-timize the prediction of only a single output label, ignoringthis ‘structure’ altogether. In this paper, we show how to di-rectly optimize continuous surrogates of such performancemeasures using structured output learning techniques withlatent variables. We use the output of existing binary clas-sifiers as input features in a new learning stage which op-timizes the structured loss corresponding to the robust per-

J.J. McAuley (�)InfoLab, Stanford University, Gates Building 4A, Stanford,CA 94305-9040, USAe-mail: [email protected]

A. RamisaInstitut de Robòtica i Informàtica Industrial (CSIC-UPC),Parc Tecnologic de Barcelona, c/ Llorens i Artigas 4–6,08028 Barcelona, Spain

T.S. CaetanoPrincipal Researcher, Machine Learning Group, NICTA,Locked Bag 9013, Alexandria NSW 1435, Australia

formance measure. We present empirical evidence that thisallows us to ‘boost’ the performance of binary classificationon a variety of weakly-supervised labeling problems definedon image taxonomies.

Keywords Image labeling · Image tagging · Imagetaxonomies · Structured learning

1 Introduction

The recently proposed ImageNet project consists of build-ing a growing dataset of images, organized into a taxonomybased on the WordNet hierarchy (Deng et al. 2009). Eachnode in this taxonomy includes a large set of images (in thehundreds or thousands). From an object recognition point ofview, this dataset is interesting because it naturally suggeststhe possibility of leveraging the image taxonomy in orderto improve recognition beyond what can be achieved inde-pendently for each image. Indeed this question has been thesubject of much interest recently, culminating in a competi-tion in this context using ImageNet data (Berg et al. 2010;Lin et al. 2011; Sánchez and Perronnin 2011).

Each image in ImageNet may contain several objectsfrom the label vocabulary, however the annotation includesonly a single label per image, and this label is not neces-sarily the most prominent. This ‘imperfect’ annotation sug-gests that a meaningful performance measure in this datasetshould somehow not penalize predictions that contain legit-imate objects that are missing from the annotation. One wayto deal with this issue is to use a robust performance mea-sure based on the following idea: an algorithm is allowed topredict more than one label per image (up to a maximumof K labels, so that the solution is not degenerate), and so

mailto:[email protected]

Int J Comput Vis

long as at least one of those labels agrees with the ground-truth label, no penalty is incurred. This is precisely the typeof performance measure used to evaluate algorithm perfor-mance in the aforementioned competition (Berg et al. 2010).

Another form of ‘weak’ labeling that one typically ob-serves in image datasets is the set of tags associated withan image, i.e., annotations provided by the community ofusers on image hosting websites such as Flickr. As with Im-ageNet data, one observes only positive labels, i.e., we onlyobserve whether an image wasn’t assigned a particular tag,not whether it couldn’t have been. This suggests that sim-ilar performance measures could be used to train a systemfor tag recommendation: it is sufficient that one of the sug-gested tags is similar to one of the groundtruth tags, thoughto our knowledge, this type of robust, hierarchical perfor-mance measure has not been applied to tag prediction.

In this paper, we present an approach for directly optimiz-ing a continuous surrogate of these robust performance mea-sures. In other words, we try to optimize the very measurethat is used to assess recognition quality in the ImageNet2010 Challenge dataset. We show empirically that by usingbinary classifiers as a starting point, which are state-of-the-art for this task, we can boost their performance by meansof optimizing the structured loss. We also apply a variant ofthe same performance measure to the problem of tag recom-mendation, using a recently proposed dataset derived fromFlickr images (Huiskes et al. 2010).

Essentially, we use latent variables to ‘strengthen’ theweakly labeled groundtruth. Intuitively, our latent variablesare designed to represent those objects that appear in animage, but were not annotated. Given that the problem be-comes one of fully-supervised structured learning when thelatent variables are observed, we can use recently proposedtechniques on structured learning with latent variables (Yuand Joachims 2009) to simultaneously optimize the latentvariables and the model parameters.

In addition to experiments on the ImageNet 2010 Chal-lenge dataset, we study labeling problems on two other im-age taxonomies: the MIR Flickr Retrieval Evaluation (MIR,Huiskes and Lew 2008), and the ImageCLEF AnnotationTask (ImageCLEF, Nowak et al. 2011). From the formerdataset we also obtain tag information from Flickr. The la-bel vocabularies in MIR and ImageCLEF are much smallerthan that of the ImageNet 2010 Challenge dataset (24 and 99labels, respectively), meaning that taxonomic information isnot typically used in training or evaluation on these datasets.However, they are useful in the sense that they allow us tostudy the behaviour of the latent variables mentioned above:since these datasets are fully annotated, we can ‘pretend’that they are weakly labeled by withholding part of the an-notation, allowing us to compare the predicted values of thelatent variables with those of the withheld annotation.

Our experiments reveal that our latent variable model isbeneficial for learning in all four of the taxonomies we ex-amine.

An initial version of this paper appeared in McAuleyet al. (2011).

1.1 Literature Review

The success of visual object classification achieved in recentyears is pushing computer vision research towards more dif-ficult goals in terms of the number of object classes and thesize of the training sets used. For example, Perronnin et al.(2010) used increasingly large training sets of Flickr im-ages together with online learning algorithms to improve theperformance of linear SVM classifiers trained to recognizethe 20 Pascal Visual Object Challenge 2007 objects; or Tor-ralba et al. (2008), who defined a gigantic dataset of 75,062classes (using all the nouns in WordNet) populated with 80million tiny images of only 32 × 32 pixels. The WordNetnouns were used in seven search engines, but without anymanual or automatic validation of the downloaded images.Despite their low resolution, the images were shown to stillbe useful for classification.

Similarly, Deng et al. (2009) created ImageNet: a vastdataset with thousands of classes and millions of images,also constructed by taking nouns from the WordNet tax-onomy. These were translated into different languages, andused as query terms in multiple image search engines to col-lect a large amount of pictures. However, as opposed to thecase of the previously mentioned 80 Million Tiny Imagesdataset, in this case the images were kept at full resolu-tion and the labels were manually verified using AmazonMechanical Turk. Currently, the full ImageNet dataset con-sists of over 17,000 classes and 12 million images. Figure 1shows a few example images from various classes.

Deng et al. (2010) performed classification experimentsusing a substantial subset of ImageNet, including more thanten thousand classes and nine million images. Their experi-ments highlighted the importance of algorithm design whendealing with such quantities of data, and showed that meth-ods believed to be better in small scale experiments turnedout to under-perform when brought to larger scales. Also acost function for classification taking into account the hierar-chy was proposed. In contrast with Deng et al. (2010), mostof the works using ImageNet for large scale classificationmake no use of its hierarchical structure.

As mentioned before, in order to encourage large scaleimage classification using ImageNet, a competition using asubset of 1,000 classes and 1.2 million images, called theImageNet Large Scale Visual Recognition Challenge (Berget al. 2010), was conducted together with the 2010 PascalVisual Object Challenge competition. Notoriously, the bet-ter classified participants of the competition used traditional

Int J Comput Vis

Fig. 1 Example images fromImageNet. Classes range fromvery general to very specific,and since there is only one labelper image, it is not rare to findimages with unannotatedinstances of other classes fromthe dataset

one-versus-all approaches and completely disregarded thetaxonomic information.

Lin et al. (2011) obtained the best score in the compe-tition using a conventional one-vs-all approach. Two state-of-the-art coding and pooling techniques, Local CoordinateCoding and Super-Vector Coding, were used to construct thedescriptor vectors for each image. Finally, averaged stochas-tic gradient descent (ASGD) was used to efficiently train athousand linear SVM classifiers.

Sánchez and Perronnin (2011) got the second best scorein the competition (and a posteriori reported better resultsthan those of Lin et al. 2011). In their approach, they usedhigh-dimensional Fisher Kernels for image representationwith lossy compression techniques: first, dimensionality re-duction using Hash Kernels (Shi et al. 2009) was attemptedand secondly, since the results degraded rapidly with smallerdescriptor dimensionality, coding with Product Quantizers(Jégou et al. 2010) was used to retain the advantages of ahigh-dimensional representation without paying a high pricein terms of memory and I/O usage. To train the standard bi-nary one-vs-all linear classifiers, they also used StochasticGradient Descent.

The difficulty of using hierarchical information for im-proving classification may be explained by the findings ofRussakovsky and Fei-Fei (2010). They showed that in Ima-geNet, the relationships endowed by the WordNet taxonomydo not necessarily correspond to visual similarity, and that infact new relations based only on visual appearance informa-tion can be established between some classes, possibly faraway in the hierarchy.

In contrast with the findings of Russakovsky and Fei-Fei(2010), Deselaers and Ferrari (2011) experimentally vali-dated, to a large degree, the common assumptions that se-mantic categories are visually separable and that visual sim-ilarity is correlated with semantic similarity in the ImageNetdataset; this was achieved by comparing the visual variabil-ity of images within a particular class, measured using GIST

signatures. They also studied the relationship between se-mantic and visual distance, and proposed an image distancemeasure based on ImageNet data, termed the ImageNet Dis-tance in their paper, to assess whether two images contain aninstance of the same base-level category. This distance mea-sures the visual and semantic similarity between the cate-gories associated with the nearest neighbors in ImageNet ofthe images being compared.

A different image distance, also based on the ImageNethierarchy, was proposed by Deng et al. (2011). There, theauthors exploit semantic knowledge in the form of a hier-archy to compute a similarity measure for large scale same-category image retrieval. Provided that training data is avail-able on the nodes of the hierarchy, classifiers are learned andmapped to probability values. Then, the class probabilitiesfor two images can be compared using a ‘cost’ matrix thatpenalizes pairs of classes that have their first common an-cestor higher in the hierarchy.

Kim et al. (2011) proposed a method to decompose animage descriptor into a sparse mixture of training ‘base’ de-scriptors to incorporate hierarchical information. The classlabels of the training descriptors active in the mixing weightsfor the query descriptor can be seen as labels for the image.

Cai and Hofmann (2004) defined a taxonomy over cate-gories in a text classification task, and showed that optimiz-ing a structured loss defined on this taxonomy can improveperformance. Binder et al. (2011) discussed similar learningmethods for multiclass image classification, where trainingclasses and examples are organized in a pre-defined taxon-omy. Local SVM learning methods (where descendants ofa node are used as positive examples, and other nodes aretreated as negative examples) were shown to obtain similarperformance to structured SVMs, while requiring less train-ing time and being highly parallelizable. While not alwaysimproving upon the flat classifiers in terms of the 0/1 loss,the taxonomy-based classifiers consistently achieved lowerhierarchical error, which translates into more meaningful,

Int J Comput Vis

‘human-like’, confusions. Multilabel experiments were alsoperformed in Binder et al. (2011), but always with fully an-notated images and in datasets with about 20 classes.

Blaschko et al. (2010) successfully exploited weakly an-notated data to improve the performance of an object de-tection method framed in a structured output formulation.Annotations indicated the presence or absence of objects inimages and weak information about object location. The lo-cations of bounding boxes were treated as latent variables tobe inferred during training, constrained by the type of anno-tation for each image. With this approach and a single fullannotation, they were able to attain performance compara-ble to that obtained with complete annotation in the INRIApedestrian dataset.

Other works aim to learn hierarchies directly from im-age information. Marszałek and Schmid (2008) proposed amethod to construct a Relaxed Hierarchy DAG which avoidsintroducing aliasing due to hard-assigning a class to a partic-ular branch of the tree, instead postponing the decision untilfewer classes are present by assigning it to both branches.With this approach they show performance comparable toone-versus-all classification, while being sub-linear in com-plexity. Bart et al. (2008) proposed a generative modelingapproach similar to Latent Dirichlet Allocation (LDA) fortaxonomy learning. A category is defined as a mixture oftopics, each representing certain features (like sea, sand, fur,sky, etc.). Then, categories are arranged in a tree structure byperforming inference efficiently with a nonparametric priorover tree structures similar to a nested Chinese restaurantprocess, as used in text modeling. Qualitative results aregiven for the Corel dataset, and the method is shown to im-prove with respect to LDA in the 13-scenes dataset. Simul-taneously (and independently), Sivic et al. (2008) also useda hierarchical LDA model with a nested Chinese restaurantprocess prior to discover a taxonomy of categories from vi-sual information without supervision, and also showed im-provements with respect to plain LDA in a pixel-level seg-mentation task using the MSRC-B1 dataset.

As mentioned in the introduction, the problem of tag pre-diction has many parallels with multi-label classification. Inthe following paragraphs we review some recent work ontag prediction.

Verbeek et al. (2010) used the MIR Flickr dataset to eval-uate their previously proposed TagProp algorithm (Guillau-min et al. 2009). TagProp is a weighted nearest-neighbormodel that propagates tag terms among images in a datasetto obtain a more complete annotation. Different visual fea-tures (e.g. SIFT, color histograms) as well as textual fea-tures derived from the tags were used. When compared tostandard SVMs, the TagProp model performed worse whenprecise manual annotations were used, but better when us-ing noisy Flickr tags as training labels. Tag-derived featuresproved beneficial in terms of accuracy both for SVMs andfor TagProp.

Dimitrovski et al. (2010) performed hierarchical classi-fication on the ImageCLEF dataset using a random forestapproach. An ensemble of Predictive Clustering Trees werelearned using multiple types of features. The tags of a novelquery image are determined by propagating it down the treesand averaging the tag probabilities of the selected leaf fromeach tree of the forest. Despite the simplicity of the ap-proach, it achieved the second position in the 2010 Image-CLEF photo annotation competition out of twelve compet-ing groups.

Bucak et al. (2011) addressed the problem of incom-plete annotations in the context of multi-label learning. Theyproposed a group-lasso based method to train a multi-labelmodel that predicts a ranking of classes given a test image.The method was shown to attain results better than those ofa standard SVM in the 2007 Pascal Visual Object Challenge,the MIR, and the ESP Game datasets.

Mensink et al. (2011) proposed a label prediction systemthat uses structured models to learn the dependencies amongimage labels. In an interactive setting, a small amount ofuser input can be used to significantly improve classifica-tion results by selectively asking questions to the user thatminimize uncertainty in the remaining labels. The proposedmodels also improve the results of plain SVM classifiers ina non-interactive (i.e., automatic) setting, although the per-formance gain is modest. The system was tested in Image-CLEF, as well as in the SUN09 and in the Animals withAttributes datasets.

Wang et al. (2011) proposed a semi-supervised image an-notation method that uses a bi-relational graph. The graphcan be divided into a label correlation subgraph and an im-age similarity subgraph, with an additional bipartite sub-graph defined by class assignments to images. A randomwalk with restarts was used to learn class-to-class and class-to-image relevances. In contrast to related work, asymmetricrelationships between classes are considered (e.g. the prob-ability of road given car is not the same as car given road).Finally, a method to learn these bi-directional probabilitieswas proposed and shown to perform better than the symmet-ric version.

Moran and Lavrenko (2011) modified the Continuous-Space Relevance Model of Lavrenko et al. (2003). Ratherthan using the top-ranked tags for an image, a set of tags ispredicted jointly: their approach increases the probability ofpredicting less likely (but consistent) tags and reduces thatof predicting irrelevant or contradictory, but highly scored,ones.

Other works exploited tags as a form of weak supervi-sion, using them to complement purely visual informationfor image classification.

Guillaumin et al. (2010) showed how images with asso-ciated tags, but for which reliable labels are not known, canbe used to complement a potentially smaller set of images

Int J Comput Vis

with both tags and labels and, ultimately, compute better vi-sual classifiers. The motivation of their work comes from theunderstanding that classifiers that exploit image and (weak)textual information significantly outperform those based onvisual features alone, which makes them suitable for usein semi-supervised learning scenarios. This technique facili-tates training on the large amounts of images available fromonline photo sharing sites for which expensive label infor-mation is not available.

Kawanabe et al. (2011) proposed kernels tailored to tagsassociated with Flickr images to complement and improvevisual-feature-based image classification. The authors buildon the ‘tag kernel’ proposed by Guillaumin et al. (2010) andaddress the issue of sparsity in tag-based feature representa-tions by smoothing using Markov Random Walks over thetags. They showed a small but statistically significant im-provement (according to the Wilcoxon test) over the originaltag kernel formulation.

2 Problem Statement

Our notation is summarized in Table 1. We are given thedataset S = {(x1, Y 1), . . . , (xN ,YN)}, where xn ∈ X de-notes an F -dimensional feature vector representing an im-age with a set of groundtruth labels Yn.1 Our goal is to learna classifier Y (x; θ) that for an image x outputs a set of K dis-tinct object categories. The vector θ parameterizes the clas-sifier Y ; we wish to learn θ so that the labels predicted byY (xn; θ) are ‘similar to’ the training labels Yn under someloss function �(Y (xn; θ),Y n). Our specific choice of classi-fier and loss function shall be given in Sect. 2.1. In short, thegoal and contribution of this paper is to learn the classifierY for precisely the loss function � that is used to measureperformance in the ImageNet Large Scale Visual Recogni-tion Challenge (Berg et al. 2010, or just ‘the ImageNet Chal-lenge’ from now on).

We assume an estimator based on the principle of regu-larized risk minimization, i.e., we aim to find θ∗ such that

θ∗ = argminθ

[1

N

N∑n=1

�(Y

(xn; θ)

, Y n)

︸︷︷︸empirical risk

+ λ

2

∥∥θ2∥∥︸︷︷︸

regularizer

]. (1)

Note that in the case of ImageNet, each image is annotatedwith a single label, while the output space consists of a set

1Note that in McAuley et al. (2011) we assumed that there was only asingle groundtruth label yn for each image, as is the case for ImageNet.In the case of the MIR and ImageCLEF datasets there are a variablenumber (possibly zero) of groundtruth labels for each image, hence thechange of notation.

Table 1 Notation

Notation Description

x The feature vector for an image (or just ‘an image’ forsimplicity)

xn The feature vector for the nth training image

X The feature space, i.e., xn ∈ XF The feature dimensionality, i.e., F = |xn|N The total number of training images

y An image label, consisting of a single object class

Yn The set of groundtruth labels for the image xn

C The set of classes, i.e., Yn ⊆ CC The total number of classes, i.e., C = |C|Y (x; θ) The set of output labels predicted by the classifier

Y (x; θ) The output labels resulting in the most violated con-straints during column-generation

Y n Shorthand for Y (xn; θ)

Y n Shorthand for Y (xn; θ)

K The number of output labels produced by the classifier,i.e., K = |Y n| = |Y n|

Y The space of all possible sets of K labels

θ A vector parameterizing our classifier

θy

binary A binary classifier for the class y

λ A constant that balances the importance of the empiricalrisk versus the regularizer

φ(x, y) The joint parameterization of the image x with the labely

Φ(x,Y ) The joint parameterization of the image x with a set oflabels Y

�(Y,Y n) The error induced by the set of labels Y when the correctlabels are Yn

d(y, yn) A distance measure between the two classes y and yn inour image taxonomy

Zn Latent annotation of the image xn, consisting of K −|Yn| object classes distinct from Yn

Ωn The ‘complete annotation’ of the image xn, i.e., Yn ∪Zn

G The number of groundtruth labels used when we analyzethe effect of our latent variables

of K labels; in the other datasets we study, the annotationmay consist of any number of labels, including none (weuse y to denote a single label, Y to denote a set of labels,and Y to denote the space of sets of K labels). This settingpresents several issues when trying to express Eq. (1) in theframework of large-margin structured prediction: primar-ily, the margin between the prediction and the groundtruthis not well-defined when they are drawn from differentspaces (Tsochantaridis et al. 2005), a problem we discuss inSect. 2.3. Perhaps it is for this reason that many of the state-of-the-art methods in the ImageNet Challenge consisted ofbinary classifiers, such as multiclass SVMs, that merely op-timized the score of a single prediction (Lin et al. 2011;Sánchez and Perronnin 2011).

Int J Comput Vis

Motivated by the surprisingly good performance of thesebinary classifiers, in the following sections we shall proposea learning scheme that will ‘boost’ their performance by re-weighting the dimensions of their parameters so as to takeinto account the structured nature of the loss function fromthe ImageNet Challenge.

2.1 The Loss Function

We begin by defining the loss function for the ImageNetChallenge, in which each image is annotated with a singlelabel Yn = {yn}. Each image may contain multiple objectsthat are not labeled, and the labeled object need not neces-sarily be the most salient, so a method should not be penal-ized for predicting ‘incorrect’ labels in the event that thoseobjects actually appear in the scene. Note that this is not anissue in some similar datasets, such as the Caltech datasets(Griffin et al. 2007), where images have been selected toavoid such ambiguity in the labeling, nor in datasets whereall objects are annotated in every image, as in the PascalVisual Object Challenge (Everingham et al. 2010), or inthe MIR and ImageCLEF datasets which we discuss later(Huiskes et al. 2010; Nowak et al. 2011).

To address this issue, a loss is given over a set of pre-dicted output labels Y , that only penalizes the method ifnone of those labels is similar to the annotated object. Fora training image annotated with a single label Yn = {yn},the loss incurred by predicting the set of labels Y is givenby

�(Y,

{yn

}) = miny∈Y

d(y, yn

). (2)

In principle, d(y, yn) could be any difference measure be-tween the classes y and yn. If d(y, yn) = 1−δ(y = yn) (i.e.,0 if y = yn, 1 otherwise), this recovers the ImageNet Chal-lenge’s ‘flat’ error measure. If d(y, yn) is the shortest-pathdistance from yn to the nearest common ancestor of y andyn in a taxonomic tree, this recovers the ‘hierarchical’ errormeasure (which we shall use in our experiments). WordNetis used to build the taxonomic tree for ImageNet, since it isalso the source of the object vocabulary (Miller 1995). Notethat this error measure is not symmetric: no penalty is in-curred if the prediction y is more specific (with respect tothe taxonomy) than the annotation yn, but a penalty is in-curred if the prediction is too general.

For problems where multiple groundtruth annotations areavailable, we desire a loss function that does not penalizethe method so long as any of the predicted labels are similarto any of the groundtruth labels. Using the same differencemeasure d(y, yn) as in Eq. (2), our loss becomes

�(Y,Y n

) = miny∈Y

minyn∈Yn

d(y, yn

)(3)

(if there are no training annotations we define �(Y,∅) = 0).This is certainly not the only loss function we could choosefor multiple labels, but in our case it is motivated by theproblem of tag recommendation: we are satisfied so longas some plausible tags are suggested to the user. Indeed wecould optimize a number of other loss functions using theframework we describe, for example

�(Y,Y n

) = 1

|Yn|∑

yn∈Yn

miny∈Y

d(y, yn

), (4)

though for our experiments we use the loss of Eq. (3).

2.2 ‘Boosting’ of Binary Classifiers

Many of the state-of-the-art methods for image classificationconsist of learning a series of binary ‘one vs. all’ classifiersthat distinguish a single class from all others. That is, foreach class y ∈ C (where C is the object vocabulary), onelearns a separate parameter vector θ

y

binary, and then performsclassification by choosing the class with the highest score,using a classifier of the following form:

ybinary(x) = argmaxy∈C

⟨x, θ

y

binary

⟩. (5)

In order to predict a set of K labels, such methods simplyreturn the labels with the K highest scores:

Ybinary(x) = argmaxY∈Y

∑y∈Y

⟨x, θ

y

binary

⟩, (6)

where Y is the space of sets of K distinct labels. The aboveequations describe many of the competitive methods fromthe ImageNet Challenge, including Lin et al. (2011) andSánchez and Perronnin (2011).

One obvious improvement is simply to learn a new set ofclassifiers {θy}y∈C that optimize the structured error mea-sure of Eq. (1). However, given the large number of classesin the ImageNet Challenge (|C| = 1000), and the high di-mensionality of standard image features, this would mean si-multaneously optimizing several million parameters, whichin our experience proved impractical in terms of runningtime and performance. For the smaller datasets that we study(MIR and ImageCLEF), it is possible to train the individualclassifiers directly so as to optimize Eq. (1), though as weshall report doing so does not lead to good performance.

Instead, we would like to leverage the already good clas-sification performance of existing binary classifiers, simplyby re-weighting their dimensions to account for the struc-tured nature of Eq. (3). Hence we will learn a single param-eter vector θ that re-weights the parameters of every class.Our proposed learning framework is designed to extend lin-ear classifiers of the form given in Eq. (6). Given a set of

Int J Comput Vis

binary classifiers {θy

binary}y∈C , we propose a new classifierof the form

Y (x; θ) = argmaxY∈Y

∑y∈Y

⟨x � θ

y

binary, θ⟩, (7)

where x � θy

binary is simply the Hadamard product of x (the

feature vector) and θy

binary (the parameter vector). Note thatwhen θ = 1 this recovers precisely the original model ofEq. (6).

To use the standard notation of structured prediction, wedefine the joint feature vector Φ(x,Y ) as

Φ(x,Y ) =∑y∈Y

φ(x, y) =∑y∈Y

x � θy

binary, (8)

so that Eq. (6) can be expressed as


〈Φ(x,Y ), θ〉 (9)

(i.e., the predictor is linear in θ ). We will use the shorthandY n := Y (xn; θ) to avoid excessive notation. In the followingsections we shall discuss how structured prediction methodscan be used to optimize models of this form.

2.3 The Latent Setting

As mentioned, the joint parameterization of Eq. (8) is prob-lematic, since the energy of the groundtruth labeling Yn,〈Φ(xn,Y n), θ〉, is not readily comparable with the energyof the predicted output Y , 〈Φ(xn,Y ), θ〉, due to the fact thatthe size of the two sets is in general different, specifically|Yn| ≤ |Y |.

To address this, we propose the introduction of a set oflatent variables, Z = {Z1 . . .ZN }, which for each image xn

is designed to encode the set of objects that appear in xn

that were not annotated. The full set of labels for the imagexn is now Ωn = Yn ∪ Zn (note that Yn ∩ Zn = ∅). If ourmethod outputs K objects, then we fix |Zn| = K − |Yn|, sothat |Ωn| = K . It is now possible to meaningfully computethe difference between Φ(xn,Y ) and Φ(xn,Ωn), where thelatter is defined as

Φ(xn,Ωn

) =∑y∈Yn

φ(xn, y

) +∑z∈Zn

φ(xn, z

). (10)

The importance of this step shall become clear in Sect. 3.1,Eq. (15). Note that we still define �(Y,Y n) only in terms ofthe training labels Yn, as in Eq. (3).

Following the programme of Yu and Joachims (2009),learning proceeds by alternately optimizing the latent vari-ables and the parameter vector. Optimizing the parame-ter vector θi given the latent variables Zi is addressed inSect. 3.1; optimizing the latent variables Zi given the pa-rameter vector θi−1 is addressed in Sect. 3.2.

3 The Optimization Problem

The optimization problem of Eq. (1) is non-convex. Morecritically, the loss is a piecewise constant function of θ .2

A similar problem occurs when one aims to optimize a0/1 loss in binary classification; in that case, a typicalworkaround consists of minimizing a surrogate convex lossfunction that upper-bounds the 0/1 loss, such as the hingeloss, which gives rise to support vector machines. We willnow see that we can construct a suitable convex relaxationfor the problem defined in Eq. (1).

3.1 Convex Relaxation

Here we use an analogous approach to that of SVMs, no-tably popularized in Tsochantaridis et al. (2005), which opti-mizes a convex upper bound on the structured loss of Eq. (1).The resulting optimization problem is

[θ∗, ξ∗] = argmin

θ,ξ

[1

N

N∑n=1

ξn + λ

2‖θ‖2

](11a)

s.t.⟨Φ

(xn,Ωn

), θ

⟩ − ⟨Φ

(xn,Y

), θ

⟩ ≥ �(Y,Y n

) − ξn

∀n,Y ∈ Y . (11b)

It is easy to see that ξ∗n upper-bounds �(Y n,Y n) (and there-

fore the objective in Eqs. (11a), (11b) upper bounds that ofEq. (1) for the optimal solution). First note that since theconstraints of Eq. (11b) hold for all Y , they also hold forY n. Second, the left hand side of the inequality for Y = Y n

must be non-positive since Y (x; θ) = argmaxY 〈Φ(x,Y ), θ〉.It then follows that ξ∗

n ≥ �(Y n,Y n). This implies that a so-lution of the relaxation is an upper bound on the solutionof the original problem, and therefore the relaxation is well-motivated.

The constraints of Eq. (11b) basically enforce a loss-sensitive margin: θ is learned so that mispredictions Y thatincur some loss end up with a score 〈Φ(xn,Y ), θ〉 that issmaller than the score 〈Φ(xn,Ωn), θ〉 of the ‘correct’ pre-diction Ωn by a margin equal to that loss (minus the slackξn). The formulation is a generalization of support vectormachines for multi-class problems.

There are two options for solving the convex relaxationof Eqs. (11a), (11b). One is to explicitly include all N × |Y |constraints and then solve the resulting quadratic programusing one of several existing methods. This may not befeasible if N × |Y | is too large. In this case, we can usea constraint generation strategy. This consists of iterativelysolving the quadratic program by adding at each iteration

2There are countably many values for the loss but uncountably manyvalues for the parameters, so there are large equivalence classes of pa-rameters that correspond to precisely the same loss.

Int J Comput Vis

the constraint corresponding to the most violated Y for thecurrent model θ and training instance n. This is done bymaximizing the violation gap ξn, i.e., solving at each itera-tion the problem

Y(xn; θ) = argmax

Y∈Y

{�

(Y,Y n

) + ⟨Φ

(xn,Y

), θ

⟩}, (12)

(as before we define Y n := Y (xn; θ) for brevity). The solu-tion to this optimization problem (known as ‘column gen-eration’) is somewhat involved, though it turns out to betractable as we shall see in Sect. 3.3.

Several publicly available tools implement precisely thisconstraint generation strategy. A popular example is Svm-Struct (Tsochantaridis et al. 2005), though we use BMRM(‘Bundle Methods for Risk Minimization’; Teo et al. 2007)in light of its faster convergence properties. Algorithm 1 de-scribes pseudocode for solving the optimization problem ofEqs. (11a), (11b) with BMRM. In order to use BMRM, oneneeds to compute at the optimal solution ξ∗

n for the most vi-olated constraint Y n, both the value of the objective functionEqs. (11a), (11b) and its gradient. At the optimal solution forξ∗n with fixed θ we have

⟨Φ

(xn,Ωn

)θ⟩ − ⟨

Φ(xn, Y n

), θ

⟩ = �(Y n, Y n

) − ξ∗n . (13)

(recall that Ωn is the ‘complete’ annotation consisting ofthe union of the groundtruth and the latent variables). Byexpressing Eq. (13) as a function of ξ∗

n and substituting intothe objective function we obtain the following lower boundon the objective of Eq. (11a):

oi = 1

N

∑n

�(Y n, Y n

) − ⟨Φ

(xn,Ωn

), θ

⟩

+ ⟨Φ

(xn, Y n

), θ

⟩ + λ

2‖θ‖2, (14)

whose gradient with respect to θ is

gi = λθ + 1

N

∑n

(Φ

(xn, Y n

) − Φ(xn,Ωn

)). (15)

The method described above could in principle be used forregularizers other than the 2 norm (see Teo et al. 2007),though we require that the model is linear in θ (i.e., it canbe expressed in the form of Eq. (9)), and that Eq. (12) istractable. Extensions of such approaches exist, for examplekernelized variants are discussed in Yu and Joachims (2008).Here we focus on linear models for efficiency reasons—efficient bundle methods cannot be readily applied to solvethe dual problem (see Teo et al. 2007 for details). We re-fer the reader to Tsochantaridis et al. (2005) and Yu andJoachims (2009) for further discussion of the limitations ofthis type of approach.

3.2 Learning the Latent Variables

To learn the optimal value of θ , we alternate between opti-mizing the parameter vector θi given the latent variables Zi ,and optimizing the latent variables Zi given the parametervector θi−1. Given a fixed parameter vector θ , the optimalvalues of the latent variables Zn can be found greedily; do-ing so is in fact equivalent to performing inference, with therestriction that the true labels Yn cannot be part of the latentvariable Zn (see Algorithm 2, Line 5).

It is shown in Yu and Joachims (2009) that this type ofalternating optimization is a specific instance of a convex-concave procedure (‘CCCP’, Yuille and Rangarajan 2002).What this means in practice is that by alternately optimiz-ing θ and Z as in Algorithms 1 and 2, we will arrive ata local optimum of Eq. (1). Naturally this implies that thealgorithm is sensitive to initialization, though as we noted,setting θ = 1 recovers the already good performance of ourinitial binary classifiers, making this an ideal starting pointfor a local search.

See Yu and Joachims (2009) for further discussion of thistype of approach.

Algorithm 1 Taxonomy Learning

1: Input: training set {(xn,Y n,Zn)}Nn=12: Output: θ

3: θ := 0 {in the setting of Algorithm 2, θ can be ‘hot-started’ with its previous value}

4: repeat5: for n ∈ {1 . . .N} do6: Y n := argmaxY∈Y {�(Y,Y n) + 〈φ(xn,Y ), θ〉} (see

Sect. 3.3)7: end for8: Compute gradient gi Eq. (15)9: Compute objective oi Eq. (14)

10: θ := argminθλ2 ‖θ‖2 + max(0,max

j≤i〈gj , θ〉 + oj )

11: until converged (see Teo et al. 2007)12: return θ

Algorithm 2 Taxonomy Learning with Latent Variables

1: Input: training set {(xn,Y n)}Nn=12: Output: θ

3: θ0 := 14: for i = 1 . . . I {I is the total number of iterations} do5: Zn

i := {argmaxY∈Y 〈Φ(xn,Y ), θ i−1〉} \ Yn {chooseonly K − |Yn| distinct labels}

6: θi := Algorithm 1({(xn,Y n,Zni )}Nn=1)

7: end for8: return θI

Int J Comput Vis

3.3 Column Generation

Given the loss function of Eq. (3), obtaining the most vio-lated constraints (Algorithm 1, Line 6) takes the form

Y n = argmaxY∈Y

{miny∈Y

minyn∈Yn

d(y, yn

) +∑y∈Y

⟨φ(xn, y

), θ

⟩}, (16)

which appears to require enumerating through all Y ∈ Y ,which if there are C = |C| classes amounts to

(CK

)possibili-

ties. However, if we had an oracle which told us that

argminy∈Y n

[min

yn∈Ynd(y, yn

)] = c, (17)

then Eq. (12) becomes

Y n = argmaxY∈Y ′

{min

yn∈Ynd(c, yn

) +∑y∈Y

⟨φ(xn, y

), θ

⟩}, (18)

where Y ′ is just Y restricted to those y for which

minyn∈Yn

d(y, yn

) ≥ minyn∈Yn

d(c, yn

). (19)

The important difference between Eq. (16) and Eq. (18) issimply that the part of Eq. (16) involving the loss has beenreplaced by a constant, minyn∈Yn d(c, yn), which we willshow is enough to make Eq. (18) tractable. Of course wehave yet to determine the value of this constant, though weshow below that this can also be done efficiently.3

We can obtain the optimal solution to Eq. (18) greedilyby sorting 〈φ(xn, y), θ〉 for each class y ∈ C such that

minyn∈Yn

d(y, yn

) ≥ minyn∈Yn

d(c, yn

)(20)

and simply choosing the top K classes. Since we don’tknow the optimal value of c in advance, we must considerall c ∈ C , which means solving Eq. (18) a total of C times(recall that C is the number of classes). Solving Eq. (18)greedily takes O(C logC) (to sort C values), so that solvingEq. (12) takes O(C2 logC).

Although this method works for any loss of the formgiven in Eq. (3), for the specific distance function d(y, yn)

used for the ImageNet Challenge, further improvements arepossible. As mentioned, for the ImageNet Challenge’s hier-archical error measure, d(y, yn) is the shortest-path distancefrom yn to the nearest common ancestor of y and yn in ataxonomic tree. One would expect the depth of such a treeto grow logarithmically in the number of classes, and in-deed we find that we always have d(y, yn) ∈ {0 . . .18} (the

3Note that somewhat simpler notation was used in McAuley et al.(2011), in which there was only a single output label yn, but otherwisethe idea remains the same.

trees for the other datasets we study are shallower). Sincethe number of distinct possibilities for �(Y,Yn) is small, in-stead of enumerating each possible value of

c = argminy∈Y n

[min

yn∈Ynd(y, yn

)], (21)

we can directly enumerate each value of

δ =[

miny∈Y n

minyn∈Yn

d(y, yn

)], (22)

i.e., each possible value for the loss �(Y,Y n). If there areL distinct values of the loss, Eq. (12) can now be solvedin O(LC logC). In ImageNet Challenge we have L = 19whereas C = 1000, so this is clearly a significant improve-ment.

Additional minor improvements can be made, for exam-ple we do not need to sort all C values in order to computethe top K items, and we do not need to re-sort all items foreach value of the loss δ. The implementation used in ourexperiments is available online.4

4 Experiments

4.1 Binary Classifiers

As previously described, our approach needs, for each class,one binary classifier able to provide some reasonable scoreas a starting point for the proposed method. Since the ob-jective of this paper is not beating the state-of-the-art, butrather demonstrating the advantage of our structured learn-ing approach to improve overall classification results, weused a standard, simple image classification setup. As men-tioned, should the one-vs-all classifiers of Lin et al. (2011)or Sánchez and Perronnin (2011) become available in thefuture, they should be immediately compatible with the pro-posed method.

First, images have to be transformed into descriptor vec-tors sensible for classification using machine learning tech-niques. For this we have chosen the very popular Bag ofFeatures model (Csurka et al. 2004): dense SIFT features areextracted from each image xn and quantized using a visualvocabulary of F visual words. Next, the visual words arepooled in a histogram that represents the image. This rep-resentation is widely used in state-of-the-art image classifi-cation methods, and in spite of its simplicity achieves verygood results.

Regarding our basic classifiers, a sensible first choice,considering existing related work, would be to use a Lin-ear SVM for every class. However, since our objective is

4See http://i.stanford.edu/~julian.

http://i.stanford.edu/~julian

Int J Comput Vis

to predict the correct class of a new image, we would needto compare the raw scores attained by the classifier, whichwould not be theoretically satisfying. Although it is possibleto obtain probabilities from SVM scores using a sigmoidtrained with the Platt algorithm, we instead opted to train lo-gistic regressors, which directly give probabilities as outputsand do not depend on a separate validation set.

In order to deal with the computational and memory re-quirements derived from the large number of training im-ages, we used Stochastic Gradient Descent (SGD) from Bot-tou and Bousquet (2008) to train the classifiers. SGD is agood choice for our problem, since it has been shown toachieve performance similar to that of batch training meth-ods in a fraction of the time (Perronnin et al. 2010). Further-more, we validated its performance against that of LibLin-ear in a small-scale experiment using part of the ImageNetChallenge hierarchy with satisfactory results. One limitationof online learning methods is that the optimization processiterations are limited by the amount of training data avail-able. In order to add more training data, we cycled over allof the training data for 10 epochs.

For the MIR and ImageCLEF experiments we used SIFTfeatures based on the implementation of Koen et al. (2010).Due to the more manageable size of these datasets, the clas-sifiers were trained using LibLinear.

Using the above approach, the parameters θy

binary for eachclass used in the structured learning methods in the follow-ing sections were generated.

4.2 Structured Classifiers on ImageNet Data

For our first experiment, we consider structured classifica-tion on the ImageNet Challenge dataset. This dataset con-sists of 1.35 million images (1.2 million for training, the restfor testing), each of which is labeled with a single positiveclass.

For every image xn and every class y we must compute〈φ(xn, y), θ〉. Earlier we defined φ(x, y) = x � θ

y

binary. Ifwe have C classes and F features, then this computationcan be made efficient by first computing the (C ×F) matrixA whose yth row is given by (θ

y

binary � θ). Similarly, if wehave N images then the set of image features can be thoughtof as an (N × F) matrix X. Now the energy of a particularlabeling y of xn under θ is given by the matrix product⟨φ(xn, y

), θ

⟩ = (X × AT

)n,y

. (23)

This observation is critical if we wish to handle a large num-ber of images and feature vectors of high-dimension. In ourexperiments, we performed this computation using Nvidia’shigh-performance BLAS library CUBLAS. Although GPUperformance is often limited by a memory bottleneck, thisparticular application is ideally suited as the matrix X is farlarger than either the matrix A, or the resulting product, and

X needs to be copied to the GPU only once, after which itis repeatedly reused. After this matrix product is computed,we must sort every row, which can be naïvely parallelized.

In light of these observations, our method is no longerprohibitively constrained by its running time (running teniterations of Algorithm 2 takes around one day for a singleregularization parameter λ). Instead we are constrained bythe size of the GPU’s onboard memory, meaning that weonly used 25 % of the training data (half for training, halffor validation). To measure the effect of this limitation, wealso trained our model using the entire training set, usinga parallel implementation on a 64 core machine; here weused Intel’s Math Kernel Library to parallelize the matrixmultiplication step.5

The results of our algorithm using features of dimensionF = 1024 and F = 4096 are shown in Figs. 2 and 3, re-spectively. Here we ran Algorithm 2 for ten iterations, ‘hot-starting’ θi using the optimal result from the previous itera-tion. The reduction in training error is also shown for eachiteration of Algorithm 2, showing that minimal benefits aregained after ten iterations. We used regularization parame-ters λ ∈ {10−1,10−2, . . . ,10−8}, and as usual we report thetest error for the value of λ that resulted in the best perfor-mance on the validation set. We show the test error for dif-ferent numbers of nearest-neighbors K , though the methodwas trained to minimize the error for K = 5.

Interestingly, we see negligible benefit when using theentire training set versus using merely 25 %. In light of thefact that a single round of training (for the 4096 dimensionalfeatures) takes approximately six times longer on the CPU(using non-commodity hardware), we advocate use of theGPU implementation. In the experiments that we considerlater, the datasets are sufficiently small that neither memoryrequirements nor running times present a significant issue.

In both Figs. 2 and 3, we find that the optimal θ is non-uniform, indicating that there are interesting relationshipsthat can be learned between the features when a structuredsetting is used. As hoped, a reduction in test error is obtainedover already good classifiers, though the improvement is lesssignificant for the better-performing high-dimensional clas-sifiers.

Ideally, we would like to apply our method to featuresand classifiers like those of Lin et al. (2011) or Sánchez andPerronnin (2011). It remains to be seen whether the settingwe have described could yield additional benefits over theiralready excellent classifiers.

4.3 Multiple Observed Labels

Our model was designed based on the intuition that the la-tent variables should capture those objects that appear in

5http://software.intel.com/en-us/articles/intel-mkl/.

http://software.intel.com/en-us/articles/intel-mkl/

Int J Comput Vis

Fig. 2 Results for training with 1024 dimensional features on Ima-geNet Challenge data. (a) Feature weights; (b) reduction in trainingerror during each iteration of Algorithm 2; (c) error for different num-bers of nearest-neighbors K (the method was trained to optimize the

error for K = 5). Results are reported for the best value of λ on thevalidation set (here λ = 10−4). Due to the large size of the datasets inquestion, standard errors are �0 for all datapoints

Fig. 3 Results for training with 4096 dimensional features on Ima-geNet Challenge data. (a) Feature weights; (b) reduction in trainingerror during each iteration of Algorithm 2; (c) error for different num-

bers of nearest-neighbors K (the method was trained to optimize theerror for K = 5). Results are reported for the best value of λ on thevalidation set (here λ = 10−6)

an image, but are not present in the groundtruth. However,as we observe only a single (positive) label for each im-age in ImageNet Challenge, we were unable to test this hy-pothesis on that data. In this experiment, we study two tax-

onomies for which a complete labeling is provided for eachimage. By training using only a fraction of the groundtruthlabels, this allows us to examine the extent to which thelatent variables align with those categories withheld from

Int J Comput Vis

Fig. 4 The MIR taxonomy.Twenty-four of the tree’s nodesare the possible labels for eachimage

Fig. 5 A partial view of the ImageCLEF taxonomy. The possible labels for each image are the ninety-nine leaves of the tree

the groundtruth. These datasets also allow us to measure thebenefit obtained by our latent variable model as the labelingbecomes ‘stronger’, i.e., as a more complete groundtruth isprovided.

In the MIR Flickr Retrieval Evaluation (MIR) dataset,25,000 images are labeled into 24 object categories (Huiskeset al. 2010). Although the classifiers discussed in Huiskeset al. (2010) are concerned with optimizing average pre-cision, the object categories are organized into topics andsubtopics, which can be treated as a taxonomy. The taxon-omy we derive is shown in Fig. 4 (see Huiskes et al. 2010,Table 2). Although the small number of categories in thisdataset does not demand the use of a hierarchical loss (thelabels are sufficiently distinct that the 0/1 loss is sufficient),

it is valuable as a first step in identifying the function of thelatent variables in our model.

The ImageCLEF dataset uses a subset of 18,000 imagesfrom the MIR dataset (Nowak and Huiskes 2010). The im-ages are categorized into 99 concepts, forming a richer anddeeper hierarchy than that of the MIR dataset. The taxon-omy for the ImageCLEF dataset is shown in Fig. 5.

In both of these datasets, images may have any numberof labels from the label vocabulary, including none (the MIRand ImageCLEF datasets have up to 14 and 26 labels perimage, respectively). To analyze the role of the latent vari-ables, we randomly select a fixed number of (up to) G labelsfor each image. These G labels form our training annota-tions Yn. Note that when G = 1 we recover precisely the

Int J Comput Vis

Fig. 6 Improvement of learning over non-learning for the MIR andImageCLEF taxonomies as the number of groundtruth labels G in-creases. ‘Testing error’ refers to the average loss across all images inthe test set. The plots are annotated to show the percentage improve-

ment of learning over non-learning. All results are reported for the bestvalue of λ on the validation set. Note that the final two bars for the MIRdataset are almost identical simply due to the fact that few images haveas many as four labels owing to the small size of the label vocabulary

Fig. 7 Example results onsome images from ImageCLEF.Recall that as the ‘correct’labels we randomly choose asingle label from the completegroundtruth. Note that commonclasses such as ‘no blur’ arepredicted more frequently in thehierarchical model, and thatsemantically similar labels (suchas ‘happy’ and ‘funny’, or‘shadow’ and ‘night’), are rarelypredicted together

setting used in the experiment of Sect. 4.2 for the ImageNetChallenge data.

Results for learning on both the MIR and the ImageCLEFdatasets for different values of G are shown in Fig. 6. Twoaspects of these results are interesting at first glance: firstly,the improvement of learning over non-learning is the mostsignificant in the MIR taxonomy, in spite of our previouscomment that a loss derived from this taxonomy, being theshallowest, most closely resembles the simpler 0/1 loss. Infact, on further inspection we discover that among all of ourexperiments, the largest improvements are achieved in the

smallest taxonomies. However, it should be noted that theloss of Eq. (3) still differs from the 0/1 loss due to the factthat we predict multiple labels simultaneously, so this find-ing merely reveals that our model is better able to leveragethe structured nature of the loss when the label vocabularyis smaller.

Figure 7 shows the results on some example images fromImageCLEF, for G = 1 (i.e., we randomly choose a sin-gle label from the complete groundtruth for training). Somecommon patterns emerge which explain how our latent vari-able model is able to leverage the structured nature of the

Int J Comput Vis

loss function: Firstly, higher scores are given to extremelycommon classes such as ‘no blur’, which are rarely givenhigh scores by the original binary classifiers. Secondly, theoriginal classifiers often predict semantically similar labels(such as ‘happy’ and ‘funny’, or ‘shadow’ and ‘night’),whereas more semantically distinct classes are chosen afterlearning.

Secondly, we note that as G increases (meaning that thegroundtruth becomes more complete), the relative improve-ment of learning over non-learning increases in almost allcases (naturally the absolute value of the loss decreases forlarger G, as the problem becomes easier due to the ‘min’ inEq. (3)). This is an interesting result, since for images withfully-labeled groundtruth, the latent variables no longer playany readily interpretable role. Again we note that no matterwhat the values of the latent variables, we still gain a consid-erable advantage in all cases simply due to the fact that weare optimizing the correct loss. Ultimately, what this doesimply is that while the 0/1 loss (or some surrogate such asthe hinge loss) appears to be a reasonable proxy for the lossof Eq. (3) when G = 1, the importance of optimizing thecorrect loss becomes more important when it is a functionof multiple groundtruth labels.

In Fig. 8 we assess whether the predicted values of the la-tent variables Zn match those groundtruth annotations thatwere withheld from Yn. If we have C classes and G′ =min(G, |Yn|) groundtruth labels (i.e., we have G labels ex-cept when the complete annotation Yn has fewer than G la-bels), then the number of ways of predicting the remainingK − P labels is(

C − G′K − G′

). (24)

The number of ways that c of them can be ‘correct’ (i.e.,they appear in the withheld groundtruth) is(

K − G′c

)︸︷︷︸

correct predictions

×(

C − G′ − c

K − G′ − c

)︸︷︷︸incorrect predictions

. (25)

Thus the expected number of correct predictions made by arandom classifier is∑K−G′

c=1 c ×(

K−G′c

)(K−G′

c

)×

(C−G′−c

K−G′−c

)(

C−G′K−G′

) . (26)

During each iteration of Algorithm 2, we measure the frac-tion of withheld groundtruth labels that appear in the la-tent variable Zn for each image, normalized by Eq. (26).When G = 1 (as is the case in ImageNet Challenge), wesee that for both the MIR and ImageCLEF datasets, the la-tent variables gradually align with the unannotated objects,as we had expected. However, on the MIR dataset, for all

G > 1, we see that after an initial period of alignment, the la-tent variables gradually become less similar to the withheldgroundtruth; apparently the model is better able to lever-age the latent variables by assigning them some role otherthan matching the withheld groundtruth. On the ImageCLEFdataset, for G < 4, the latent variables actually agree withthe groundtruth less than would be expected of a randomclassifier, implying that the latent variables play some otherrole altogether.

Although it is indeed somewhat difficult to interpret thecomplex dynamics of the upper 8 plots in Fig. 8, an intu-itive picture emerges after a sufficient number of iterations.The two plots in the bottom of Fig. 8 show at the tenth iter-ation, the agreement between the latent variables and with-held groundtruth increases with G (the number of traininglabels). We observe a monotonic improvement, which indi-cates that the more labels we know, the better we can predictthe missing labels. This is intuitive in the sense that it agreeswith the fact that we use a structured loss that accounts fordependencies between the predicted labels.

4.4 Image Tagging in a Taxonomy

Also provided in the MIR dataset are the Flickr tags foreach image. The loss function of Eq. (3) is a natural onefor the problem of tag recommendation, as one is satisfiedso long as some reasonable tags are suggested to the user. Itis also natural to penalize ‘incorrect’ tags using a taxonomy,as the space of all possible tags is too large for the 0/1 lossto be practical (in the 25,000 images in the MIR dataset,there are around 20,000 unique tags). Such a dataset dif-fers from those of Sects. 4.2 and 4.3: while the groundtruthmay contain multiple labels per image (as with MIR andImageCLEF), it may still be incomplete (as with ImageNetChallenge), in the sense that we are never certain whether animage couldn’t have been assigned a particular tag.

The problem of automatically deriving a taxonomy fromimage tags is studied in Setia and Burkhardt (2007), thoughfor the current experiment it is simpler to manually assignthe most commonly used Flickr tags to existing concepts inWordNet (Miller 1995).

Among the 200 most popular tags in the MIR dataset,152 correspond to readily identifiable concepts in WordNet(of the other 48, many are camera brands or non-Englishwords). Having identified corresponding concepts in Word-Net, we can define our loss function much as is done forthe ImageNet Challenge, i.e., by taking the shortest pathdistance in the WordNet taxonomy from the correct tagto the nearest common ancestor of the predicted and thecorrect tag. On average, each of the 25,000 images con-tains 1.44 of these 152 tags. A selection of the tags, andtheir relationships via homonymns in WordNet are shown inFig. 9.

Int J Comput Vis

Fig. 8 The above plots measure how closely the latent variables matchthose labels withheld from the groundtruth during training. The y-axismeasures the fraction of withheld groundtruth labels that appear in thelatent indicator Zi (normalized by the expected number of correct pre-dictions made by a random classifier). The eight upper plots show thisquantity as a function of the iteration of Algorithm 2, while the two

bottom plots show this quantity for iteration 10, as a function of thenumber of known labels G. The two bottom plots reveal that the levelof agreement between the predicted latent variables and the withheldgroundtruth increases monotonically in G, which is expected since ourstructured loss models dependencies between the predicted labels

Fig. 9 A partial view of the taxonomy defined on MIR tags. The nodeson the bottom level are (the concepts corresponding to) the tags ob-served in the MIR dataset, while the nodes on the top level are the

common ancestors via which they are related using our loss function(i.e., each edge represents a path in the WordNet hierarchy, which mayhave length greater than one)

Int J Comput Vis

Results for learning on the MIR tag taxonomy are shownin Figs. 10 and 11. In Fig. 10 we use the experimentalsetup from Sect. 4.3, i.e., we withhold some fraction ofthe groundtruth labels so as to analyze the effect of train-ing with ‘stronger’ groundtruth information; results are sim-ilar to those reported in Sect. 4.3. In Fig. 11 we use theexperimental setup from Sect. 4.2, i.e., we simply use allof the available training evidence. Here we predict K = 10nearest-neighbors, which in practice could represent sug-gesting ten tags to a user in a tag-recommendation system.Learning achieves an improvement over non-learning of ap-proximately 12 % when predicting ten tags, a more substan-

Fig. 10 Improvement of learning over non-learning for the taxonomydefined over MIR tags. All results are reported for the best value of λ

on the validation set

tial improvement than what we observed for ImageNet Chal-lenge.

4.5 Optimization of Flat Losses

In order to optimize the performance measure used in theImageNet Challenge, we had to account for both the hier-archical nature of the loss, as well as the fact that multiplepredictions can be made simultaneously. Although we havedemonstrated the benefit of optimizing such performancemeasures directly, it is unclear to which of these two aspectsthe improvement owes. One way to test this is to see whetherwe can achieve similar gains using our latent variable modelusing ‘flat’ (i.e., 0/1) losses.

Recall that the error measure used to evaluate perfor-mance in the ImageNet Challenge took the form

�(Y,

{yn

}) = miny∈Y

d(y, yn

), (27)

where d(y, yn) was some difference measure between a pre-dicted label y and the groundtruth label yn. So far we haveassumed that d(y, yn) measured the shortest-path distancefrom yn to the nearest common ancestor of y and yn in ataxonomic tree, corresponding to the ImageNet Challenge’s

Fig. 11 Results for training with 1024 dimensional features on MIRtag data. (a) Feature weights; (b) reduction in training error duringeach iteration of Algorithm 2; (c) error for different numbers ofnearest-neighbors K (the method was trained to optimize the errorfor K = 10). Results are reported for the best value of λ on the val-

idation set (here λ = 10−5). Note that although Algorithm 2 shouldproduce a monotonic decrease in training error (see Yuille and Ran-garajan 2002), we occasionally observe increase in training error whenAlgorithm 1 fails to converge

Int J Comput Vis

Fig. 12 Improvement of learning over non-learning using the Ima-geNet Challenge’s ‘flat’ evaluation measure on all three datasets. Allresults are reported for the best value of λ on the validation set

‘hierarchical’ performance measure. The ImageNet Chal-lenge’s ‘flat’ performance measure replaces this by

d(y, yn

) = 1 − δ(y = yn

)(28)

(i.e., 0 if y = yn, 1 otherwise). In the context of Eq. (27) thismeans that a loss of 0 is achieved so long as the groundtruthlabel yn appears in Y , and 1 otherwise.

Optimizing the ImageNet Challenge’s ‘flat’ performancemeasure using our latent variable model simply means re-placing the hierarchical difference measure used in the pre-vious experiments by that of Eq. (28). Results on the Im-ageNet, CLEF, and MIR datasets, using 1024 dimensionalfeatures are shown in Fig. 12. The results are similar (interms of percentage improvement) to what we reported inSects. 4.2 and 4.3 (except for ImageNet, where the improve-ment is smaller). Again the error is smallest on the MIRdataset simply due to the fact that it has the smallest labelvocabulary.

To study the converse problem of optimizing the hierar-chical performance measure without introducing latent vari-ables, we use the hierarchical difference measure from theprevious experiments, but optimize the performance of onlya single prediction. As in the previous experiments, we eval-uate the method using the hierarchical performance measurefrom ImageNet. We found that doing so did not improveover the non-learning performance (i.e., the performance ofthe original binary predictors). This is not surprising, sincewe are no longer optimizing the same performance measureon which the method is evaluated.

It should be noted that Cai and Hofmann (2004) success-fully reported the benefit of directly optimizing hierarchi-cal performance measures for multiclass document classi-fication problems, and Binder et al. (2011) reported simi-larly promising results for problems from computer vision.However, their experiments differ critically from what wepresent here due to the fact that their datasets have a smalllabel vocabulary and typically only a single plausible labelper document (meaning that they can directly evaluate theiralgorithms on a single prediction).

4.6 Single-Stage Learning

In the case of ImageNet, we made the argument that it is notfeasible to learn parameters for all classes simultanetouslyin a single unified stage, due to the high-dimensionality ofthe parameters involved (e.g. learning all parameters for theclassifiers in Sect. 4.2 would require optimizing millions ofparameters simultaneously). On the other hand, MIR andImageCLEF have only 24 and 99 categories (respectively),so that learning all parameters simultaneously ought to befeasible, at least with respect to running time.

Recall that in Eq. (7) we assumed a classifier of the form


∑y∈Y

⟨x � θ

y

binary, θ⟩, (29)

where θy

binary was assumed to be given as input from a pre-vious learning stage, so that our algorithm only had to learnthe weighting factor θ . A single stage approach replaces thisclassifier by one of the form

Y ′(x; θ) = argmaxY∈Y

∑y∈Y

⟨x, θ

y

binary

⟩, (30)

where each θy

binary is a parameter vector to be learned. Notethat this model is linear in the concatenation of all modelparameters, (θ1

binary, . . . , θCbinary), meaning that it is amenable

to the same structured learning approaches we described inSect. 3.

In principle the model of Eq. (30) is a generalization ofthe original model of Eq. (7), though we found that trainingthe model of Eq. (30) led to inferior performance on bothMIR and ImageCLEF data. There are several possible expla-nations for this phenomenon: Firstly, given that the numberof parameters exceeds the number of training images, over-fitting is certainly a problem that may be alleviated by a two-stage approach; even for ImageNet, where there are over onemillion images for training, this problem would persist dueto the high number of classes. Secondly, in a two-stage ap-proach the base classifiers θ

y

binary are trained using methodsthat differ significantly from the max-margin objective ofEq. (11a).

In terms of running time the model of Eq. (30) was alsoinferior to that of Eq. (7), leading to an approximately 20-fold running time increase in the case of MIR, and a 50-foldincrease in the case of ImageCLEF. The same experiment onthe ImageNet Challenge data proved impractical in terms ofrunning time.

5 Discussion

The proposed model leads to a reduction in error for all ofthe hierarchical labeling problems we considered in Sect. 4.

Int J Comput Vis

The fact that we achieve the largest benefits in the small-est hierarchies is possibly explained by the findings of Denget al. (2009), who show that in large hierarchies, confusiontends to occur between semantically similar classes, evenwhen optimizing a 0/1 loss. Alternately, for small hierar-chies the classes are more semantically distinct, so that opti-mizing the hierarchical loss changes the predictions signifi-cantly. The fact that we achieve the largest benefits when wehave additional groundtruth labels simply reflects the factthat the loss we are optimizing is structured and takes intoaccount label dependencies through the hierarchy.

It is interesting to discover that the smallest benefit isobtained when we have a large label vocabulary and onlya single groundtruth label, as is the case for the ImageNetChallenge. This may be an indication that the hierarchicalinformation is not informative in such cases, so that the 0/1loss becomes a good proxy for the hierarchical loss in ques-tion. The fact that the best performing methods from the Im-ageNet Challenge competition in terms of the 0/1 loss werealso the best performing in terms of the hierarchical loss pro-vides weak evidence for this assertion. Alternately, it maysimply reflect the fact that this is the version of the problemthat has received the most attention, so that the state-of-the-art binary classifiers are sufficiently accurate as to be able toovercome the weakness of using the incorrect loss.

Image tags derived from image hosting websites likeFlickr are a natural source of weakly-labeled data from largevocabularies, where hierarchical losses are sensible sincemany tags corresponding to semantically similar conceptscan be used interchangeably. Indeed we discovered that mostof the commonly used tags in Flickr correspond to readilyindentifiable concepts from WordNet, obviating the need tolearn a hierarchy directly from the data. However, an obvi-ous danger of using a hierarchial loss is that certain typesof tags, while semantically similar, are not interchangable.For instance, while ‘San Francisco’ and ‘New York City’are visually and semantically close (in terms of the Word-Net hierarchy), a tag recommendation system that suggestsone in place of the other would not be useful. Optimizingfor error measures that are sensitive to this fact remains anavenue for future work.

One key limitation of our formulation is that we applythe same re-weighting for the parameter vectors of all cate-gories. This assumption makes optimization more tractableand leads to an efficient solution. However this is suboptimalin the sense that we would like to have re-weightings thatare to some extent dedicated to different categories, whileat the same time avoiding the need to have multiple inde-pendent re-weighing vectors for each category. This wouldpossibly suggest a strategy that shares parameters across dif-ferent categories, in the spirit of Lampert et al. (2009). Ob-taining a scalable algorithm in this setting however wouldbe a challenge, and is left as future work.

6 Conclusion

Large-scale, collaboratively labeled image datasets embed-ded in a taxonomy naturally invite the use of robust, struc-tured losses, in the sense that they should account for boththe hierarchical nature of the taxonomy and for the incon-sistencies in the labeling process. However, on datasets suchas ImageNet, the state-of-the-art methods still use one-vs-allclassifiers, which do not account for the structured nature ofsuch losses, nor for the imperfect nature of the annotation.We have outlined the computational challenges involved inusing structured methods, shedding some light on why theyhave not been used before in this task. By exploiting a num-ber of computational tricks, and by using recent advances onstructured learning with latent variables, we have been ableto formulate learning in this task as the optimization of aloss that is both structured and robust to weak labeling. Bet-ter yet, our method leverages existing one-vs-all classifiers,essentially by re-weighting, or ‘boosting’ their dimensionsto directly account for the structured nature of the loss. Inpractice this leads to improvements in the hierarchical lossof already good one-vs-all classifiers.

Acknowledgements Part of this work was carried out while bothArnau Ramisa and Tibério Caetano were at INRIA Grenoble, Rhône-Alpes, and while Julian McAuley was at NICTA and the ANU. NICTAis funded by the Australian Government as represented by the Depart-ment of Broadband, Communications and the Digital Economy andthe Australian Research Council through the ICT Centre of Excellenceprogram. The work was partially funded by the QUAERO project sup-ported by OSEO, the French State agency for innovation and MICINNunder the project MIPRCV Consolider Ingenio CSD2007-00018.

References

Bart, E., Porteous, I., Perona, P., & Welling, M. (2008). Unsupervisedlearning of visual taxonomies. In IEEE conference on computervision and pattern recognition.

Berg, A., Deng, J., & Fei-Fei, L. (2010). ImageNet large scalevisual recognition challenge 2010. http://www.image-net.org/challenges/LSVRC/2010/index.

Binder, A., Müller, K.-R., & Kawanabe, M. (2011). On taxonomies formulti-class image categorization. International Journal of Com-puter Vision, 1–21.

Blaschko, M., Vedaldi, A., & Zisserman, A. (2010). Simultaneous ob-ject detection and ranking with weak supervision. In Advances inneural information processing systems.

Bottou, L., & Bousquet, O. (2008). The tradeoffs of large scale learn-ing. In Advances in neural information processing systems.

Bucak, S. S., Jin, R., & Jain, A. K. (2011). Multi-label learning withincomplete class assignments. In IEEE conference on computervision and pattern recognition.

Cai, L., & Hofmann, T. (2004). Hierarchical document categorizationwith support vector machines. In Conference on information andknowledge management.

Csurka, G., Dance, C. R., Fan, L., Willamowski, J., & Bray, C. (2004).Visual categorization with bags of keypoints. In ECCV workshopon statistical learning in computer vision.

http://www.image-net.org/challenges/LSVRC/2010/index

http://www.image-net.org/challenges/LSVRC/2010/index

Int J Comput Vis

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009).ImageNet: a large-scale hierarchical image database. In IEEEconference on computer vision and pattern recognition.

Deng, J., Berg, A. C., Li, K., & Fei-Fei, L. (2010). What does clas-sifying more than 10,000 image categories tell us? In Europeanconference on computer vision.

Deng, J., Berg, A. C., & Fei-Fei, L. (2011). Hierarchical semantic in-dexing for large scale image retrieval. In IEEE conference on com-puter vision and pattern recognition.

Deselaers, T., & Ferrari, V. (2011). Visual and semantic similarity inimagenet. In IEEE conference on computer vision and patternrecognition.

Dimitrovski, I., Kocev, D., Loskovska, S., & Džeroski, S. (2010). De-tection of visual concepts and annotation of images using ensem-bles of trees for hierarchical multi-label classification. In Recog-nizing patterns in signals, speech, images and videos (pp. 152–161).

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisser-man, A. (2010). The Pascal visual object classes (VOC) challenge.International Journal of Computer Vision, 88(2), 303–338.

Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object cate-gory dataset. Technical report 7694, California Institute of Tech-nology.

Guillaumin, M., Mensink, T., Verbeek, J., & Schmid, C. (2009). Tag-Prop: discriminative metric learning in nearest neighbor modelsfor image auto-annotation. In International conference on com-puter vision.

Guillaumin, M., Verbeek, J., & Schmid, C. (2010). Multimodal semi-supervised learning for image classification. In IEEE conferenceon computer vision and pattern recognition.

Huiskes, M. J., & Lew, M. S. (2008). The MIR Flickr retrieval evalu-ation. In International conference on multimedia information re-trieval.

Huiskes, M. J., Thomee, B., & Lew, M. S. (2010). New trends and ideasin visual concept detection: the MIR Flickr retrieval evaluationinitiative. In International conference on multimedia informationretrieval.

Jégou, H., Douze, M., & Schmid, C. (2010). Product quantization fornearest neighbor search. IEEE Transactions on Pattern Analysisand Machine Intelligence, 33(1), 117–128.

Kawanabe, M., Binder, A., Muller, C., & Wojcikiewicz, W. (2011).Multi-modal visual concept classification of images via Markovrandom walk over tags. In IEEE workshop on applications of com-puter vision.

Kim, B. S., Park, J. Y., Mohan, A., Gilbert, A., & Savarese, S. (2011).Hierarchical classification of images by sparse approximation. InBritish machine vision conference.

Lampert, C. H., Nickisch, H., & Harmeling, S. (2009). Learning todetect unseen object classes by Between-class attribute transfer.

Lavrenko, V., Manmatha, R., & Jeon, J. (2003). A model for learn-ing the semantics of pictures. In Advances in neural informationprocessing systems.

Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., & Yu, K. (2011). Large-scale image classification: fast feature extraction and SVM train-ing. In IEEE conference on computer vision and pattern recogni-tion.

Marszałek, M., & Schmid, C. (2008). Constructing category hierar-chies for visual recognition. In European conference in computervision.

McAuley, J., Ramisa, A., & Caetano, T. (2011). Optimization of robustloss functions for weakly-labeled image taxonomies: an ImageNet

case study. In Energy minimization methods in computer visionand pattern recognition.

Mensink, T., Verbeek, J., & Csurka, G. (2011). Learning structured pre-diction models for interactive image labeling. In IEEE conferenceon computer vision and pattern recognition.

Miller, G. A. (1995). WordNet: A lexical database for English. Com-munications of the ACM, 38, 39–41.

Moran, S., & Lavrenko, V. (2011). Optimal tag sets for automatic im-age annotation. In British machine vision conference.

Nowak, S., & Huiskes, M. J. (2010). New strategies for image anno-tation: overview of the photo annotation task at ImageCLEF. InCLEF (notebook Papers/LABs/Workshops).

Nowak, S., Nagel, K., & Liebetrau, J. (2011). The CLEF 2011 photoannotation and concept-based retrieval tasks. Working Notes ofCLEF.

Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the Fisherkernel for large-scale image classification. In European confer-ence on computer vision.

Russakovsky, O., & Fei-Fei, L. (2010). Attribute learning in large-scaledatasets. In ECCV workshop on parts and attributes.

Sánchez, J., & Perronnin, F. (2011). High-Dimensional signature com-pression for Large-Scale image classification. In IEEE conferenceon computer vision and pattern recognition.

Setia, L., & Burkhardt, H. (2007). Learning taxonomies in large imagedatabases. In ACM SIGIR workshop on multimedia informationretrieval.

Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., &Vishwanathan, V. (2009). Hash kernels. In Artificial intelligenceand statistics.

Sivic, J., Russell, B. C., Zisserman, A., Freeman, W. T., & Efros, A. A.(2008). Unsupervised discovery of visual object class hierarchies.In IEEE conference on computer vision and pattern recognition.

Teo, C. H., Smola, A., Vishwanathan, S. V. N., & Le, Q. V. (2007). Ascalable modular convex solver for regularized risk minimization.In Knowledge discovery and data mining.

Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny im-ages: a large data set for nonparametric object and scene recogni-tion. IEEE Transactions on Pattern Analysis and Machine Intelli-gence, 30(11), 1958–1970.

Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005).Large margin methods for structured and interdependent outputvariables. Journal of Machine Learning Research, 6, 1453–1484.

van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010).Evaluating color descriptors for object and scene recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence,32(9), 1582–1596.

Verbeek, J., Guillaumin, M., Mensink, T., & Schmid, C. (2010). Imageannotation with TagProp on the MIR flickr set. In Internationalconference on multimedia information retrieval.

Wang, H., Huang, H., & Ding, C. (2011). Image annotation using bi-relational graph of images and semantic labels. In IEEE confer-ence on computer vision and pattern recognition.

Yu, C.-N., & Joachims, T. (2008). Training structural svms with kernelsusing sampled cuts. In Knowledge discovery and data mining.

Yu, C.-N., & Joachims, T. (2009). Learning structural SVMs with la-tent variables. In International conference on machine learning.

Yuille, A., & Rangarajan, A. (2002). The concave-convex procedure(CCCP). In Advances in neural information processing systems.

optimization of robust loss functions for weakly-labeled

Documents