Accounting for the Relative Importance of Objects in Image Retrieval

Sung Ju Hwang and Kristen Grauman
University of Texas at Austin

Image retrieval

[Figure: a query image matched against an image database (Image 1, Image 2, …, Image k)]

Content-based retrieval from an image database…

Relative importance of objects

[Figure: a query image and candidate images from the image database, annotated with object labels (cow, bird, water, sky, fence, mud)]

Which image is more relevant to the query?


An image can contain many different objects, but some are more “important” than others.

[Figure: example image with labeled objects: sky, water, mountain, architecture, bird, cow]


Some objects are background



Some objects are less salient



Some objects are more prominent or perceptually define the scene


Our goal

Goal: Retrieve those images that share important objects with the query image.


How to learn a representation that accounts for this?

Idea: image tags as importance cue

The order in which a person assigns tags provides implicit cues about the importance of each object to the scene.

TAGS: Cow, Birds, Architecture, Water, Sky

Learn this connection to improve cross-modal retrieval and CBIR.

Related work

Previous work using tagged images focuses on the noun ↔ object correspondence:

Duygulu et al. 02, Fergus et al. 05, Li et al. 09, Berg et al. 04, Lavrenko et al. 2003, Monay & Gatica-Perez 2003, Barnard et al. 2004, Schroff et al. 2007, Gupta & Davis 2008, …

Related work building richer image representations from “two-view” text+image data:

Bekkerman & Jeon 07, Qi et al. 09, Quack et al. 08, Quattoni et al. 07, Yakhnenko & Honavar 09, …

Gupta et al. 08, Blaschko & Lampert 08, Hardoon et al. 04

Approach overview: Building the image database

Tagged training images (example tag-lists: Cow, Grass; Horse, Grass; Car, House, Grass, Sky; …)

Extract visual and tag-based features.

Learn projections from each feature space into a common “semantic space”.

Approach overview: Retrieval from the database

Queries: an untagged query image, or a tag-list query (e.g., Cow, Tree, Grass)

Results from the image database: retrieved images, or a retrieved tag-list (e.g., Cow, Tree)

• Image-to-image retrieval
• Image-to-tag auto annotation
• Tag-to-image retrieval

Dual-view semantic space

Visual features and tag-lists are two views generated by the same concept.

Learning mappings to semantic space

Canonical Correlation Analysis (CCA): choose projection directions that maximize the correlation of views projected from the same instance.

Semantic space: new common feature space for View 1 and View 2

Kernel Canonical Correlation Analysis

Linear CCA: Given paired data, select directions so as to maximize the correlation between the projected views.

Kernel CCA: Given a pair of kernel functions, optimize the same objective, but with projections in kernel space.

[Akaho 2001, Fyfe et al. 2001, Hardoon et al. 2004]
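The equations on this slide did not survive extraction. As a reference point, the standard CCA and kernel CCA objectives from the cited literature (e.g., Hardoon et al. 2004) can be written as below. The symbols (x_i, y_i, w_x, w_y, alpha, beta, K_x, K_y) are notation assumed here and may differ from the slide's; in practice the kernel form is usually solved with a regularization term, which is omitted in this sketch.

```latex
% Linear CCA: paired data \{(x_1, y_1), \dots, (x_N, y_N)\},
% with covariance matrices C_{xx}, C_{yy} and cross-covariance C_{xy}
\max_{w_x, w_y} \;
\frac{w_x^\top C_{xy} w_y}
     {\sqrt{w_x^\top C_{xx} w_x}\,\sqrt{w_y^\top C_{yy} w_y}}

% Kernel CCA: kernels k_x(x_i, x_j) = \langle \phi_x(x_i), \phi_x(x_j) \rangle
% and k_y(y_i, y_j) = \langle \phi_y(y_i), \phi_y(y_j) \rangle,
% with projections w_x = \sum_i \alpha_i \phi_x(x_i), \; w_y = \sum_i \beta_i \phi_y(y_i)
\max_{\alpha, \beta} \;
\frac{\alpha^\top K_x K_y \beta}
     {\sqrt{\alpha^\top K_x^2 \alpha}\,\sqrt{\beta^\top K_y^2 \beta}}
```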

Building the kernels for each view

Tag view: word frequency and rank kernels

Visual view: visual kernels

Visual features

Color Histogram: captures the HSV color distribution

Gist: captures the total scene structure [Torralba et al.]

Visual Words: captures local appearance (k-means on DoG+SIFT)

Average the component χ2 kernels to build a single visual kernel.
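A minimal sketch of the kernel averaging step, assuming each feature type is a non-negative histogram per image. The exponentiated χ2 form, the bandwidth `gamma`, and the random stand-in features are assumptions for illustration; the slide does not give these details.

```python
import numpy as np

def chi2_kernel(A, B, gamma=1.0, eps=1e-10):
    """Exponentiated chi-square kernel between the rows of A and B.

    gamma is a placeholder bandwidth (an assumption, not from the slides).
    """
    diff = A[:, None, :] - B[None, :, :]
    summ = A[:, None, :] + B[None, :, :]
    # Pairwise chi-square distance: sum_d (a_d - b_d)^2 / (a_d + b_d)
    dist = np.sum(diff ** 2 / (summ + eps), axis=2)
    return np.exp(-gamma * dist)

def average_kernel(feature_sets):
    """Average the per-feature chi-square kernels into a single kernel."""
    kernels = [chi2_kernel(F, F) for F in feature_sets]
    return np.mean(kernels, axis=0)

# Hypothetical stand-ins for the three visual features of 5 images
rng = np.random.default_rng(0)
color_hist = rng.random((5, 8))
gist = rng.random((5, 16))
visual_words = rng.random((5, 32))

K_visual = average_kernel([color_hist, gist, visual_words])
```

The same `average_kernel` helper could be applied to the tag features to form the tag kernel.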

http://ilab.usc.edu/wiki/index.php/Image:HSV_Color_Cone.png

Tag features

Word Frequency: traditional bag-of-(text)words

Example tag-list: Cow, Bird, Water, Architecture, Mountain, Sky

tag            count
Cow            1
Bird           1
Water          1
Architecture   1
Mountain       1
Sky            1
Car            0
Person         0

Tag features

Absolute Rank: absolute rank in this image’s tag-list

tag            value
Cow            1
Bird           0.63
Water          0.50
Architecture   0.43
Mountain       0.39
Sky            0.36
Car            0
Person         0
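The word-frequency and absolute-rank features can be sketched as follows. The slide does not state a formula for the absolute-rank value; the 1/log2(1 + r) form used here is inferred from the example values (1, 0.63, 0.50, 0.43, 0.39, 0.36 for ranks 1 to 6) and should be read as an assumption. The vocabulary is the one shown in the example.

```python
import math

VOCAB = ["Cow", "Bird", "Water", "Architecture", "Mountain", "Sky", "Car", "Person"]

def word_frequency(tags, vocab=VOCAB):
    """Bag-of-words count of each vocabulary word in the tag-list."""
    return [tags.count(w) for w in vocab]

def absolute_rank(tags, vocab=VOCAB):
    """Rank-based value per word; 0 for words absent from the tag-list."""
    feats = []
    for w in vocab:
        if w in tags:
            r = tags.index(w) + 1  # 1-based rank of the first occurrence
            feats.append(1.0 / math.log2(1 + r))  # assumed form, see lead-in
        else:
            feats.append(0.0)
    return feats

tags = ["Cow", "Bird", "Water", "Architecture", "Mountain", "Sky"]
freq = word_frequency(tags)
rank = [round(v, 2) for v in absolute_rank(tags)]
```

Note that the two tag-lists "Cow, Sky" and "Sky, Cow" have identical word-frequency vectors but different absolute-rank vectors, which is the ordering information the rank features are meant to capture.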

Tag features

Relative Rank: percentile rank obtained from the rank distribution of that word in all tag-lists.

tag            value
Cow            0.9
Bird           0.6
Water          0.8
Architecture   0.5
Mountain       0.8
Sky            0.8
Car            0
Person         0

Average the component χ2 kernels to build a single tag kernel.
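One way to compute a percentile-style relative rank is sketched below. The exact percentile definition is not given on the slide; here the value is assumed to be the fraction of training occurrences of the word that appear at the same rank or later, so a word tagged earlier than usual scores higher. The training tag-lists are hypothetical.

```python
from collections import defaultdict

def rank_distributions(tag_lists):
    """Collect, for each word, the 1-based ranks it takes across all tag-lists."""
    ranks = defaultdict(list)
    for tags in tag_lists:
        for r, w in enumerate(tags, start=1):
            ranks[w].append(r)
    return ranks

def relative_rank(tags, ranks, vocab):
    """Fraction of observed ranks of each word that are >= its rank in this list."""
    feats = []
    for w in vocab:
        observed = ranks.get(w)
        if w in tags and observed:
            r = tags.index(w) + 1
            feats.append(sum(1 for o in observed if o >= r) / len(observed))
        else:
            feats.append(0.0)
    return feats

# Hypothetical training tag-lists
training = [["Cow", "Grass"], ["Grass", "Cow"], ["Sky", "Grass", "Cow"], ["Cow", "Sky"]]
ranks = rank_distributions(training)
vocab = ["Cow", "Grass", "Sky", "Car"]
feats = relative_rank(["Grass", "Cow"], ranks, vocab)
```

In this toy example "Grass" is tagged first, which is as early as it ever appears in the training lists, so it receives the maximum value, while "Cow" at rank 2 receives a middling one.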

Recap: Building the image database

[Figure: the visual feature space and the tag feature space are both mapped into the semantic space]

Experiments

We compare the retrieval performance of our method with two baselines:

• Visual-Only Baseline
• Words+Visual Baseline: KCCA semantic space [Hardoon et al. 2004, Yakhnenko et al. 2009]

[Figure: for each baseline, a query image and its 1st retrieved image]

Evaluation

We use Normalized Discounted Cumulative Gain at top K (NDCG@K) to evaluate retrieval performance [Kekalainen & Jarvelin, 2002]:

• Reward term: score for the pth ranked example
• Normalization: sum of all the scores for the perfect ranking
• Doing well in the top ranks is more important.
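The slide's formula is not recoverable from the extracted text. A minimal sketch of NDCG@K in its common form is given below, assuming an exponential gain 2^s - 1 and a log2(1 + p) position discount; the function names and the example scores are illustrative.

```python
import math

def dcg_at_k(scores, k):
    """Discounted cumulative gain over the top-k reward scores, in ranked order."""
    return sum(
        (2 ** s - 1) / math.log2(1 + p)
        for p, s in enumerate(scores[:k], start=1)
    )

def ndcg_at_k(scores, k):
    """DCG normalized by the DCG of the perfect (descending) ranking."""
    ideal = dcg_at_k(sorted(scores, reverse=True), k)
    return dcg_at_k(scores, k) / ideal if ideal > 0 else 0.0

# Reward scores of the database items, listed in the order they were retrieved
retrieved_scores = [0.2, 0.9, 0.5]
score = ndcg_at_k(retrieved_scores, k=3)
perfect = ndcg_at_k([0.9, 0.5, 0.2], k=3)
```

Because of the position discount, placing a high-reward item at rank 1 contributes more than placing it at rank 3, which is why doing well in the top ranks matters most.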

Evaluation

We present the NDCG@k score using two different reward terms:

• Object presence/scale: rewards similarity of the query’s objects/scales and those in the retrieved image(s).
• Ordered tag similarity: rewards similarity of the query’s ground truth tag ranks (relative rank, absolute rank) and those in the retrieved image(s).

[Figure: example tag-lists: Cow, Tree, Grass and Person, Cow, Tree, Fence, Grass]

Dataset

LabelMe
• 6352 images (Database: 3799 images; Query: 2553 images)
• Scene-oriented
• Contains the ordered tag lists via labels added
• 56 unique taggers
• ~23 tags/image

Pascal
• 9963 images (Database: 5011 images; Query: 4952 images)
• Object-central
• Tag lists obtained on Mechanical Turk
• 758 unique taggers
• ~5.5 tags/image

Image-to-image retrieval

We want to retrieve images most similar to the given query image in terms of object importance.

[Figure: untagged query image, image database (tag-list kernel space, visual kernel space), retrieved images]
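Once the query and the database images have been projected into the semantic space, retrieval reduces to a nearest-neighbor search. The sketch below assumes the projections are already available as vectors and ranks by cosine similarity; the similarity measure and the toy vectors are assumptions, since the slides do not specify them.

```python
import numpy as np

def retrieve(query_vec, database_vecs, top_k=3):
    """Return indices and similarities of the top_k database vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    D = database_vecs / np.linalg.norm(database_vecs, axis=1, keepdims=True)
    sims = D @ q  # cosine similarity of each database item to the query
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

# Hypothetical semantic-space coordinates for 4 database images and 1 query
database = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
query = np.array([1.0, 0.05])
idx, sims = retrieve(query, database, top_k=2)
```

The same routine would serve tag-to-image retrieval if the query vector came from projecting a tag-list instead of an image.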

Image-to-image retrieval results

[Figure: for each query image, the images retrieved by our method, Words + Visual, and Visual only]

Our method better retrieves images that share the query’s important objects, by both measures.

[Plots: retrieval accuracy measured by object+scale similarity; retrieval accuracy measured by ordered tag-list similarity]

39% improvement

Tag-to-image retrieval

We want to retrieve the images that are best described by the given tag list.

[Figure: query tags (Cow, Person, Tree, Grass), image database, retrieved images]

Tag-to-image retrieval results

Our method better respects the importance cues implied by the user’s keyword query.

31% improvement

Image-to-tag auto annotation

We want to annotate the query image with ordered tags that best describe the scene.

[Figure: untagged query image, image database, output tag-lists (Cow, Tree, Grass; Cow, Grass; Field, Cow, Fence)]
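The slides do not say how the tag-lists of the retrieved neighbors are combined into a single ordered output. One simple possibility, given purely as an illustrative assumption, is to score each tag by summing a rank-discounted weight over the k nearest neighbors' tag-lists and sort by that score. The function name and the 1/log2(1 + r) weight are hypothetical choices.

```python
import math
from collections import defaultdict

def merge_tag_lists(neighbor_tag_lists, max_tags=3):
    """Merge neighbors' ordered tag-lists into one ordered tag-list."""
    scores = defaultdict(float)
    for tags in neighbor_tag_lists:
        for r, w in enumerate(tags, start=1):
            # Earlier tags count more (assumed weighting, not from the slides)
            scores[w] += 1.0 / math.log2(1 + r)
    # Sort by descending score, breaking ties alphabetically
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:max_tags]

# Tag-lists of the k = 3 nearest database images (taken from the figure above)
neighbors = [["Cow", "Tree", "Grass"], ["Cow", "Grass"], ["Field", "Cow", "Fence"]]
predicted = merge_tag_lists(neighbors)
```

Here "Cow" appears early in all three neighbor lists and therefore leads the merged tag-list.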

Image-to-tag auto annotation results

[Figure: example images with output tag-lists]

Boat, Person, Water, Sky, Rock
Bottle, Knife, Napkin, Light, Fork
Person, Tree, Car, Chair, Window
Tree, Boat, Grass, Water, Person

Method        k=1     k=3     k=5     k=10
Visual-only   0.0826  0.1765  0.2022  0.2095
Word+Visual   0.0818  0.1712  0.1992  0.2097
Ours          0.0901  0.1936  0.2230  0.2335

k = number of nearest neighbors used

Implicit tag cues as localization prior

[Hwang & Grauman, CVPR 2010]

Training: Learn object-specific connection between localization parameters and implicit tag features.

Testing: Given novel image, localize objects based on both tags and appearance, combining the object detector with P(location, scale | tags) from the implicit tag features.

[Figure: example tag-lists for the object "Mug": Woman, Table, Mug, Ladder; Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it; Computer, Poster, Desk, Screen, Mug, Poster; Mug, Eiffel; Desk, Mug, Office; Mug, Coffee]

Conclusion

• We want to learn what is implied (beyond objects present) by how a human provides tags for an image.

• Approach requires minimal supervision to learn the connection between importance conveyed by tags and visual features.

• Consistent gains over:
  • content-based visual search
  • tag+visual approach that disregards importance

THANK YOU
