Post on 19-Jan-2016

Visual Sense Disambiguation: A Multimodal Approach

PhD thesis by Kate Saenko

Computer Science and AI Lab

Massachusetts Institute of Technology

Advisor: Trevor Darrell

PhD defense by Kate Saenko 2

The challenge of large scale object recognition

• How to get examples of 10,000+ categories?
– Collection of training images is time-consuming, subjective
– But the Web has billions of images!

• How to build precise models based on unlabeled image data?

• How to learn visual models on the fly, based on user input?

Multimodal context

speech text

image

collective knowledge

Main Contributions

• Proposed a method that combines images and spoken utterances to learn object models

• Developed an unsupervised approach that learns visual models from unlabeled images, text, and dictionaries

This is a bag…

… The Tote is the perfect example of two handbag design principles that ... The lines of this tote are incredibly sleek, but ... The semi buckles that form the handle attachments are ...

This is a bag

Noun
• bag, container (a flexible container with a single opening)
• bag, handbag, pocketbook, purse (a container used for carrying money and small personal items or accessories (especially by women))
• bag, travelling bag, suitcase (a rectangular container for carrying clothes)

Outline

• Audio-visual object recognition
– Related work
– Fusion model and experiments*

• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects

*see Saenko and Darrell. Object category recognition using probabilistic fusion of speech and image classifiers. MLMI, 2007.

Audio-visual object recognition

speech text

image

dictionary

Task: object recognition with audio-visual input*

Speech recognizer

Speech DB

*e.g. BIRON robot, see S. Li and B. Wrede. “Why and how to model multi-modal interaction for a mobile robot companion.” In Proc. AAAI, 2007.

lamp lamp lamp lamp

label <lamp>+

Speech, image can be ambiguous…

a pan...

That’s a pen!

Copy machine..

ant → fan

face → bass

piano → cannon

Proposal: use both channels to help disambiguate underlying word

object recognizer

Fusion of speech and image classifiers

Object Classifier

Speech recognizer

Speech DB

Image Classifier

Image DB

lamp

• Improve existing method by using both modalities

• Explore late fusion of classifier outputs

– Mean rule
– Product rule
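
The two late-fusion rules named above can be sketched as follows. This is a minimal illustration, assuming each classifier outputs a posterior distribution over the same label set. The function name `fuse` and the example probabilities are hypothetical and are not taken from the thesis experiments.

```python
from math import prod


def normalize(scores):
    """Rescale non-negative scores so they sum to one."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


def fuse(posteriors, rule="product"):
    """Late fusion of per-modality posteriors over a shared label set.

    posteriors: list of dicts mapping label -> probability (assumed format).
    rule: "mean" averages the posteriors, "product" multiplies them.
    """
    labels = posteriors[0].keys()
    if rule == "mean":
        fused = {c: sum(p[c] for p in posteriors) / len(posteriors) for c in labels}
    elif rule == "product":
        fused = {c: prod(p[c] for p in posteriors) for c in labels}
    else:
        raise ValueError(f"unknown fusion rule: {rule}")
    return normalize(fused)


# Hypothetical example: the speech channel slightly prefers "pan",
# while the image channel clearly prefers "pen".
speech = {"pan": 0.5, "pen": 0.4, "fan": 0.1}
image = {"pan": 0.1, "pen": 0.7, "fan": 0.2}

for rule in ("mean", "product"):
    fused = fuse([speech, image], rule=rule)
    print(rule, max(fused, key=fused.get), fused)
```

In this toy case both rules pick "pen", because the confident image posterior outweighs the ambiguous speech posterior. The product rule penalizes labels that either modality considers unlikely more strongly than the mean rule does.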

Experiments with 101 objects

• Asked users to speak object name for Caltech101, added noise
• Plot shows benefit from fusion across noise levels

Remaining issues…

object recognizer

Unsupervised object models

speech text

image

dictionary

Next

• Audio-visual object recognition
– Related work
– Fusion model and experiments

• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects

How can we learn a rich variety of visual concepts?

Image Sense Disambiguation

Would rather watch… Suicide watch

Hurricane, tornado watch

Watch out!

Celebrity watch

Text contexts

icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel s wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website

Latent Dirichlet allocation (LDA) (Blei et al. ’03)

• One of several techniques for discovering latent dimensions in bag-of-words data

Topic 1: rolex service repair battery omega replica tag heuer breitling swiss replace gucci button price band

Topic 2: new world media right said house april obama islam march bush war american time

[Figure: LDA plate diagram with variables α, θ, z, w, β, φ and plates K, M, N_d; P(w|z) links words to topics and P(z|d) links topics to documents]
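
To make the two quantities P(w|z) and P(z|d) concrete, here is a minimal sketch of LDA fitted with a collapsed Gibbs sampler in pure Python. It is an illustration only: the function name `lda_gibbs`, the hyperparameter values, and the toy documents (built from words in the two example topics) are assumptions, and the slides do not say which inference procedure or implementation was used at this point.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (p_w_given_z, p_z_given_d):
      p_w_given_z[k][w] estimates P(w | z=k),
      p_z_given_d[d][k] estimates P(z=k | d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    ndk = [[0] * num_topics for _ in docs]      # document-topic counts
    nkw = [[0] * V for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                       # tokens assigned to each topic
    z = []                                      # topic assignment per token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        z.append(assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove the current token from the counts.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (ndk[d][j] + alpha) * (nkw[j][v] + beta) / (nk[j] + V * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    p_w_given_z = [
        {w: (nkw[k][word_id[w]] + beta) / (nk[k] + V * beta) for w in vocab}
        for k in range(num_topics)
    ]
    p_z_given_d = [
        [(ndk[d][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return p_w_given_z, p_z_given_d


# Toy documents (hypothetical), using words from the two example topics.
docs = [
    ["rolex", "omega", "repair", "battery", "swiss", "rolex"],
    ["obama", "bush", "war", "media", "world"],
    ["rolex", "battery", "replace", "band", "price"],
    ["world", "media", "war", "april", "house"],
]
p_w_given_z, p_z_given_d = lda_gibbs(docs, num_topics=2)
for k, topic in enumerate(p_w_given_z):
    top_words = sorted(topic, key=topic.get, reverse=True)[:5]
    print(f"Topic {k}:", " ".join(top_words))
```

Each row of `p_z_given_d` is a document's position in the latent topic space, which is the representation the later slides build on.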

Latent Topics

icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website

Overview of approaches to web-based object model learning

• Some learn only from image features
– (Li et al. 07) bootstrap from labeled images
– (Fergus et al. 05) select correct image topic

• Some incorporate text features
– (Schroff et al. 07) use a category-independent text classifier
– (Berg and Forsyth 06) ask user to sort text topics

• None address polysemy directly
– (Loeff et al. 06) do image sense discrimination, not identification

• All rely on labeled images of correct sense

Next

• Audio-visual object recognition
– Related work
– Fusion model and experiments

• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model*
– Concrete WISDOM: identifying tangible objects

*see Saenko and Darrell. Unsupervised learning of visual sense models for polysemous words. NIPS 2008.

How can we ground image senses in the absence of labeled examples?

WORDNET: Noun
• S: (n) watch, ticker (a small portable timepiece)
• S: (n) watch (a period of time (4 or 2 hours) during which some of a ship's crew are on duty)
• S: (n) watch, vigil (a purposeful surveillance to guard or observe)
• S: (n) watch (the period during which someone (especially a guard) is on duty)
• S: (n) lookout, lookout man, sentinel, sentry, watch, spotter, scout, picket (a person employed to keep watch for some anticipated event)
• S: (n) vigil, watch (the rite of staying awake for devotional purposes (especially on the eve of a religious festival))

WIKIPEDIA: Watch may also refer to:
• Watch system, a period of work duty
• Tropical cyclone warnings and watches, alerts issued to coastal areas threatened by severe storms
• Watch (Unix), a Unix command
• Watch (TV channel), a TV station launching in Autumn 2008
• Watch (computer programming)
• Help:Watching pages on Wikipedia
• Watch (dog), name of the pet dog in the Boxcar Children

D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL, 1995.

Sense-specific classifier

training images

Web Image Sense DictiOnary Model

Search Engine Watch
Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search ...
searchenginewatch.com/ - 38k

watch - MDC
Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in ...
developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch - 30k

dictionary definitions

unlabeled text

dictionary model P( sense | data)

WISDOM does:
1. image sense disambiguation
2. dataset collection
3. classification of unseen images

noun

web images

fosil wrist watch a800 x 628 - 107k - jpg

amgmedia.com

watch-1(ticker)

WISDOM: Using dictionary entries to ground senses

• Use entry text to learn a probability distribution over words for that sense

• Problem: entries contain very little text
– Expand by adding synonyms, example sentences, etc.
– Still, very few words are covered!

• S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
– direct hyponym / full hyponym
• S: (n) house mouse, Mus musculus (brownish-grey Old World mouse now a common household pest worldwide)
• S: (n) harvest mouse, Micromyx minutus (small reddish-brown Eurasian mouse inhabiting e.g. cornfields)
• S: (n) field mouse, fieldmouse (any nocturnal Old World mouse of the genus Apodemus inhabiting woods and fields and gardens)
• S: (n) nude mouse (a mouse with a genetic defect that prevents them from growing hair and also prevents them from immunologically rejecting human cells and tissues; widely used in preclinical trials)
• S: (n) wood mouse (any of various New World woodland mice)
– direct hypernym / inherited hypernym / sister term
• S: (n) rodent, gnawer (relatively small placental mammals having a single pair of constantly growing incisor teeth specialized for gnawing)

WISDOM: Probabilistic dictionary-based model

• Main idea:

– Using LDA, learn latent sense-like dimensions on a large amount of related text,

– Model dictionary senses in LDA space:

• Map image contexts to topics
• Map topics to senses

Search Engine Watch
Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search ...
searchenginewatch.com/ - 38k

watch - MDC
Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in ...
developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch - 30k

unlabeled text

LDA

WISDOM sense model

• Given a query word with sense s with values in set {1,…,S}, and a text document d, the probability of sense is

[Figure: graphical model over d, z, s with plate N]

• Define the likelihood of topic z given sense s with entry words e_s = w_1,…,w_{E_s} as

• To compute probability of sense given topic
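
The formulas for these three steps are not present in this transcript, so the following is only a sketch under assumed forms, not the thesis's exact definitions. It assumes that P(s|d) is obtained by marginalizing over topics, P(s|d) = Σ_z P(s|z) P(z|d); that P(z|s) is proportional to the average of P(w|z) over the sense's entry words; and that P(s|z) follows from Bayes' rule with a uniform prior over senses. All function names and the toy numbers are hypothetical.

```python
def normalize(values):
    """Rescale a list of non-negative numbers to sum to one (uniform if all zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def topic_given_sense(entry_words, p_w_given_z):
    """Assumed form: P(z|s) proportional to the mean of P(w|z) over entry words."""
    raw = [
        sum(topic.get(w, 0.0) for w in entry_words) / len(entry_words)
        for topic in p_w_given_z
    ]
    return normalize(raw)


def sense_given_topic(p_z_given_s, prior):
    """Bayes' rule per topic: P(s|z) proportional to P(z|s) P(s)."""
    senses = list(p_z_given_s)
    num_topics = len(p_z_given_s[senses[0]])
    table = []
    for z in range(num_topics):
        posterior = normalize([p_z_given_s[s][z] * prior[s] for s in senses])
        table.append(dict(zip(senses, posterior)))
    return table  # table[z][s] approximates P(s | z)


def sense_given_doc(p_s_given_z, p_z_given_d):
    """Marginalize over topics: P(s|d) = sum_z P(s|z) P(z|d)."""
    return {
        s: sum(p_s_given_z[z][s] * p_z_given_d[z] for z in range(len(p_z_given_d)))
        for s in p_s_given_z[0]
    }


# Toy topic-word distributions (hypothetical numbers): one "timepiece" topic
# and one "guard duty" topic.
p_w_given_z = [
    {"rolex": 0.4, "wrist": 0.3, "battery": 0.2, "guard": 0.05, "duty": 0.05},
    {"rolex": 0.05, "wrist": 0.05, "battery": 0.1, "guard": 0.4, "duty": 0.4},
]
entries = {"timepiece": ["wrist", "battery"], "vigil": ["guard", "duty"]}
prior = {s: 1.0 / len(entries) for s in entries}  # uniform prior (assumption)

p_z_given_s = {s: topic_given_sense(words, p_w_given_z) for s, words in entries.items()}
p_s_given_z = sense_given_topic(p_z_given_s, prior)

# A document whose text context loads mostly on the first topic.
p_sense = sense_given_doc(p_s_given_z, [0.9, 0.1])
print(p_sense)
```

The point of the sketch is the direction of the mapping: dictionary entry words place each sense in topic space, and a document's topic proportions are then converted into sense probabilities.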

WISDOM: Incorporating Image Features

• Use LDA to discover visual topics v=1,…,L,

• Then estimate the conditional probability P(s|v)

• Given a test image di*, we can compute

• Combine contributions of image and text:

WISDOM classifier

SVM classifier

training images

Search Engine Watch
Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search ...
searchenginewatch.com/ - 38k

watch - MDC
Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in ...
developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch - 30k

dictionary definitions

unlabeled text

dictionary model P( sense | data)

noun

web images

fosil wrist watch a800 x 628 - 107k - jpg

amgmedia.com

watch-1(ticker)

Evaluation datasets

• Collected by querying Image Search
– MIT-ISD: bass, face, mouse, speaker, watch

– MIT-OFFICE: cellphone, fork, hammer, keyboard, mug, pliers, scissors, stapler, telephone, watch

– UIUC-ISD: bass, crane, squash

[Figure: example images with labels core, related, unrelated]

Experimental Setup

1. Task: Image sense disambiguation (ISD) in search results
– Separate images according to visual sense

– “core” labels are positive class, “related” and “unrelated” negative

– Metrics: true positives vs. false positives (ROC), recall-precision curve (RPC)

2. Task: object classification in a novel image
– Classify image as having correct object category or not

– “core” labels are positive class, other keyword’s “core” senses are negative class

– Metric: percent correct
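
The ranking metrics named for the ISD task can be computed as in the sketch below. It assumes each image has a sense score (higher means more likely "core"), a binary label (1 for "core", 0 for "related" or "unrelated"), distinct scores, and at least one image of each class. The function name and the example values are hypothetical.

```python
def roc_and_rpc(scores, labels):
    """Sweep a threshold down the ranked list and record both curves.

    Returns (roc, rpc):
      roc is a list of (false positive rate, true positive rate) points,
      rpc is a list of (recall, precision) points.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    roc, rpc = [], []
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        roc.append((fp / num_neg, tp / num_pos))
        rpc.append((tp / num_pos, tp / (tp + fp)))
    return roc, rpc


# Hypothetical scores for five search results; 1 marks a "core" image.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
roc, rpc = roc_and_rpc(scores, labels)
print("ROC points:", roc)
print("RPC points:", rpc)
```

For the second task, percent correct is simply the fraction of test images whose predicted class matches the label.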

ISD example results

squash: sports

squash: vegetable

bass: musical instrument

bass: fish

bass: raw web image data

squash: raw web image data

ISD Results: ROC using each WordNet sense for BASS

[Figure: ROC plots for BASS, true positive rate vs. false positive rate; curves: yahoo, musical range, polyph. range, male singer, sea bass, freshwater bass, basso/voice, instrument, spiny fish]

ISD Results: RPC using true sense

yahoo wisdom

Retrieval of core senses on UIUC-ISD

Results: object classification

• Baseline approach:
– Automatically generate sense-specific keywords from WordNet
– Append word to synonyms and direct hypernyms
– Limit queries to 3 terms
– Example: mouse + computer, mouse + electronic device

• Plot shows average accuracy across five objects in the MIT-ISD dataset (each is a two-class problem with chance performance of 50%)

[Figure: accuracy (55% to 85%) vs. number of training images (50 to 300); curves: baseline, wisdom]
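
The baseline query generation described in the bullets above can be sketched as follows. This is an illustration under assumptions: the synonym and hypernym lists are typed in by hand rather than read from WordNet, the function name `sense_queries` is hypothetical, and "limit queries to 3 terms" is interpreted here as truncating each query to its first three words.

```python
def sense_queries(word, synonyms, hypernyms, max_terms=3):
    """Build sense-specific search queries for one dictionary sense.

    Each query is the ambiguous word followed by the extra terms of one
    synonym or direct hypernym, truncated to max_terms words.
    """
    queries = []
    for phrase in synonyms + hypernyms:
        # Drop the query word itself so "computer mouse" contributes "computer".
        extra = [t for t in phrase.lower().split() if t != word]
        terms = ([word] + extra)[:max_terms]
        query = " ".join(terms)
        # Skip phrases that add nothing, and avoid duplicate queries.
        if len(terms) > 1 and query not in queries:
            queries.append(query)
    return queries


# Hand-written entry for the computer-device sense of "mouse".
print(sense_queries(
    "mouse",
    synonyms=["mouse", "computer mouse"],
    hypernyms=["electronic device"],
))
```

With this entry the sketch produces the two queries given as the example in the bullets, "mouse computer" and "mouse electronic device".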

Next

• Audio-visual object recognition
– Related work
– Fusion model and experiments

• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects*

*see Saenko and Darrell, Filtering Abstract Senses From Image Search Results, NIPS 2009.

Query Word: “cup”

Online Dictionary: search for “cup” (Noun)

• cup (a small open container usually used for drinking; usually has a handle) "he put the cup back in the saucer"; "the handle of the cup was missing"

• cup, loving cup (a large metal vessel with two handles that is awarded as a trophy to the winner of a competition) "the school kept the cups in a special glass case"

• a major sporting event or competition “the world cup”, “the Stanley cup”

Concrete WISDOM

Object Sense: drinking container

Abstract Sense: sporting event

Object Sense: loving cup (trophy)

Removing Abstract Senses

mouse

rodent

beaver

mammal

cow…

How can we identify abstract senses?

Mouse: Noun
• <noun.animal> S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
• <noun.state> S: (n) shiner, black eye, mouse (a swollen bruise caused by a blow to the eye)
• <noun.person> S: (n) mouse (person who is quiet or timid)
• <noun.artifact> S: (n) mouse, computer mouse (a hand-operated electronic device that controls the coordinates of a cursor on your computer screen…)

• Idea: use the ontological information available via WordNet
– semantic relations between concepts (hypernym, part, etc.)
– lexical tags (e.g. <noun.animal>, <noun.artifact>)
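
A minimal sketch of filtering by lexical tag, using the four senses of "mouse" listed above. The set of tags treated as concrete is an assumption made for this example (in particular, treating <noun.person> as non-concrete is a choice of the sketch, not something the slides state), and the hypernym-based part of the idea is not shown. The `Sense` class and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sense:
    lemmas: str
    lexical_tag: str
    gloss: str


# Assumed set of WordNet lexical tags that denote physical, depictable things.
CONCRETE_TAGS = frozenset({
    "noun.animal", "noun.artifact", "noun.food", "noun.plant", "noun.object",
})


def concrete_senses(senses, concrete_tags=CONCRETE_TAGS):
    """Keep only senses whose lexical tag is in the assumed concrete set."""
    return [s for s in senses if s.lexical_tag in concrete_tags]


MOUSE = [
    Sense("mouse", "noun.animal", "any of numerous small rodents"),
    Sense("shiner, black eye, mouse", "noun.state", "a swollen bruise caused by a blow to the eye"),
    Sense("mouse", "noun.person", "person who is quiet or timid"),
    Sense("mouse, computer mouse", "noun.artifact", "a hand-operated electronic device"),
]

for sense in concrete_senses(MOUSE):
    print(sense.lexical_tag, "|", sense.lemmas)
```

Under these assumptions the animal and artifact senses are kept, and the bruise and timid-person senses are dropped.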

Experimental Setup

Table: Concrete Senses Identified by WISDOM

• Task: ISD using concrete-sense WISDOM
– all “core” and “related” labels of keyword are positive class, “unrelated” labels are negative class

Results: Filtering visual senses

Yahoo Search: “telephone”

DICTIONARY

1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)

2: (n) telephone, telephony (transmitting speech at a distance)

Results: Filtering visual senses

Artifact sense: “telephone”

DICTIONARY

1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)

2: (n) telephone, telephony (transmitting speech at a distance)

Results: RPC of all concrete senses

Retrieval of core+related concrete senses on UIUC-ISD

yahoo wisdom

Further Improvement: Topic adaptation

• Original LDA topics are learned on text-only unlabeled data
• Adapt to image-text data via semi-supervised Gibbs sampling
• E.g.: one of “fork” topics:

Original topic: product bike null tool tube seal set price oil knife spoon spring ship use item accessory handle shop order remove store custom home weight steel supply cap clamp fit false . . .

Adapted topic: cutlery knife spoon product set price handle steel tool item stainless null bike tube seal oil knive kitchen utensil ship order use table spring supply design piece carve weight shop . . .

“fork”: using original topics

unrelated: fork lift

road fork

bike fork

knife…

“fork”: using adapted topics

unrelated: fork lift

road fork

bike fork

knife…

Results on MIT-OFFICE

• The average area under the RPC improves from 0.47 to 0.57
• Detailed RPCs:

yahoo wisdom

Conclusions

• Showed that combining speech with image input may be advantageous for object recognition

• Presented WISDOM, an unsupervised method to learn sense-specific object models from images and text harvested from the web

• Extended WISDOM to filter out non-physical word senses based on WordNet semantic structure

Future work: WISDOM-enabled interactive training

speech text

image

dictionary

supervised classifier

WISDOM
