Visual Sense Disambiguation: A Multimodal Approach PhD thesis by Kate Saenko Computer Science and AI Lab Massachusetts Institute of Technology Advisor: Trevor Darrell


Page 1

Visual Sense Disambiguation: A Multimodal Approach

PhD thesis by Kate Saenko

Computer Science and AI Lab

Massachusetts Institute of Technology

Advisor: Trevor Darrell

Page 2

The challenge of large scale object recognition

• How to get examples of 10,000+ categories?
  – Collection of training images is time-consuming, subjective
  – But the Web has billions of images!

• How to build precise models based on unlabeled image data?

• How to learn visual models on the fly, based on user input?

Page 3

Multimodal context

speech text

image

collective knowledge

Page 4

Main Contributions

• Proposed a method that combines images and spoken utterances to learn object models

• Developed an unsupervised approach that learns visual models from unlabeled images, text, and dictionaries

This is a bag…

… The Tote is the perfect example of two handbag design principles that ... The lines of this tote are incredibly sleek, but ... The semi buckles that form the handle attachments are ...

Page 5

This is a bag

Page 6

Noun
• bag, container (a flexible container with a single opening)
• bag, handbag, pocketbook, purse (a container used for carrying money and small personal items or accessories (especially by women))
• bag, travelling bag, suitcase (a rectangular container for carrying clothes)

bag

bag

bag

bag

… The Tote is the perfect example of two handbag design principles that ... The lines of this tote are incredibly sleek, but ... The semi buckles that form the handle attachments are ...

Page 7

Outline

• Audio-visual object recognition
  – Related work
  – Fusion model and experiments*
• Unsupervised text and image models
  – Related work
  – WISDOM: probabilistic dictionary-based image sense model
  – Concrete WISDOM: identifying tangible objects

*see Saenko and Darrell. Object category recognition using probabilistic fusion of speech and image classifiers. MLMI, 2007.

Page 8

Audio-visual object recognition

speech text

image

dictionary

Page 9

Task: object recognition with audio-visual input*

Speech recognizer

Speech DB

*e.g. BIRON robot, see S. Li and B. Wrede. “Why and how to model multi-modal interaction for a mobile robot companion,” In Proc. AAAI, 2007.

lamp lamp lamp lamp

label <lamp>+

Page 10

Speech, image can be ambiguous…

a pan...

That’s a pen!

Copy machine..

ant → fan

face → bass

piano → cannon

Page 11

Proposal: use both channels to help disambiguate underlying word

object recognizer

Page 12

Fusion of speech and image classifiers

Object classifier

Speech recognizer

Speech DB

Image classifier

Image DB

lamp

• Improve existing method by using both modalities

• Explore late fusion of classifier outputs

  – Mean rule
  – Product rule
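
A minimal sketch of the two late-fusion rules named above, assuming each classifier outputs a posterior distribution over the same label set. The function names and the toy posteriors are illustrative assumptions, not taken from the thesis.

```python
from math import prod


def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def mean_rule(posteriors):
    """Average the per-classifier posteriors for each label."""
    labels = posteriors[0].keys()
    return normalize(
        {c: sum(p[c] for p in posteriors) / len(posteriors) for c in labels}
    )


def product_rule(posteriors):
    """Multiply the per-classifier posteriors for each label, then renormalize."""
    labels = posteriors[0].keys()
    return normalize({c: prod(p[c] for p in posteriors) for c in labels})


# Hypothetical posteriors: speech slightly prefers "pan", the image prefers "pen".
speech = {"pan": 0.5, "pen": 0.4, "lamp": 0.1}
image = {"pan": 0.2, "pen": 0.6, "lamp": 0.2}

fused_mean = mean_rule([speech, image])
fused_product = product_rule([speech, image])
best_mean = max(fused_mean, key=fused_mean.get)
best_product = max(fused_product, key=fused_product.get)
```

In this toy case the speech channel alone would pick "pan", while both fused distributions pick "pen". The product rule penalizes a label more strongly when either channel assigns it a low probability.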

Page 13

Experiments with 101 objects

• Asked users to speak object name for Caltech101, added noise
• Plot shows benefit from fusion across noise levels

Page 14

Remaining issues…

object recognizer

Page 15

Unsupervised object models

speech text

image

dictionary

Page 16

Next

• Audio-visual object recognition
  – Related work
  – Fusion model and experiments
• Unsupervised text and image models
  – Related work
  – WISDOM: probabilistic dictionary-based image sense model
  – Concrete WISDOM: identifying tangible objects

Page 17

How can we learn a rich variety of visual concepts?

Page 18

Image Sense Disambiguation

Would rather watch… Suicide watch

Hurricane, tornado watch

Watch out!

Celebrity watch

Page 19

Text contexts

icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel s wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website

Page 20

Topic 1: rolex service repair battery omega replica tag heuer breitling swiss replace gucci button price band

Topic 2: new world media right said house april obama islam march bush war american time

Latent Dirichlet allocation (LDA) (Blei et al. ‘03)

• One of several techniques for discovering latent dimensions in bag-of-words data

[Figure: LDA plate diagram with variables α, θ, z, w, β, φ and plates K, M, N_d; the chain word, topic, document is annotated with P(w|z) and P(z|d)]
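
The two quantities labeled in the diagram combine into a per-document word distribution by summing over topics, P(w|d) = Σ_z P(w|z) P(z|d). A small sketch of that sum follows; the vocabulary is borrowed from the two example topics on this slide, while the probabilities and names are invented for illustration.

```python
# Toy topic-word distributions P(w|z). The numbers are made up for illustration.
p_w_given_z = {
    "topic1": {"rolex": 0.5, "battery": 0.3, "war": 0.0, "media": 0.2},
    "topic2": {"rolex": 0.0, "battery": 0.1, "war": 0.5, "media": 0.4},
}

# Toy topic mixture P(z|d) for a single document.
p_z_given_d = {"topic1": 0.75, "topic2": 0.25}


def word_distribution(p_w_given_z, p_z_given_d):
    """Return P(w|d) = sum_z P(w|z) * P(z|d) for every vocabulary word."""
    vocab = next(iter(p_w_given_z.values())).keys()
    return {
        w: sum(p_w_given_z[z][w] * p_z_given_d[z] for z in p_z_given_d)
        for w in vocab
    }


p_w_given_d = word_distribution(p_w_given_z, p_z_given_d)
```

A document that is mostly "topic1" puts most of its word mass on watch-related terms, which is the property the later slides rely on when mapping contexts to topics.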

Page 21

Latent Topics

icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website

Page 22

Overview of approaches to web-based object model learning

• Some learn only from image features
  – (Li et al. 07) bootstrap from labeled images
  – (Fergus et al. 05) select correct image topic
• Some incorporate text features
  – (Schroff et al. 07) use a category-independent text classifier
  – (Berg and Forsyth 06) ask user to sort text topics
• None address polysemy directly
  – (Loeff et al. 06) do image sense discrimination, not identification
• All rely on labeled images of correct sense

Page 23

Next

• Audio-visual object recognition
  – Related work
  – Fusion model and experiments
• Unsupervised text and image models
  – Related work
  – WISDOM: probabilistic dictionary-based image sense model*
  – Concrete WISDOM: identifying tangible objects

*see Saenko and Darrell. Unsupervised learning of visual sense models for polysemous words. NIPS 2008.

Page 24

How can we ground image senses in the absence of labeled examples?

WORDNET: Noun
• S: (n) watch, ticker (a small portable timepiece)
• S: (n) watch (a period of time (4 or 2 hours) during which some of a ship's crew are on duty)
• S: (n) watch, vigil (a purposeful surveillance to guard or observe)
• S: (n) watch (the period during which someone (especially a guard) is on duty)
• S: (n) lookout, lookout man, sentinel, sentry, watch, spotter, scout, picket (a person employed to keep watch for some anticipated event)
• S: (n) vigil, watch (the rite of staying awake for devotional purposes (especially on the eve of a religious festival))

WIKIPEDIA: Watch may also refer to:
• Watch system, a period of work duty
• Tropical cyclone warnings and watches, alerts issued to coastal areas threatened by severe storms
• Watch (Unix), a Unix command
• Watch (TV channel), a TV station launching in Autumn 2008
• Watch (computer programming)
• Help:Watching pages on Wikipedia
• Watch (dog), name of the pet dog in the Boxcar Children

D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL, 1995.

Page 25

Sense-specific classifier

training images

Web Image Sense DictiOnary Model

Search Engine Watch
Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search ...
searchenginewatch.com/

watch - MDC
Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in ...
developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch

dictionary definitions

unlabeled text

dictionary model P( sense | data)

WISDOM does:
1. image sense disambiguation
2. dataset collection
3. classification of unseen images

noun

web images

fosil wrist watch a800 x 628 - 107k - jpg

amgmedia.com

watch-1(ticker)

Page 26

WISDOM: Using dictionary entries to ground senses

• Use entry text to learn a probability distribution over words for that sense

• Problem: entries contain very little text
  – Expand by adding synonyms, example sentences, etc.
  – Still, very few words are covered!

• S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
  – direct hyponym / full hyponym
    • S: (n) house mouse, Mus musculus (brownish-grey Old World mouse now a common household pest worldwide)
    • S: (n) harvest mouse, Micromyx minutus (small reddish-brown Eurasian mouse inhabiting e.g. cornfields)
    • S: (n) field mouse, fieldmouse (any nocturnal Old World mouse of the genus Apodemus inhabiting woods and fields and gardens)
    • S: (n) nude mouse (a mouse with a genetic defect that prevents them from growing hair and also prevents them from immunologically rejecting human cells and tissues; widely used in preclinical trials)
    • S: (n) wood mouse (any of various New World woodland mice)
  – direct hypernym / inherited hypernym / sister term
    • S: (n) rodent, gnawer (relatively small placental mammals having a single pair of constantly growing incisor teeth specialized for gnawing)

Page 27

WISDOM: Probabilistic dictionary-based model

• Main idea:

– Using LDA, learn latent sense-like dimensions on a large amount of related text,

– Model dictionary senses in LDA space:

• Map image contexts to topics
• Map topics to senses

Search Engine Watch
Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search ...
searchenginewatch.com/

watch - MDC
Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in ...
developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch

unlabeled text

LDA

Page 28

WISDOM sense model

• Given a query word with sense s with values in set {1,…,S}, and a text document d, the probability of sense is

[Figure: graphical model with nodes d, z, s and plate N]

• Define the likelihood of topic z given sense s with entry words e_s = w_1,…,w_{E_s} as

• To compute probability of sense given topic
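
A minimal sketch of how the three quantities on this slide could fit together, under assumptions that are not stated on the slide: P(s|d) is taken as Σ_z P(s|z) P(z|d), the topic likelihood P(z|s) is taken as the average of P(w|z) over the entry words, and P(s|z) follows from Bayes' rule with a uniform prior over senses. All names and numbers below are hypothetical.

```python
def topic_given_sense(entry_words, p_w_given_z):
    """Assumed P(z|s): average P(w|z) over the sense's dictionary entry words."""
    return {
        z: sum(p_w.get(w, 0.0) for w in entry_words) / len(entry_words)
        for z, p_w in p_w_given_z.items()
    }


def sense_given_topic(entries, p_w_given_z):
    """P(s|z) via Bayes' rule, assuming a uniform prior over senses."""
    p_z_s = {s: topic_given_sense(ws, p_w_given_z) for s, ws in entries.items()}
    p_s_z = {}
    for z in p_w_given_z:
        norm = sum(p_z_s[s][z] for s in entries)
        # Fall back to uniform if no entry word has any mass under this topic.
        p_s_z[z] = {
            s: (p_z_s[s][z] / norm if norm else 1.0 / len(entries))
            for s in entries
        }
    return p_s_z


def sense_given_document(p_s_z, p_z_given_d):
    """P(s|d) = sum_z P(s|z) * P(z|d)."""
    senses = next(iter(p_s_z.values())).keys()
    return {
        s: sum(p_s_z[z][s] * p_z_given_d[z] for z in p_z_given_d) for s in senses
    }


# Hypothetical topics and dictionary entries for the query word "watch".
p_w_given_z = {
    "z1": {"wrist": 0.4, "strap": 0.3, "portable": 0.2, "duty": 0.0, "guard": 0.1},
    "z2": {"wrist": 0.0, "strap": 0.0, "portable": 0.1, "duty": 0.5, "guard": 0.4},
}
entries = {
    "ticker": ["small", "portable", "timepiece"],
    "vigil": ["surveillance", "guard", "observe"],
}

p_s_z = sense_given_topic(entries, p_w_given_z)
p_s_d = sense_given_document(p_s_z, {"z1": 0.9, "z2": 0.1})
```

Note how few entry words overlap with the topic vocabulary even in this toy case, which mirrors the coverage problem raised on the earlier slide about dictionary entries.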

Page 29

WISDOM: Incorporating Image Features

• Use LDA to discover visual topics v=1,…,L,

• Then estimate the conditional probability P(s|v)

• Given a test image di*, we can compute

• Combine contributions of image and text:

Page 30

WISDOM classifier

SVM classifier

training images

Search Engine Watch
Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search ...
searchenginewatch.com/

watch - MDC
Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in ...
developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch

dictionary definitions

unlabeled text

dictionary model P( sense | data)

noun

web images

fosil wrist watch a800 x 628 - 107k - jpg

amgmedia.com

watch-1(ticker)

Page 31

Evaluation datasets

• Collected by querying Image Search
  – MIT-ISD: bass, face, mouse, speaker, watch
  – MIT-OFFICE: cellphone, fork, hammer, keyboard, mug, pliers, scissors, stapler, telephone, watch
  – UIUC-ISD: bass, crane, squash

core related core related unrelated ???

Page 32

Experimental Setup

1. Task: Image sense disambiguation (ISD) in search results
   – Separate images according to visual sense
   – “core” labels are positive class, “related” and “unrelated” negative
   – Metrics: true positives vs. false positives (ROC), recall-precision curve (RPC)
2. Task: object classification in a novel image
   – Classify image as having correct object category or not
   – “core” labels are positive class, other keywords’ “core” senses are negative class
   – Metric: percent correct
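
The two curve metrics for the ISD task can be sketched as follows, assuming each image has a sense score and a binary label (1 for "core", 0 otherwise). The helper names and toy data are illustrative, and tied scores are not given any special handling here.

```python
def ranked_counts(scores, labels):
    """Yield cumulative (true positives, false positives) down the ranked list."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        yield tp, fp


def roc_points(scores, labels):
    """(false positive rate, true positive rate) after each ranked image."""
    pos = sum(labels)
    neg = len(labels) - pos
    return [(fp / neg, tp / pos) for tp, fp in ranked_counts(scores, labels)]


def rpc_points(scores, labels):
    """(recall, precision) after each ranked image."""
    pos = sum(labels)
    return [(tp / pos, tp / (tp + fp)) for tp, fp in ranked_counts(scores, labels)]


# Hypothetical sense scores for five search results; 1 marks a "core" image.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]

roc = roc_points(scores, labels)
rpc = rpc_points(scores, labels)
```

Both curves are read off the same ranking: ROC normalizes false positives by the number of negatives, while the RPC reports precision at each recall level and so is more sensitive to how many of the top-ranked results are wrong.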

Page 33

ISD example results

squash: sports

squash: vegetable

bass: musical instrument

bass: fish

bass: raw web image data

squash: raw web image data

Page 34

ISD Results: ROC using each WordNet sense for BASS

[Figure: ROC curves for BASS, true positive rate vs. false positive rate. Legend: yahoo; musical range; polyph. range; male singer; sea bass; freshwater bass; basso, voice; instrument; spiny fish]

Page 35

ISD Results: RPC using true sense

yahoo wisdom

Retrieval of core senses on UIUC-ISD

Page 36

Results: object classification

• Baseline approach:
  – Automatically generate sense-specific keywords from WordNet
  – Append word to synonyms and direct hypernyms
  – Limit queries to 3 terms
  – Example: mouse + computer, mouse + electronic device

• Plot shows average accuracy across five objects in the MIT-ISD dataset (each is a two-class problem with chance performance of 50%)

[Figure: accuracy (y-axis, 55% to 85%) vs. number of training images (x-axis, 50 to 300) for baseline and wisdom]
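
The baseline's keyword generation on this slide can be sketched roughly as below. The function name, the rule for dropping the repeated query word, and the last phrase in the input list are assumptions made for illustration; the first two phrases follow the slide's "mouse" example.

```python
def build_queries(word, related_phrases, max_terms=3):
    """Pair the query word with each synonym or hypernym, capped at max_terms terms."""
    queries = []
    for phrase in related_phrases:
        # Drop the query word from the phrase so "computer mouse" adds only "computer".
        extra = [t for t in phrase.split() if t != word]
        terms = [word] + extra
        if extra and len(terms) <= max_terms and terms not in queries:
            queries.append(terms)
    return [" ".join(t) for t in queries]


# Synonym and direct hypernym for the computer sense of "mouse", plus one
# made-up longer phrase to show the 3-term limit rejecting a query.
related = ["computer mouse", "electronic device", "hand operated pointing device"]
queries = build_queries("mouse", related)
```

Here the first two phrases yield "mouse computer" and "mouse electronic device", matching the example on the slide, while the made-up four-word phrase is dropped by the term limit.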

Page 37

Next

• Audio-visual object recognition
  – Related work
  – Fusion model and experiments
• Unsupervised text and image models
  – Related work
  – WISDOM: probabilistic dictionary-based image sense model
  – Concrete WISDOM: identifying tangible objects*

*see Saenko and Darrell, Filtering Abstract Senses From Image Search Results, NIPS 2009.

Page 38

Query Word: “cup”

Online Dictionary
Word to search for: cup [Search Dictionary]
Noun

• cup (a small open container usually used for drinking; usually has a handle) "he put the cup back in the saucer"; "the handle of the cup was missing"

• cup, loving cup (a large metal vessel with two handles that is awarded as a trophy to the winner of a competition) "the school kept the cups in a special glass case"

• a major sporting event or competition “the world cup”, “the Stanley cup”

Concrete WISDOM

Object Sense: drinking container

Abstract Sense: sporting event

Object Sense: loving cup (trophy)

Removing Abstract Senses

Page 39

mouse

rodent

beaver

mammal

cow…

How can we identify abstract senses?

Mouse: Noun
• <noun.animal> S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
• <noun.state> S: (n) shiner, black eye, mouse (a swollen bruise caused by a blow to the eye)
• <noun.person> S: (n) mouse (person who is quiet or timid)
• <noun.artifact> S: (n) mouse, computer mouse (a hand-operated electronic device that controls the coordinates of a cursor on your computer screen…)

• Idea: use the ontological information available via WordNet
  – semantic relations between concepts (hypernym, part, etc.)
  – lexical tags:
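
As a rough illustration of the lexical-tag idea, the sketch below filters the "mouse" senses from this slide by tag. Which tags count as concrete is an assumption made here for illustration and is not the thesis's actual list; the glosses are shortened.

```python
# Lexical tags treated as concrete in this sketch (an assumption, not the thesis's list).
CONCRETE_TAGS = {"noun.animal", "noun.artifact"}

# Senses of "mouse" with the lexical tags shown on the slide (glosses shortened).
MOUSE_SENSES = [
    ("noun.animal", "mouse (small rodent)"),
    ("noun.state", "shiner, black eye, mouse (swollen bruise)"),
    ("noun.person", "mouse (person who is quiet or timid)"),
    ("noun.artifact", "mouse, computer mouse (hand-operated electronic device)"),
]


def concrete_senses(senses, concrete_tags=CONCRETE_TAGS):
    """Keep only the senses whose lexical tag is in the concrete set."""
    return [gloss for tag, gloss in senses if tag in concrete_tags]


kept = concrete_senses(MOUSE_SENSES)
```

With this tag set the rodent and the computer device survive, and the bruise and the timid person are filtered out. A fuller version would presumably also consult the semantic relations mentioned in the bullet, such as hypernyms.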

Page 40

Experimental Setup

Table: Concrete Senses Identified by WISDOM

• Task: ISD using concrete-sense WISDOM
  – all “core” and “related” labels of keyword are positive class, “unrelated” labels are negative class

Page 41

Results: Filtering visual senses

Yahoo Search: “telephone”

DICTIONARY

1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)

2: (n) telephone, telephony (transmitting speech at a distance)

Page 42

Results: Filtering visual senses

Artifact sense: “telephone”

DICTIONARY

1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)

2: (n) telephone, telephony (transmitting speech at a distance)

Page 43

Results: RPC of all concrete senses

Retrieval of core+related concrete senses on UIUC-ISD

yahoo wisdom

Page 44

Further Improvement: Topic adaptation

• Original LDA topics are learned on text-only unlabeled data
• Adapt to image-text data via semi-supervised Gibbs sampling
• E.g.: one of the “fork” topics:

product bike null tool tube seal set price oil knife spoon spring ship use item accessory handle shop order remove store custom home weight steel supply cap clamp fit false . . .

cutlery knife spoon product set price handle steel tool item stainless null bike tube seal oil knive kitchen utensil ship order use table spring supply design piece carve weight shop . . .

Page 45

“fork”: using original topics

unrelated: fork lift

road fork

bike fork

knife…

Page 46

“fork”: using adapted topics

unrelated: fork lift

road fork

bike fork

knife…

Page 47

Results on MIT-OFFICE

• The average area under the RPC improves from 0.47 to 0.57
• Detailed RPCs:

yahoo wisdom
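
For reference, an area under a recall-precision curve can be approximated with the trapezoidal rule and then averaged over keywords. This is a sketch under that assumption; the slide does not say how the area was integrated, and the curves below are made-up numbers, not results from the thesis.

```python
def area_under_rpc(points):
    """Trapezoidal area under (recall, precision) points ordered by increasing recall."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_area(curves):
    """Average the per-keyword areas, as in an 'average area under the RPC'."""
    return sum(area_under_rpc(pts) for pts in curves.values()) / len(curves)


# Hypothetical curves for two keywords.
curves = {
    "fork": [(0.0, 1.0), (0.5, 0.8), (1.0, 0.4)],
    "mug": [(0.0, 1.0), (1.0, 0.5)],
}

fork_area = area_under_rpc(curves["fork"])
average = mean_area(curves)
```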

Page 48

Conclusions

• Showed that combining speech with image input may be advantageous for object recognition

• Presented WISDOM, an unsupervised method to learn sense-specific object models from images and text harvested from the web

• Extended WISDOM to filter out non-physical word senses based on WordNet semantic structure

Page 49

Future work: WISDOM-enabled interactive training

speech text

image

dictionary

supervised classifier

WISDOM

Page 50