
Visual Sense Disambiguation: A Multimodal Approach

PhD thesis by Kate Saenko

Computer Science and AI Lab

Massachusetts Institute of Technology

Advisor: Trevor Darrell

PhD defense by Kate Saenko

The challenge of large scale object recognition

• How to get examples of 10,000+ categories?
– Collection of training images is time-consuming, subjective
– But the Web has billions of images!

• How to build precise models based on unlabeled image data?

• How to learn visual models on the fly, based on user input?


Multimodal context

speech text

image

collective knowledge


Main Contributions

• Proposed a method that combines images and spoken utterances to learn object models

• Developed an unsupervised approach that learns visual models from unlabeled images, text, and dictionaries

This is a bag…

… The Tote is the perfect example of two handbag design principles that ... The lines of this tote are incredibly sleek, but ... The semi buckles that form the handle attachments are ...


This is a bag


Noun
• bag, container (a flexible container with a single opening)
• bag, handbag, pocketbook, purse (a container used for carrying money and small personal items or accessories (especially by women))
• bag, travelling bag, suitcase (a rectangular container for carrying clothes)



Outline

• Audio-visual object recognition
– Related work
– Fusion model and experiments*

• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects

*see Saenko and Darrell. Object category recognition using probabilistic fusion of speech and image classifiers. MLMI, 2007.


Audio-visual object recognition

speech text

image

dictionary


Task: object recognition with audio-visual input*

Speech recognizer

Speech DB

*e.g. BIRON robot, see S. Li and B. Wrede. "Why and how to model multi-modal interaction for a mobile robot companion." In Proc. AAAI, 2007.

lamp, lamp, lamp, lamp

label <lamp>+


Speech, image can be ambiguous…

a pan...

That’s a pen!

Copy machine..

ant → fan

face → bass

piano → cannon


Proposal: use both channels to help disambiguate underlying word

object recognizer


Fusion of speech and image classifiers

Object classifier

Speech recognizer

Speech DB

Image classifier

Image DB

lamp

• Improve existing method by using both modalities

• Explore late fusion of classifier outputs

– Mean rule
– Product rule
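The two late-fusion rules named above can be illustrated with a minimal sketch. It assumes each classifier outputs a posterior over the same label set; the labels and probabilities are made-up illustration values (loosely echoing the pen/pan confusion), and the function names are hypothetical, not taken from the thesis.

```python
from math import prod


def normalize(scores):
    """Rescale non-negative scores so that they sum to one."""
    total = sum(scores.values())
    return {label: value / total for label, value in scores.items()}


def fuse(posteriors, rule="product"):
    """Late fusion of classifier posteriors defined over the same labels."""
    labels = posteriors[0].keys()
    if rule == "mean":
        fused = {c: sum(p[c] for p in posteriors) / len(posteriors) for c in labels}
    elif rule == "product":
        fused = {c: prod(p[c] for p in posteriors) for c in labels}
    else:
        raise ValueError("rule must be 'mean' or 'product'")
    return normalize(fused)


# Hypothetical posteriors: speech slightly prefers "pan", the image prefers "pen".
speech = {"pen": 0.45, "pan": 0.50, "fan": 0.05}
image = {"pen": 0.60, "pan": 0.10, "fan": 0.30}

mean_fused = fuse([speech, image], rule="mean")
product_fused = fuse([speech, image], rule="product")
best_mean = max(mean_fused, key=mean_fused.get)
best_product = max(product_fused, key=product_fused.get)
```

In this toy case both rules recover "pen" even though the speech channel alone would have chosen "pan". The product rule penalizes any label that one modality considers unlikely, while the mean rule is more forgiving of a single poor estimate.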


Experiments with 101 objects

• Asked users to speak object name for Caltech101, added noise
• Plot shows benefit from fusion across noise levels


Remaining issues…

object recognizer


Unsupervised object models

speech text

image

dictionary


Next

• Audio-visual object recognition
– Related work
– Fusion model and experiments

• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects


How can we learn a rich variety of visual concepts?


Image Sense Disambiguation

Would rather watch… Suicide watch

Hurricane, tornado watch

Watch out!

Celebrity watch



Text contexts

icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel s wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website




















Topic 1: rolex service repair battery omega replica tag heuer breitling swiss replace gucci button price band …

Topic 2: new world media right said house april obama islam march bush war american time …

Latent Dirichlet allocation (LDA) (Blei et al. ‘03)

• One of several techniques for discovering latent dimensions in bag-of-words data

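[Figure: LDA plate diagram with hyperparameters α and β, per-document topic distribution θ, per-topic word distribution φ, topic assignment z, and word w; plates K (topics), M (documents), and N_d (words in document d). Each topic defines a word distribution P(w|z); each document defines a topic distribution P(z|d).]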
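For reference, the standard generative process of smoothed LDA can be written out using the symbols that appear in the diagram (α, β, θ, φ, z, w, K, M, N_d). This is the textbook formulation from Blei et al., given here as a reading aid rather than as a transcription of the slide.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,M \\
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}) \\
P(w \mid d) &= \sum_{z=1}^{K} P(w \mid z)\, P(z \mid d)
\end{align*}
```

The last line is the mixture view used on the following slides: a document is summarized by P(z|d), and each topic by P(w|z).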


Latent Topics

icrystal rfid wrist watch features watch masterpiece innovative watch making craftsmanship absolute precision fine charm high scratch resistance anti-allergenic characteristics make chronometer true jewel wrist water proof sleek stylish wrist watch solar powered available watch ticket key purse identity card special offer place order rfid wrist watch absolutely free rfid watch black wrist strap rfid watch orange wrist strap rfid watch stainless steel privacy disclaimer copyright icrystal pty website




















Overview of approaches to web-based object model learning

• Some learn only from image features
– (Li et al. 07) bootstrap from labeled images
– (Fergus et al. 05) select correct image topic

• Some incorporate text features
– (Schroff et al. 07) use a category-independent text classifier
– (Berg and Forsyth 06) ask user to sort text topics

• None address polysemy directly
– (Loeff et al. 06) do image sense discrimination, not identification

• All rely on labeled images of correct sense


Next


• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model*
– Concrete WISDOM: identifying tangible objects

*see Saenko and Darrell. Unsupervised learning of visual sense models for polysemous words. NIPS 2008.


How can we ground image senses in the absence of labeled examples?

WORDNET: Noun
• S: (n) watch, ticker (a small portable timepiece)
• S: (n) watch (a period of time (4 or 2 hours) during which some of a ship's crew are on duty)
• S: (n) watch, vigil (a purposeful surveillance to guard or observe)
• S: (n) watch (the period during which someone (especially a guard) is on duty)
• S: (n) lookout, lookout man, sentinel, sentry, watch, spotter, scout, picket (a person employed to keep watch for some anticipated event)
• S: (n) vigil, watch (the rite of staying awake for devotional purposes (especially on the eve of a religious festival))

WIKIPEDIA: Watch may also refer to:
• Watch system, a period of work duty
• Tropical cyclone warnings and watches, alerts issued to coastal areas threatened by severe storms
• Watch (Unix), a Unix command
• Watch (TV channel), a TV station launching in Autumn 2008
• Watch (computer programming)
• Help:Watching pages on Wikipedia
• Watch (dog), name of the pet dog in the Boxcar Children

D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. ACL, 1995.



Sense-specific classifier

training images

Web Image Sense DictiOnary Model

Search Engine Watch: Search Engine Watch is the authoritative guide to search engine marketing (SEM) and search engine optimization (SEO), offering the latest news about search ... (searchenginewatch.com)

watch - MDC: Watches for assignment to a property named prop in this object, calling handler(prop, oldval, newval) whenever prop is set and storing the return value in ... (developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/Object/watch)


dictionary definitions

unlabeled text

dictionary model P( sense | data)

WISDOM does:
1. image sense disambiguation
2. dataset collection
3. classification of unseen images

noun

web images

fosil wrist watch a800 x 628 - 107k - jpg

amgmedia.com

watch-1(ticker)


























WISDOM: Using dictionary entries to ground senses

• Use entry text to learn a probability distribution over words for that sense

• Problem: entries contain very little text
– Expand by adding synonyms, example sentences, etc.
– Still, very few words are covered!

• S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)

– direct hyponym / full hyponym
• S: (n) house mouse, Mus musculus (brownish-grey Old World mouse now a common household pest worldwide)
• S: (n) harvest mouse, Micromyx minutus (small reddish-brown Eurasian mouse inhabiting e.g. cornfields)
• S: (n) field mouse, fieldmouse (any nocturnal Old World mouse of the genus Apodemus inhabiting woods and fields and gardens)
• S: (n) nude mouse (a mouse with a genetic defect that prevents them from growing hair and also prevents them from immunologically rejecting human cells and tissues; widely used in preclinical trials)
• S: (n) wood mouse (any of various New World woodland mice)

– direct hypernym / inherited hypernym / sister term
• S: (n) rodent, gnawer (relatively small placental mammals having a single pair of constantly growing incisor teeth specialized for gnawing)



WISDOM: Probabilistic dictionary-based model

• Main idea:
– Using LDA, learn latent sense-like dimensions on a large amount of related text
– Model dictionary senses in LDA space:
  • Map image contexts to topics
  • Map topics to senses



unlabeled text

LDA






















WISDOM sense model

• Given a query word with sense s taking values in the set {1,…,S}, and a text document d, the probability of the sense is

[Figure: graphical model over document d, topic z, and sense s, with plate N]

• Define the likelihood of topic z given sense s with entry words e_s = w_1,…,w_{E_s} as

• To compute the probability of sense given topic
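The equations for these three bullets are not preserved in this transcript. The sketch below shows one form that is consistent with the definitions given: it assumes P(s|d) is obtained by summing P(s|z) P(z|d) over topics, that the topic likelihood for a sense is the average of P(w|z) over the sense's entry words, and that P(s|z) follows from Bayes' rule with a uniform sense prior. All three choices are assumptions made for illustration, as are the function names and toy numbers.

```python
def sense_given_topic(word_given_topic, entries, prior=None):
    """Return P(s|z) for every topic as a list of {sense: probability} dicts.

    word_given_topic: list over topics of {word: P(w|z)}
    entries: {sense: list of dictionary entry words for that sense}
    """
    senses = list(entries)
    if prior is None:
        prior = {s: 1.0 / len(senses) for s in senses}  # assumed uniform prior
    table = []
    for topic in word_given_topic:
        # Assumed likelihood: average P(w|z) over the entry words of each sense.
        likelihood = {
            s: sum(topic.get(w, 0.0) for w in entries[s]) / len(entries[s])
            for s in senses
        }
        joint = {s: likelihood[s] * prior[s] for s in senses}
        total = sum(joint.values())
        if total > 0:
            table.append({s: joint[s] / total for s in senses})
        else:
            table.append(dict(prior))  # topic covers no entry words
    return table


def sense_given_document(topic_given_doc, s_given_z):
    """P(s|d) = sum over topics j of P(s|z=j) * P(z=j|d)."""
    senses = s_given_z[0].keys()
    return {
        s: sum(p_z * s_given_z[j][s] for j, p_z in enumerate(topic_given_doc))
        for s in senses
    }


# Toy two-topic model for the query word "watch" (made-up numbers).
word_given_topic = [
    {"wrist": 0.4, "timepiece": 0.3, "strap": 0.2, "guard": 0.1},
    {"guard": 0.5, "duty": 0.3, "surveillance": 0.2},
]
entries = {
    "timepiece": ["timepiece", "wrist", "portable"],
    "vigil": ["surveillance", "guard", "observe"],
}
s_given_z = sense_given_topic(word_given_topic, entries)
p_sense = sense_given_document([0.8, 0.2], s_given_z)  # P(z|d) of one web page
```

Because dictionary entries are short, many entry words never occur in a topic; the fallback to the prior in that case is one simple way to handle it, not necessarily what WISDOM does.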


WISDOM: Incorporating Image Features

• Use LDA to discover visual topics v=1,…,L,

• Then estimate the conditional probability P(s|v)

• Given a test image di*, we can compute

• Combine contributions of image and text:
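The formulas for the image side are likewise missing from the transcript, so the following is only a sketch under stated assumptions: P(s|v) is estimated from unlabeled image/text pairs by weighting each page's text-based sense posterior with the image's visual-topic proportions, a test image is scored by summing P(s|v) P(v|d_i) over visual topics, and image and text posteriors are combined with a normalized product. The estimation and combination rules, the function names, and the numbers are all illustrative assumptions.

```python
def estimate_sense_given_visual_topic(visual_topics, text_senses):
    """Estimate P(s|v) from paired data.

    visual_topics[i][v] = P(v | image i)
    text_senses[i][s]   = P(s | text surrounding image i)
    """
    n_topics = len(visual_topics[0])
    senses = list(text_senses[0])
    table = []
    for v in range(n_topics):
        weights = {
            s: sum(pv[v] * ps[s] for pv, ps in zip(visual_topics, text_senses))
            for s in senses
        }
        total = sum(weights.values())
        table.append({s: weights[s] / total for s in senses})
    return table


def image_sense(p_v_given_image, s_given_v):
    """P(s | image) = sum over visual topics v of P(s|v) * P(v | image)."""
    senses = s_given_v[0].keys()
    return {
        s: sum(p_v * s_given_v[v][s] for v, p_v in enumerate(p_v_given_image))
        for s in senses
    }


def combine(p_image, p_text):
    """Assumed combination rule: normalized product of the two posteriors."""
    joint = {s: p_image[s] * p_text[s] for s in p_image}
    total = sum(joint.values())
    return {s: joint[s] / total for s in joint}


# Two training pairs and one test image with its surrounding text (toy values).
visual_topics = [[0.9, 0.1], [0.2, 0.8]]
text_senses = [
    {"timepiece": 0.8, "vigil": 0.2},
    {"timepiece": 0.1, "vigil": 0.9},
]
s_given_v = estimate_sense_given_visual_topic(visual_topics, text_senses)
p_image = image_sense([0.7, 0.3], s_given_v)
p_combined = combine(p_image, {"timepiece": 0.6, "vigil": 0.4})
```

When both modalities lean toward the same sense, the product sharpens the posterior; when they disagree, the more confident modality dominates.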


WISDOM classifier

SVM classifier

training images



dictionary definitions

unlabeled text

dictionary model P( sense | data)

noun

web images

fosil wrist watch a800 x 628 - 107k - jpg

amgmedia.com

watch-1(ticker)































Evaluation datasets

• Collected by querying Image Search
– MIT-ISD: bass, face, mouse, speaker, watch
– MIT-OFFICE: cellphone, fork, hammer, keyboard, mug, pliers, scissors, stapler, telephone, watch
– UIUC-ISD: bass, crane, squash

[Figure: example images labeled core, related, and unrelated]


Experimental Setup

1. Task: image sense disambiguation (ISD) in search results
– Separate images according to visual sense
– “core” labels are the positive class; “related” and “unrelated” are negative
– Metrics: true positives vs. false positives (ROC), recall-precision curve (RPC)

2. Task: object classification in a novel image
– Classify image as having correct object category or not
– “core” labels are the positive class; other keywords’ “core” senses are the negative class
– Metric: percent correct
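The metrics named in the two tasks can be computed from a ranked list of images as in the sketch below. It assumes binary labels (1 for "core", 0 for everything else), ignores tied scores, and uses made-up scores; the function names are hypothetical.

```python
def ranking_curves(scores, labels):
    """Return ROC points (fpr, tpr) and recall-precision points (recall, precision).

    Images are ranked by descending score; labels are 1 (positive) or 0 (negative).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    roc = [(0.0, 0.0)]
    rpc = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        roc.append((fp / n_neg, tp / n_pos))
        rpc.append((tp / n_pos, tp / (tp + fp)))
    return roc, rpc


def accuracy(predicted, actual):
    """Fraction of predictions that match the true class (percent correct / 100)."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy search result: five images scored by a sense model, two of them "core".
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]
roc, rpc = ranking_curves(scores, labels)
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Here `rpc[0]` is (0.5, 1.0): after the top-ranked image, half of the core images are retrieved at perfect precision.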


ISD example results

squash: sports

squash: vegetable

bass: musical instrument

bass: fish

bass: raw web image data

squash: raw web image data


ISD Results: ROC using each WordNet sense for BASS

[Figure: ROC plots for BASS, true positive rate vs. false positive rate. Curves: yahoo, musical range, polyph. range, male singer, sea bass, freshwater bass, basso/voice, instrument, spiny fish.]


ISD Results: RPC using true sense

[Figure: recall-precision curves; legend: yahoo, wisdom]

Retrieval of core senses on UIUC-ISD


Results: object classification

• Baseline approach:
– Automatically generate sense-specific keywords from WordNet
– Append word to synonyms and direct hypernyms
– Limit queries to 3 terms
– Example: mouse + computer, mouse + electronic device

• Plot shows average accuracy across five objects in the MIT-ISD dataset (each is a two-class problem with chance performance of 50%)
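The baseline's query generation can be sketched as follows. The sketch reads "limit queries to 3 terms" as at most three words per query, which is an interpretation rather than something the slide states, and it takes the synonym and hypernym lists as plain inputs instead of querying WordNet. The function name is hypothetical.

```python
def baseline_queries(word, synonyms, hypernyms, max_terms=3):
    """Build sense-specific search queries by pairing the word with related terms."""
    queries = []
    for phrase in synonyms + hypernyms:
        # Put the query word first and drop it from the phrase to avoid repetition.
        terms = [word] + [t for t in phrase.split() if t != word]
        if len(terms) < 2 or len(terms) > max_terms:
            continue  # phrase adds nothing new, or the query would be too long
        query = " ".join(terms)
        if query not in queries:
            queries.append(query)
    return queries


# Device sense of "mouse": synonyms and direct hypernym as listed in WordNet.
queries = baseline_queries(
    "mouse",
    synonyms=["mouse", "computer mouse"],
    hypernyms=["electronic device"],
)
```

For the device sense this yields "mouse computer" and "mouse electronic device", matching the example on the slide.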

[Figure: accuracy (55% to 85%) vs. number of training images (50 to 300). Curves: baseline, wisdom.]


Next


• Unsupervised text and image models
– Related work
– WISDOM: probabilistic dictionary-based image sense model
– Concrete WISDOM: identifying tangible objects*

*see Saenko and Darrell, Filtering Abstract Senses From Image Search Results, NIPS 2009.


Query Word: “cup”

Online Dictionary

Word to search for: cup

Noun

• cup (a small open container usually used for drinking; usually has a handle) "he put the cup back in the saucer"; "the handle of the cup was missing"

• cup, loving cup (a large metal vessel with two handles that is awarded as a trophy to the winner of a competition) "the school kept the cups in a special glass case"

• a major sporting event or competition "the world cup", "the Stanley cup"

Concrete WISDOM

Object Sense: drinking container

Abstract Sense: sporting event

Object Sense: loving cup (trophy)

Removing Abstract Senses


[Figure: fragment of the WordNet hierarchy with nodes mouse, rodent, beaver, mammal, cow, …]

How can we identify abstract senses?

Mouse: Noun
• <noun.animal> S: (n) mouse (any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails)
• <noun.state> S: (n) shiner, black eye, mouse (a swollen bruise caused by a blow to the eye)
• <noun.person> S: (n) mouse (person who is quiet or timid)
• <noun.artifact> S: (n) mouse, computer mouse (a hand-operated electronic device that controls the coordinates of a cursor on your computer screen…)

• Idea: use the ontological information available via WordNet
– semantic relations between concepts (hypernym, part, etc.)
– lexical tags, e.g. <noun.animal>, <noun.state>, <noun.person>, <noun.artifact>
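A minimal sketch of filtering by lexical tag, using the four "mouse" senses listed above as hard-coded data. Which tags count as concrete is an illustrative assumption here (the set below is hypothetical), and the sketch uses only the tags, not the hypernym relations that the idea also mentions.

```python
# Assumed set of lexical tags treated as physical, picturable objects.
CONCRETE_TAGS = {"noun.animal", "noun.artifact", "noun.food", "noun.plant", "noun.object"}

# (lexical tag, synonyms, short gloss) for the noun senses of "mouse".
MOUSE_SENSES = [
    ("noun.animal", "mouse", "small rodent"),
    ("noun.state", "shiner, black eye, mouse", "swollen bruise"),
    ("noun.person", "mouse", "person who is quiet or timid"),
    ("noun.artifact", "mouse, computer mouse", "hand-operated electronic device"),
]


def concrete_senses(senses, concrete_tags=CONCRETE_TAGS):
    """Keep only senses whose lexical tag is in the concrete set."""
    return [sense for sense in senses if sense[0] in concrete_tags]


kept = concrete_senses(MOUSE_SENSES)
kept_tags = [tag for tag, _, _ in kept]
```

With this tag set the rodent and the computer device survive, while the bruise and the timid person are filtered out as abstract.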



Experimental Setup

Table: Concrete Senses Identified by WISDOM

• Task: ISD using concrete-sense WISDOM
– all “core” and “related” labels of keyword are positive class, “unrelated” labels are negative class


Results: Filtering visual senses

Yahoo Search: “telephone”

DICTIONARY

1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)

2: (n) telephone, telephony (transmitting speech at a distance)



Results: Filtering visual senses

Artifact sense: “telephone”

DICTIONARY

1: (n) telephone, phone, telephone set (electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds)

2: (n) telephone, telephony (transmitting speech at a distance)




Results: RPC of all concrete senses

Retrieval of core+related concrete senses on UIUC-ISD

[Figure: recall-precision curves; legend: yahoo, wisdom]


Further Improvement: Topic adaptation

• Original LDA topics are learned on text-only unlabeled data
• Adapt to image-text data via semi-supervised Gibbs sampling
• E.g.: one of “fork” topics:

product bike null tool tube seal set price oil knife spoon spring ship use item accessory handle shop order remove store custom home weight steel supply cap clamp fit false . . .

cutlery knife spoon product set price handle steel tool item stainless null bike tube seal oil knive kitchen utensil ship order use table spring supply design piece carve weight shop . . .


“fork”: using original topics

unrelated: fork lift

road fork

bike fork

knife…


“fork”: using adapted topics

unrelated: fork lift

road fork

bike fork

knife…


Results on MIT-OFFICE

• The average area under the RPC improves from 0.47 to 0.57
• Detailed RPCs:

[Figure: recall-precision curves; legend: yahoo, wisdom]
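The summary number above is an area under a recall-precision curve, averaged over keywords. One common way to compute such an area is trapezoidal integration over (recall, precision) points, sketched below; whether the thesis used this exact integration scheme is not stated on the slide, and the points and per-keyword values are made up.

```python
def area_under_rpc(points):
    """Trapezoidal area under a list of (recall, precision) points."""
    pts = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_area(curves):
    """Average the area over several keywords' curves."""
    return sum(area_under_rpc(c) for c in curves) / len(curves)


# Toy curve: perfect precision up to 50% recall, dropping to 0.5 at full recall.
toy_curve = [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
toy_area = area_under_rpc(toy_curve)
toy_mean = mean_area([toy_curve, [(0.0, 0.5), (1.0, 0.5)]])
```

The toy curve has area 0.5 × 1.0 + 0.5 × 0.75 = 0.875; a flat curve at precision 0.5 has area 0.5, so the two-keyword mean is 0.6875.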


Conclusions

• Showed that combining speech with image input may be advantageous for object recognition

• Presented WISDOM, an unsupervised method to learn sense-specific object models from images and text harvested from the web

• Extended WISDOM to filter out non-physical word senses based on WordNet semantic structure


Future work: WISDOM-enabled interactive training

speech text

image

dictionary

supervised classifier

WISDOM
