© 2007 IBM Corporation
IBM Research TRECVID-2007 Video Retrieval System
Apostol (Paul) Natsev
IBM T. J. Watson Research Center
Hawthorne, NY 10532 USA
On Behalf Of: Murray Campbell, Alexander Haubold, John R. Smith, Jelena Tesic, Lexing Xie, Rong Yan
Acknowledgement
• IBM Research
  – Murray Campbell
  – Matthew Hill
  – Apostol (Paul) Natsev
  – Quoc-Bao Nguyen
  – Christian Penz
  – John R. Smith
  – Jelena Tesic
  – Lexing Xie
  – Rong Yan
• UIUC
  – Ming Liu
  – Xu Han
  – Xun Xu
  – Thomas Huang
• Columbia Univ.
  – Alexander Haubold
• Carnegie Mellon Univ.
  – Jun Yang
Part 1 of 2: Automatic Search
Outline
Overall System
Summary
Review of Baseline Retrieval Experts
Performance Analysis
Summary
System Overview
[System overview diagram. Features feed a bank of retrieval experts, whose outputs are fused into the submitted runs:
– Inputs: videos (keyframes, ASR/MT transcripts) and queries (text blurb plus example images).
– Features: grid features (color moments, wavelet texture); global features (color correlogram, color histogram, co-occurrence texture, edge histogram); speech (Text-LR5, Text-LR10); concepts (LSCOM-155, MediaNet-50, TV07-36).
– Retrieval experts: visual search (bagged SVM); text search; concept search via model vectors (bagged SVM), a statistical map (co-occurrence), a WordNet map (Brown stats), and a WordNet map (web stats).
– Fusion: query-independent (average) and query-dynamic (weighted average) over expert groups V, S, M1, M2, T, M3, yielding the runs Text, MSV.avg, TMSV.avg, TMSV.qdyn, and TMSVC.qdyn.]
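The two fusion schemes, query-independent averaging and query-dynamic weighted averaging, can be sketched as follows. This is a minimal illustration rather than the system's actual code: each expert is assumed to return a dict from shot id to raw score, and the per-query weights are assumed given.

```python
import numpy as np

def _normalize(scores):
    """Min-max normalize one expert's raw scores (a common pre-fusion step)."""
    vals = np.array(list(scores.values()), dtype=float)
    lo, hi = vals.min(), vals.max()
    rng = (hi - lo) or 1.0
    return {shot: (s - lo) / rng for shot, s in scores.items()}

def average_fusion(expert_scores):
    """Query-independent fusion: normalize each expert, then average."""
    fused = {}
    for scores in expert_scores:
        for shot, s in _normalize(scores).items():
            fused[shot] = fused.get(shot, 0.0) + s
    return {shot: v / len(expert_scores) for shot, v in fused.items()}

def weighted_fusion(expert_scores, weights):
    """Query-dynamic fusion: per-query weights on each normalized expert."""
    fused = {}
    for scores, w in zip(expert_scores, weights):
        for shot, s in _normalize(scores).items():
            fused[shot] = fused.get(shot, 0.0) + w * s
    return fused
```

With weights chosen per query (e.g. from the query type), the second routine corresponds to the .qdyn runs; with uniform weights it reduces to the .avg runs.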
Summary: What Worked and What Didn’t?
• Baseline retrieval experts
  – Speech/text-based
  – Concept-based (statistical mapping)
  – Concept-based (WordNet mapping, Brown corpus stats)
  – Concept-based (WordNet mapping, web stats)
  – Concept-based (query examples modeling)
  – Visual-based
• Experiments
  – Improving query-to-concept mapping: web-based stats
  – Leveraging external resources: type C runs
  – Query-dynamic multimodal fusion
Baseline Retrieval Experts: Review of Approaches
• Speech/Text-Based Retrieval
  – Auto-query expansion with JuruXML search engine (Volkmer et al., ICME’06)
• Visual-Based Retrieval
  – Bagged SVM modeling of query topic examples (Natsev et al., MM’05)
• Concept-Based Retrieval (G2 Statistical Map)
  – Based on co-occurrence of ASR terms and concepts (Natsev et al., MM’07)
• Concept-Based Retrieval (WordNet Map, Brown Stats)
  – Based on JCN similarity, IC from Brown Corpus (Haubold et al., ICME’06)
• Concept-Based Retrieval (WordNet Map, Web Stats)
  – Based on JCN similarity, IC from WWW sample
• Concept-Based Retrieval (Content-Based Modeling)
  – Bagged SVM modeling of query topic examples (Tesic et al., CIVR’07)
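The bagged-SVM experts above train an ensemble on the few query example keyframes against randomly drawn pseudo-negative bags and average the scores. The following is only a rough sketch of that general idea, not the Natsev or Tesic implementation: feature extraction is omitted, all names are hypothetical, and scikit-learn is used for the base SVM.

```python
import numpy as np
from sklearn.svm import SVC

def bagged_svm_scores(pos, unlabeled, n_bags=10, seed=0):
    """Score each unlabeled keyframe by averaging decision values of SVMs
    trained on query examples (pos) vs. random pseudo-negative bags."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(unlabeled))
    for _ in range(n_bags):
        # Sample a pseudo-negative bag of the same size as the positive set.
        neg_idx = rng.choice(len(unlabeled), size=len(pos), replace=False)
        X = np.vstack([pos, unlabeled[neg_idx]])
        y = np.r_[np.ones(len(pos)), np.zeros(len(neg_idx))]
        clf = SVC(kernel="rbf").fit(X, y)
        scores += clf.decision_function(unlabeled)
    return scores / n_bags
```

Averaging over bags dampens the effect of any single bag that happens to contain relevant shots among its pseudo-negatives.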
Improving Query-to-Concept Mapping
• WordNet similarity measures
  – Frequently used for query-to-concept mapping
  – Frequently based on Information Content (IC) to model term saliency
    • Resnik: evaluates information content (IC) of common root
    • Jiang-Conrath: evaluates IC of common root and ICs of terms
  – IC typically estimated from the 1961 Brown Corpus
    • IC from Brown Corpus outdated by >40 years
    • 76% of words in WordNet not in Brown Corpus (so IC = 0)
• Idea: create an approximation using the WWW
  – Perform frequency analysis over a large sample of web pages
  – Use Google page count as an indicator of frequency
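The Jiang-Conrath measure can be illustrated on toy data. The concept frequencies below are invented for the example; in the actual runs, IC is estimated from the Brown Corpus or, as proposed here, from a web sample.

```python
import math

# Hypothetical corpus counts; counts propagate up to the root concept.
freq = {"entity": 1000, "vehicle": 120, "car": 80, "boat": 30}
total = freq["entity"]

def ic(concept):
    """Information content: IC(c) = -log p(c)."""
    return -math.log(freq[concept] / total)

def jcn_similarity(c1, c2, lcs):
    """Jiang-Conrath: dist = IC(c1) + IC(c2) - 2*IC(lowest common subsumer);
    similarity is the inverse distance."""
    dist = ic(c1) + ic(c2) - 2 * ic(lcs)
    return 1.0 / dist if dist > 0 else float("inf")
```

Note how the measure depends entirely on the IC estimates: terms missing from the corpus get IC = 0, which is exactly the Brown Corpus coverage problem cited above.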
IC from Large-Scale Web Sampling
• Sample of web pages:
  – 1,231,929 documents (~1M)
  – 1,169,368,161 WordNet terms (~1B)
  – Distribution similar to Brown Corpus
  – Therefore: potentially useful as IC
• Google page count:
  – As a proxy for term frequency counts
  – Term frequency ≈ document frequency?
  – Experiments show a linear relationship
  – Therefore: potentially useful as IC
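Under the document-frequency ≈ term-frequency approximation above, IC can be estimated directly from page counts. A minimal sketch: the sample size is the one quoted on the slide, and the page counts passed in are whatever the search engine reports (a crude stand-in, not the system's exact estimator).

```python
import math

N = 1_231_929  # size of the web sample quoted on the slide

def web_ic(page_count):
    """IC from a page count: treat df/N as p(term), so IC = -log(df/N).
    Unseen terms are smoothed to a count of 1 rather than getting IC = 0."""
    p = min(max(page_count, 1) / N, 1.0)
    return -math.log(p)
```

Unlike the Brown Corpus estimate, this assigns a large but finite IC to rare terms instead of the degenerate IC = 0 for the 76% of WordNet missing from Brown.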
Outline
Overall System
Summary
Review of Baseline Retrieval Experts
Performance Analysis
Summary
Evaluation Results: Baseline Retrieval Experts
• Statistical concept-based run did not generalize
• Web-based IC led to a 20% improvement in WordNet runs
• Concept-based runs performed better than speech/text-based runs
• Content-based runs performed best
[Bar chart: "Comparison of Unimodal Retrieval Experts". Mean Average Precision of the six unimodal experts (Speech/Text-Based; Concepts: Co-occurrence Map, WordNet Brown Map, WordNet Web Map, Content-Based; Visual Content-Based), evaluated over all topics and over all topics but 219. MAP values range from 0.005 (co-occurrence concepts) to 0.0302 (visual content-based).]
Baseline Retrieval Experts: JAWS Analysis
• For the adventurous:
  – Stacked chart showing the contribution of each expert per topic
[Stacked chart: "Comparison of Unimodal Retrieval Experts". Relative contribution of each expert per topic (cook character, crowd, bicycle, street buildings, mountains, street night, waterfront, musical instruments, road from vehicle, street protest, door opened, boat, canal river, street market, bridge, woman interview, sheep goats, person telephone, people table, classroom, train, people stairs, keyboard, people dogs). Series: Speech/Text-Based; Concepts (Co-occurrence Map, WordNet Brown Map, WordNet Web Map, Content-Based); Visual Content-Based.]
Summary of Submitted and Unsubmitted IBM Runs
0.0272CTMSV-C.qdynTMSWSCVqdynQuery-dynamic fusion, type C (T + MSW + SC + V)
0.0372C-TMWSCVavgMultimodal AVG fusion, type C (T + MW + SC + V)
0.0303C-TMSWSCVavgMultimodal AVG fusion, type C (T + MSW + SC + V)
0.0210ATMSV.qdynTMSSVqdynQuery-dynamic fusion, type A (T + MS + S + V)
0.0249C-SCConcept baseline (content-based, type C)
0.0638C-OracleOracle fusion (pick best run for each topic)
0.0302A-VVisual baseline
TMSSVavg
MBSVavg
MSSVavg
S
MW
MB
MS
T
Code
0.0231ATMSVMultimodal AVG fusion, type A (T + MS + S + V)
0.0213AMSVNon-text baseline (MS + S + V)
0.0228A-Concept baseline (content-based, type A)
0.0177C-Concept baseline (WordNet, Web stats)
0.0146A-Concept baseline (WordNet, Brown stats)
0.0301A-Non-text baseline (MB + S + V)
0.0045A-Concept baseline (stat)
0.0149ATextText baseline
MAPTypeRun IDDescription
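The oracle fusion row is computed by picking, for every topic, the best AP achieved by any run, then averaging over topics. A minimal sketch (run and topic ids hypothetical):

```python
def oracle_map(ap_per_run):
    """Oracle fusion MAP: for each topic take the best AP across runs,
    then average over topics.

    ap_per_run: dict run_id -> {topic_id: average precision}
    """
    topics = sorted(next(iter(ap_per_run.values())))
    best = [max(run[t] for run in ap_per_run.values()) for t in topics]
    return sum(best) / len(best)
```

This is an upper bound for any per-topic run-selection strategy, which is why the oracle score (0.0638) sits well above every submitted run.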
Overall Performance Comparison
• Best run from each organization shown
• Submitted IBM runs in third tier; improve by dropping failed run
• IBM runs achieve highest AP scores on 5 of 24 topics
[Bar chart: "Automatic Search (Best-of Team Performance)". Mean Average Precision of the best automatic-search run per team (run IDs of the form F_A_2_..., e.g. F_C_IBM_Oracle, F_A_2_IBM-Unofficial, F_C_2_IBM-TMSV-C.qdyn_3), MAP axis from 0.00 to 0.10. Highlighted: best IBM submitted run, best IBM unsubmitted run, and best IBM oracle fusion run.]
IBM Performance Relative to Best AP Per Topic
• Good (>90%): street night, street market, sheep/goat, road/vehicle, people telephone, canal/river
• So-so (40–60%): mountains, waterfront, boat, train, crowd, street protest, street buildings
• Bad (<25%): people table, people stairs, people dogs, people keyboard
[Bar chart: "IBM Automatic Search". Per-topic performance relative to the best per-topic run (0–100%), over all 24 topics.]
Summary: What Worked and What Didn’t?
• Baseline retrieval experts
  – Speech/text-based
  – Concept-based (statistical mapping)
  – Concept-based (WordNet mapping, Brown corpus stats)
  – Concept-based (WordNet mapping, web stats)
  – Concept-based (query examples modeling)
  – Visual-based
• Experiments
  – Improving query-to-concept mapping: web-based stats
  – Leveraging external resources: type C runs
  – Query-dynamic multimodal fusion
Part 2 of 2: Interactive Search
Annotation-based Interactive Retrieval
• Model interactive search as a video annotation task
  – Consider each query topic as a keyword; annotate video keyframes
  – Extends CMU’s Extreme Video Retrieval system [Hauptmann et al., MM’06]
• Hybrid annotation system
  – Minimize annotation time by leveraging two annotation interfaces
  – Tagging: Flickr, ESP game [von Ahn et al., CHI’04]
  – Browsing: IBM’s EVA [Volkmer et al., MM’05], CMU’s EVR [Hauptmann et al., MM’06]
• Formal analysis by modeling the interactive retrieval process
  – Tagging-based annotation time per shot
  – Browsing-based annotation time per shot
Manual Annotation (I): Tagging
• Users associate one image at a time with one or more keywords (the most widely used manual annotation approach)
• Advantages
  – Users freely choose arbitrary keywords to annotate
  – Only relevant keywords need to be annotated
• Disadvantages
  – “Vocabulary mismatch” problem
  – Inefficient to devise and type keywords
• Suitable for annotating rare keywords
Formal Time Model for Tagging
• Key factors for the tagging time model
  – Number of keywords K_l for image l
  – Annotation time t'_{fk} for the k-th word
  – Initial setup time t'_s for a new image
  – Noise term ε (zero-mean distribution)
• Annotation time for one image
• Total expected annotation time for the entire collection of L images
  – Assumption: the expected time to annotate the k-th word is a constant t_f
T' = t'_{f1} + \cdots + t'_{fK_l} + t'_s + \epsilon = \sum_{k} t'_{fk} + t'_s + \epsilon

E(T_{total}) = \sum_{l} \Big( \sum_{k} E(t'_{fk}) + E(t'_s) \Big) = \sum_{l} \big( K_l\, t_f + t_s \big)
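The expected-time model can be evaluated directly: each image costs its keyword count times the per-word time, plus a fixed setup time. A minimal sketch (parameter values hypothetical):

```python
def expected_tagging_time(keywords_per_image, t_f, t_s):
    """E[T_total] = sum over images l of (K_l * t_f + t_s).

    keywords_per_image: list of keyword counts K_l, one per image
    t_f: expected time to annotate one word
    t_s: setup time for a new image
    """
    return sum(K * t_f + t_s for K in keywords_per_image)
```

Since the total is linear in the number of typed words, plotting cumulative time against words annotated should give a straight line, which is exactly the linear fit checked in the user study on the next slide.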
Validation of Tagging Time Model
• User study on TRECVID’05 development data
  – One user manually tagged 100 images using 303 keywords
  – If the model is correct, a linear fit should appear in the results
  – The annotation results fit the model very well
Manual Annotation (II) : Browsing
• Users associate multiple images with a single keyword at the same time
• Advantages
  – Efficient annotation of each image–word pair
  – No “vocabulary mismatch”
• Disadvantages
  – Time must be spent judging both relevant and irrelevant pairs
  – Must start from a controlled vocabulary
  – Only one keyword is annotated at a time
• Suitable for annotating frequent keywords
Formal Time Model for Browsing
• Key factors for the browsing time model
  – Number of relevant images L_k for a keyword k
  – Annotation time t'_p for a relevant image
  – Annotation time t'_n for an irrelevant image
  – Noise term ε (zero-mean distribution)
• Annotation time for all images w.r.t. a keyword
• Total expected annotation time for the entire collection of L images
  – Assumption: the expected time to annotate a relevant (irrelevant) image is constant
T' = \sum_{l=1}^{L_k} t'_{pl} + \sum_{l=1}^{L - L_k} t'_{nl} + \epsilon

E(T_{total}) = \sum_{k} \Big( \sum_{l} E(t'_{pl}) + \sum_{l} E(t'_{nl}) \Big) = \sum_{k} \big( L_k\, t_p + (L - L_k)\, t_n \big)
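As with tagging, the browsing model evaluates directly: every image must be judged per keyword, with relevant and irrelevant judgments priced separately. A minimal sketch (parameter values hypothetical):

```python
def expected_browsing_time(relevant_counts, L, t_p, t_n):
    """E[T_total] = sum over keywords k of (L_k * t_p + (L - L_k) * t_n).

    relevant_counts: list of relevant-image counts L_k, one per keyword
    L: total number of images in the collection
    t_p / t_n: expected judging time for a relevant / irrelevant image
    """
    return sum(L_k * t_p + (L - L_k) * t_n for L_k in relevant_counts)
```

The (L - L_k) * t_n term is the fixed scanning cost browsing always pays, which is why the slides recommend it only for frequent keywords.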
Validation of Browsing Time Model
• User study on TRECVID’05 development data
  – Three users manually browsed images for 15 minutes (covering 25 keywords)
  – If the model is correct, a linear fit should appear in the results
  – The annotation results fit the model for all users
Video Retrieval as Hybrid Annotation
• Jointly annotates all topics at the same time
• Switches between the tagging and browsing interfaces to minimize overall annotation time
  – Formally models annotation time as a function of word frequency, time per word, and annotation interface
  – Online machine learning selects images, topics, and interfaces based on the annotation models
• More details will be released in the final notebook paper
  – See related analysis in R. Yan et al. [ACM MM 2007 Workshop on Many Faces of Multimedia]
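Ignoring setup time and the online-learning machinery, the core tagging-vs-browsing trade-off per keyword reduces to comparing the two expected costs. This simplified rule is an illustration, not the system's actual selection algorithm, but it reproduces the "rare keywords → tagging, frequent keywords → browsing" behavior from the earlier slides:

```python
def cheaper_interface(L, L_k, t_f, t_p, t_n):
    """Per-keyword cost comparison (setup time ignored):
    tagging touches only the L_k relevant images, but typing a word is slow;
    browsing must judge all L images, but each judgment is fast."""
    tagging_cost = L_k * t_f
    browsing_cost = L_k * t_p + (L - L_k) * t_n
    return "tagging" if tagging_cost < browsing_cost else "browsing"
```

With plausible timings (typing much slower than a browsing judgment), a rare topic avoids browsing's fixed cost of scanning the whole collection, while a frequent topic amortizes that scan across many fast positive judgments.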
TRECVID-2007 Performance Analysis
• The proposed approach allows users to annotate 60% of the image–topic pairs, compared with ~10% for simple browsing
• Balance between tagging and browsing: 1,529 retrieved shots from tagging, 797 retrieved shots from browsing
• Simple temporal expansion can improve MAP from 0.35 to 0.43
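The slides do not spell out the temporal expansion step; a plausible reading is that shots temporally adjacent to each retrieved shot are added to the result list. The sketch below illustrates that interpretation only, with shot ids assumed to be (video, index) pairs:

```python
def temporal_expansion(ranked_shots, window=1):
    """Expand a ranked shot list with temporal neighbors (within +/- window
    shots in the same video), preserving rank order and skipping duplicates."""
    out, seen = [], set()
    for video, idx in ranked_shots:
        for d in range(-window, window + 1):
            cand = (video, idx + d)
            if idx + d >= 0 and cand not in seen:
                seen.add(cand)
                out.append(cand)
    return out
```

The intuition is that relevance is temporally correlated: neighbors of an annotated relevant shot are likely relevant too, so expansion recovers them at little cost.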
[Bar chart: Mean Average Precision of interactive search runs (e.g. I_A_EER_Expand, I_A_2_JW-EER_1, I_C_2_VGG_I_1_1, I_B_2_A-MM_1, I_A_2_CT_1, and others), MAP axis from 0.00 to 0.50.]
TRECVID-2007 Per-Query Performance Analysis
• Good: “telephone”, “interview”, “mountains”
• Not as good: “canal”, “bicycle”, “classroom”, “dog”
• Highest AP scores on 10 of 24 topics
[Bar chart: per-topic Average Precision for interactive search, best IBM run vs. best non-IBM run, over all 24 topics (AP axis from 0.00 to 1.00).]
Potential Improvement
• Search beyond the keyframe (look at the I-frames)
• Search beyond the I-frame (leverage text info or temporal context)
• Better online learning algorithms to find more shots in the same amount of time (include temporal information)
[Example result panel: Finding Sheep/Goats]