TRECVID 2016: Video to Text Description
TRANSCRIPT
TRECVID 2016
Video to Text Description
NEW Showcase / Pilot Task(s)
Alan Smeaton, DCU
Marc Ritter, TUC
George Awad, NIST; Dakota Consulting, Inc.
TRECVID 2016
Goals and Motivations
Measure how well an automatic system can describe a video in
natural language.
Measure how well an automatic system can match high-level
textual descriptions to low-level computer vision features.
Transfer successful image captioning technology to the video
domain.
Real world Applications
Video summarization
Supporting search and browsing
Accessibility - video description to the blind
Video event prediction
• Given: 2,000 URLs of Twitter Vine videos, and
2 sets (A and B) of text descriptions for each of the 2,000 videos.
• Systems are asked to submit results for two
subtasks:
1. Matching & Ranking:
Return for each URL a ranked list of the most likely text descriptions from
each of sets A and B.
2. Description Generation:
Automatically generate a text description for each URL.
TASK
Video Dataset
• Crawled 30k+ Twitter vine video URLs.
• Max video duration == 6 sec.
• A subset of 2,000 URLs randomly selected.
• Marc Ritter’s TUC Chemnitz group supported manual annotations:
• Each video annotated by 2 persons (A and B).
• In total 4,000 textual descriptions (1 sentence each) were produced.
• Annotation guidelines by NIST:
• For each video, annotators were asked to combine 4 facets where applicable:
• Who is the video describing? (objects, persons, animals, etc.)
• What are the objects and beings doing? (actions, states, events, etc.)
• Where? (locale, site, place, geographic, etc.)
• When? (e.g., time of day, season, etc.)
Annotation Process Obstacles
• Bad video quality
• A lot of simple scenes/events with repeating plain descriptions
• A lot of complex scenes containing too many events to be described
• Clips sometimes appear too short for a convenient description
• Audio track relevant for description but not used, to avoid semantic distractions
• Non-English text overlays/subtitles hard to understand
• Cultural differences in reception of events/scene content
• Finding a neutral scene description appears to be a challenging task
• Well-known people in videos may have influenced (inappropriately) the description of scenes
• Specifying daytime (frequently) impossible for indoor shots
• Description quality suffers from long annotation hours
• Some offline vines were detected
• A lot of vines with redundant or even identical content
Annotation UI Overview
Annotation Process
Annotation Statistics
| UID | # annotations | Ø (sec) | max (sec) | min (sec) | total time (hh:mm:ss) |
|-----|---------------|---------|-----------|-----------|-----------------------|
| 0 | 700 | 62.16 | 239.00 | 40.00 | 12:06:12 |
| 1 | 500 | 84.00 | 455.00 | 13.00 | 11:40:04 |
| 2 | 500 | 56.84 | 499.00 | 09.00 | 07:53:38 |
| 3 | 500 | 81.12 | 491.00 | 12.00 | 11:16:00 |
| 4 | 500 | 234.62 | 499.00 | 33.00 | 32:35:09 |
| 5 | 500 | 165.38 | 493.00 | 30.00 | 22:58:12 |
| 6 | 500 | 57.06 | 333.00 | 10.00 | 07:55:32 |
| 7 | 500 | 64.11 | 495.00 | 12.00 | 08:54:15 |
| 8 | 200 | 82.14 | 552.00 | 68.00 | 04:33:47 |
| total | 4400 | 98.60 | 552.00 | 09.00 | 119:52:49 |
Samples of captions
| A | B |
|---|---|
| a dog jumping onto a couch | a dog runs against a couch indoors at daytime |
| in the daytime, a driver let the steering wheel of car and slip on the slide above his car in the street | on a car on a street the driver climb out of his moving car and use the slide on cargo area of the car |
| an asian woman turns her head | an asian young woman is yelling at another one that poses to the camera |
| a woman sings outdoors | a woman walks through a floor at daytime |
| a person floating in a wind tunnel | a person dances in the air in a wind tunnel |
Run Submissions & Evaluation Metrics
• Up to 4 runs per set (for A and for B) were allowed
in the Matching & Ranking subtask.
• Up to 4 runs in the Description Generation subtask.
• Mean inverted rank was used to score the Matching &
Ranking subtask.
• Machine Translation metrics including BLEU and
METEOR were used to score the Description
Generation subtask.
• An experimental “Semantic Similarity” metric (STS)
was also tested.
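The Matching & Ranking score above can be sketched as follows; `mean_inverted_rank` is a hypothetical helper for illustration, not the official NIST scoring tool:

```python
def mean_inverted_rank(ranks):
    """Mean of 1/rank over all test videos, where rank is the position
    at which the correct caption appears in a run's ranked list.
    A perfect run (correct caption always at rank 1) scores 1.0."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g., correct captions found at ranks 1, 2, and 4 for three videos
score = mean_inverted_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3
```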
BLEU and METEOR
• BLEU [0..1] (bilingual evaluation understudy), used in MT
to evaluate the quality of text … approximates human
judgement at a corpus level
• Measures the fraction of N-grams (up to 4-grams) in
common between source and target
• N-gram matches for a high N (e.g., 4) rarely occur at
sentence level, hence the poor performance of BLEU@N
when comparing only individual sentences; it is better
suited to comparing paragraphs or larger units
• Often we see B@1, B@2, B@3, B@4 … we use B
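A minimal sketch of the clipped n-gram precision at the heart of BLEU, assuming a single reference and omitting the brevity penalty (the real metric geometrically averages N = 1..4 and is computed at corpus level):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision for one sentence pair: each candidate
    n-gram may be credited at most as often as it occurs in the
    reference. One component of BLEU, not the full metric."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())
```

Note how quickly the score collapses as n grows on single sentences: "a dog runs" vs. "a dog sits" gives 2/3 at n = 1 but only 1/2 at n = 2, which is why BLEU@4 behaves poorly on individual captions.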
METEOR
• METEOR (Metric for Evaluation of Translation with Explicit
Ordering)
• Computes unigram precision and recall, extending exact
word matches to include similar words based on WordNet
synonyms and stemmed tokens
• Based on the harmonic mean of unigram precision and
recall, with recall weighted higher than precision
• This is an active area … CIDEr is another recent metric …
no universally agreed metric(s)
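The recall-weighted harmonic mean used by METEOR can be sketched as below; the classic formulation weights recall 9:1 over precision, and the alignment, synonym, and stemming machinery is omitted here:

```python
def meteor_fmean(precision, recall, alpha=0.9):
    """Weighted harmonic mean of unigram precision and recall with
    recall weighted higher than precision, as in METEOR's F-mean
    (equivalent to the classic 10PR / (R + 9P) when alpha = 0.9)."""
    if precision == 0 or recall == 0:
        return 0.0
    return precision * recall / (alpha * precision + (1 - alpha) * recall)
```

With this weighting, a system with recall 1.0 and precision 0.5 scores noticeably higher than one with precision 1.0 and recall 0.5, which matches the slide's point that recall matters more.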
UMBC STS measure [0..1]
• We’re exploring STS – based on distributional similarity
and Latent Semantic Analysis (LSA) … complemented
with semantic relations extracted from WordNet
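As a crude illustration only, a bag-of-words cosine yields a [0..1] similarity score between two captions; the actual UMBC STS service combines LSA word vectors with WordNet relations, which goes well beyond this lexical sketch:

```python
import math
from collections import Counter

def cosine_sim(s1, s2):
    """Cosine similarity over bag-of-words counts: a purely lexical
    stand-in for semantic similarity. Unlike STS, it gives 0 for
    synonyms ('couch' vs. 'sofa') because it knows no semantics."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0
```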
Participants (7 out of 11 teams finished)
Matching & Ranking Description Generation
DCU
INF(ormedia)
Mediamill (AMS)
NII (Japan + Vietnam)
Sheffield_UETLahore
VIREO (CUHK)
Etter Solutions
Total of 46 runs (Matching & Ranking); total of 16 runs (Description Generation)
Task 1: Matching & Ranking
Person reading newspaper outdoors at daytime
Three men running in the street at daytime
Person playing golf outdoors in the field
Two men looking at laptop in an office
× 2,000 videos … × 2,000 type A descriptions … and … × 2,000 type B descriptions
Matching & Ranking results by run

[Bar chart: Mean Inverted Rank (y-axis, 0 to 0.12) per submitted run (x-axis), in descending order of score, beginning with MediaMill set B runs, then MediaMill set A, Vireo, Etter, DCU, INF(ormedia), NII, and Sheffield_UETLahore runs. Legend: MediaMill, Vireo, Etter, DCU, INF(ormedia), NII, Sheffield.]
Matching & Ranking results by run

[Same bar chart of Mean Inverted Rank per submitted run, annotated: ‘B’ runs (colored per team) seem to be doing better than ‘A’. Legend: MediaMill, Vireo, Etter, DCU, INF(ormedia), NII, Sheffield.]
Runs vs. matches
[Histogram: number of matches not found by runs (y-axis, 0 to 900) vs. number of runs that missed a match (x-axis, 1 to 10).]
All matches were found by different runs.
5 runs didn’t find any of 805 matches.
Matched ranks frequency across all runs
[Two histograms: number of matches (y-axis, 0 to 800) by matched rank (x-axis, 1 to 100), one for set ‘B’ and one for set ‘A’.] Very similar rank distribution.
Videos vs. Ranks
[Plot: rank (log scale, 1 to 10000) across runs for the top 10 ranked & matched videos (set A). Video IDs: 626, 1816, 1339, 1244, 1006, 527, 1201, 1387, 1271, 324.]
Videos vs. Ranks
[Plot: rank (log scale, 1 to 1000) across runs for the top 3 ranked & matched videos (set A). Video IDs: 1387 (Top 3), 1271 (Top 2), 324 (Top 1).]
Samples of top 3 results (set A)
#324: a dog is licking its nose
#1271: a woman and a man are kissing each other
#1387: a dog imitating a baby by crawling on the floor in a living room
Videos vs. Ranks
[Plot: rank (log scale, 1 to 10000) across runs for the bottom 10 ranked & matched videos (set A). Video IDs: 220, 732, 1171, 481, 1124, 579, 754, 443, 1309, 1090.]
Videos vs. Ranks
[Plot: rank (log scale, 1 to 10000) across runs for the bottom 3 ranked & matched videos (set A). Video IDs: 220, 732, 1171.]
Samples of bottom 3 results (set A)
#1171: 3 balls hover in front of a man
#220: 2 soccer players are playing rock-paper-scissors on a soccer field
#732: a person wearing a costume and holding a chainsaw
Videos vs. Ranks
[Plot: rank (log scale, 1 to 10000) across runs for the top 10 ranked & matched videos (set B). Video IDs: 1128, 40, 374, 752, 955, 777, 1366, 1747, 387, 761.]
Videos vs. Ranks
[Plot: rank (log scale, 1 to 10000) across runs for the top 3 ranked & matched videos (set B). Video IDs: 1747, 387, 761.]
Samples of top 3 results (set B)
#761: White guy playing the guitar in a room
#387: An Asian young man sitting is eating something yellow
#1747: a man sitting in a room is giving baby something to drink and it starts laughing
Videos vs. Ranks
[Plot: rank (log scale, 1 to 10000) across runs for the bottom 10 ranked & matched videos (set B). Video IDs: 1460, 674, 79, 345, 1475, 605, 665, 414, 1060, 144.]
Videos vs. Ranks
[Plot: rank (log scale, 1 to 10000) across runs for the bottom 3 ranked & matched videos (set B). Video IDs: 414, 1060, 144.]
Samples of bottom 3 results (set B)
#144: A man touches his chin in a tv show
#1060: A man piggybacking another man outdoors
#414: a woman is following a man walking on the street at daytime trying to talk with him
Lessons Learned?
• Can we say something about A vs. B?
• At the top end we’re not so bad … the best results can find
the correct caption within almost the top 1% of the ranking
Task 2: Description Generation
“a dog is licking its nose”
Given a video
Generate a textual description
Metrics:
• Popular MT measures: BLEU, METEOR
• Semantic similarity measure (STS)
• All runs and GT were normalized (lowercase, punctuation, stop words, stemming) before evaluation by MT metrics (except STS)
Who? What? Where? When?
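The normalization steps above might look roughly like this sketch; the toy stop list and the trailing-‘s’ “stemmer” are stand-ins for illustration, not the actual evaluation pipeline:

```python
import re

STOP = {"a", "an", "the", "is", "are", "at", "in", "on", "and", "to"}  # toy stop list

def normalize(caption):
    """Toy caption normalization: lowercase, strip punctuation, drop
    stop words, and crudely 'stem' by trimming a trailing 's'. The
    real pipeline's stop list and stemmer are not specified here."""
    caption = caption.lower()
    caption = re.sub(r"[^\w\s]", "", caption)  # remove punctuation
    tokens = [t for t in caption.split() if t not in STOP]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
```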
BLEU results
[Bar chart: overall BLEU score per system (0 to 0.025). Systems: INF(ormedia), Sheffield, NII, MediaMill, DCU.]
BLEU stats sorted by median value
[Chart: min, max, and median BLEU score across 2,000 videos per run (scale 0 to 1.2), sorted by median.]
METEOR results
[Bar chart: overall METEOR score per system (0 to 0.3). Systems: INF(ormedia), Sheffield, NII, MediaMill, DCU.]
METEOR stats sorted by median value
[Chart: min, max, and median METEOR score across 2,000 videos per run (scale 0 to 1.2), sorted by median.]
Semantic Similarity (STS) sorted by
median value
[Chart: min, max, and median STS score across 2,000 videos per run (scale 0 to 1.2), sorted by median. ‘A’ runs seem to be doing better than ‘B’. Highlighted runs: Mediamill(A), INF(A), Sheffield_UET(A), NII(A), DCU(A).]
STS(A, B) Sorted by STS value
An example from run submissions – 7 unique examples
1. a girl is playing with a baby
2. a little girl is playing with a dog
3. a man is playing with a woman in a room
4. a woman is playing with a baby
5. a man is playing a video game and singing
6. a man is talking to a car
7. A toddler and a dog
Participants
• High level descriptions of what groups did from their
papers … more details on posters
Participant: DCU
Task A: Caption Matching
• Preprocess 10 frames/video to detect 1,000 objects
(VGG-16 CNN trained on ImageNet), 94 crowd behaviour
concepts (WWW dataset), and locations (Places2 dataset on
VGG-16)
• 4 runs, baseline BM25, Word2vec, and fusion
Task B: Caption Generation
• Train on MS-COCO using NeuralTalk2, an RNN
• One caption per keyframe, captions then fused
Participant: Informedia
Focus on the generalization ability of caption models, ignoring
the Who, What, Where, When facets
Trained 4 caption models on 3 datasets (MS-COCO, MS-
VD, MSR-VTT), achieving state-of-the-art results with models based on
VGGNet concepts and a Hierarchical Recurrent Neural
Encoder for temporal aspects
Task B: Caption Generation
• Results explore transferring these models to TRECVid-VTT
Participant: MediaMill
Task A: Caption Matching
Task B: Caption Generation
Participant: NII
Task A: Caption Matching
• 3DCNN for video representation trained on MSR-VTT +
1,970 YouTube2Text + 1M captioned images
• 4 run variants submitted, concluding the approach did not
generalise well on test set and suffers from over-fitting
Task B: Caption Generation
• Trained on 6,500 videos from MSR-VTT dataset
• Confirmed that multimodal feature fusion works best, with
audio features surprisingly good
Participant: Sheffield / Lahore
Task A: Caption Matching
• Submitted some runs
Task B: Caption Generation
• Identified a variety of high level concepts for frames
• Detect and recognise faces, age and gender, emotion,
objects, and (human) actions
• Varied the frequency of frames for each type of
recognition
• Runs based on combinations of feature types
Participant: VIREO (CUHK)
Adopted their zero-example MED system in reverse
Used a concept bank of 2,000 concepts trained on MSR-
VTT, Flickr30k, MS-COCO and TGIF datasets
Task A: Caption Matching
• 4(+4) runs testing traditional concept-based approach vs
attention-based deep models, finding deep models
perform better, motion features dominate performance
Participant: Etter Solutions
Task A: Caption Matching
• Focused on concepts for Who, What, When, Where
• Used a subset of ImageNet plus scene categories from
the Places database
• Applied concept detection at 1 fps with a sliding window, mapped
this to a “document” vector, and calculated a similarity score
Observations
• Good participation and a good finishing %; ‘B’ runs did better than ‘A’ in Matching &
Ranking, while ‘A’ did better than ‘B’ in semantic similarity.
• METEOR scores are higher than BLEU; we should have used CIDEr also (some
participants did)
• STS as a metric raises some questions, making us ask what makes more sense:
MT metrics or semantic similarity? Which metric measures real system performance in a
realistic application?
• Lots of available training sets, some overlapping … MSR-VTT, MS-COCO, Places2,
ImageNet, YouTube2Text, MS-VD … some annotated with AMT (MSR-VTT-10k has
10,000 videos, 41.2 hours, and 20 annotations each!)
• What did individual teams learn?
• Do we need more reference (GT) sets? (good for MT metrics)
• Should we run again as a pilot? How many videos to annotate, and how many
annotations on each?
• Only some systems applied the 4-facet description in their submissions
Observations
• There are other video-to-caption challenges, like the ACM
MULTIMEDIA 2016 Grand Challenges
• Images from YFCC100M with captions, in a caption-
matching/prediction task for 36,884 test images. The majority
of participants used CNNs and RNNs
• Video: MSR-VTT with 41.2h, 10,000 clips, each with x20
AMT captions … evaluation measures BLEU, METEOR,
CIDEr and ROUGE-L … GC results do not get aggregated
and dissipate at the ACM MM Conference, so they are hard to
gauge.