MediaEval 2015 - Multimodal Person Discovery in Broadcast TV


Page 1: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Multimodal Person Discovery in Broadcast TV

Johann Poignant / Hervé Bredin / Claude Barras

2015

[email protected] herve.niderb.fr @hbredin

Page 2: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Outline

• Motivations

• Definition of the task

• Datasets

• Baseline & Metadata

• Evaluation protocol

• Organization

• Results

• Conclusion


Page 3: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Motivations


Page 4: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

"the usual suspects"

• Huge TV archives are useless if not searchable

• People ❤ people

• Need for person-based indexes


Page 5: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

REPERE

• Evaluation campaigns in 2012, 2013 and 2014
  Three French consortia funded by ANR

• Multimodal people recognition in TV documents
  "who speaks when?" and "who appears when?"

• Led to significant progress in both supervised and unsupervised multimodal person recognition


Page 6: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

From REPERE to Person Discovery

• Speaking faces
  Focus on "person of interest"

• Unsupervised approaches
  People may not be "famous" at indexing time

• Evidence
  Archivist/journalist use case


Page 7: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Definition of the task


Page 8: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Input


Broadcast TV pre-segmented into shots.

Page 9: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Speaking face


Tag each shot with the names of people both speaking and appearing at the same time
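One way to picture the expected output is a list of per-shot hypotheses, each carrying a confidence and an evidence pointer. The structure below is purely illustrative: field names and example values are assumptions, not the official submission format.

```python
from dataclasses import dataclass

@dataclass
class ShotHypothesis:
    """One hypothesized speaking face in a shot (illustrative structure only)."""
    video_id: str          # identifier of the video
    shot_id: str           # identifier of the pre-segmented shot
    person_name: str       # normalized as firstname_lastname
    confidence: float      # used later to rank returned shots
    evidence_shot_id: str  # shot proving that the person holds this name

# a run is simply a list of such hypotheses
run = [
    ShotHypothesis("video_001", "shot_0042", "francois_hollande", 0.87, "shot_0040"),
    ShotHypothesis("video_001", "shot_0043", "claude_barras", 0.61, "shot_0043"),
]
```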

Page 10: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Person discovery

• Prior biometric models are not allowed.

• Person names must be discovered automatically in text overlay or speech utterances.

unsupervised approaches only


Page 11: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Evidence


Associate each name with a unique shot proving that the person actually holds this name.

Page 12: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Evidence (cont.)


an image evidence is a shot during which a person is visible and their name is written on screen

an audio evidence is a shot during which a person is visible and their name is pronounced at least once during a [shot start time - 5s, shot end time + 5s] neighborhood

[Figure: shot #1 is an image evidence for A; shot #3 is an audio evidence for B]
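As a rough illustration of these two rules, the sketch below checks them on toy inputs; the shot boundaries, visibility flags and name-pronunciation timestamps are hypothetical placeholders for whatever a participant's pipeline produces.

```python
def is_image_evidence(person_visible: bool, name_written_on_screen: bool) -> bool:
    """Image evidence: the person is visible and their name is written on screen."""
    return person_visible and name_written_on_screen

def is_audio_evidence(shot_start: float, shot_end: float,
                      person_visible: bool,
                      name_pronounced_at: list[float],
                      margin: float = 5.0) -> bool:
    """Audio evidence: the person is visible during the shot and their name is
    pronounced at least once within [shot_start - margin, shot_end + margin]."""
    if not person_visible:
        return False
    return any(shot_start - margin <= t <= shot_end + margin
               for t in name_pronounced_at)

# example: name pronounced 3 seconds before the shot starts -> still an evidence
print(is_audio_evidence(120.0, 128.5, True, [117.0]))  # True
```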

Page 13: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Datasets


Page 14: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Datasets: DEV | REPERE vs. TEST | INA

DEV | REPERE
• 137 hours
• two French TV channels, eight different types
• dense audio annotations: speaker diarization, speech transcription
• sparse video annotations: face detection & recognition, optical character recognition

TEST | INA
• 106 hours (172 videos)
• only one French TV channel, only one type (news)
• no prior annotation
• a posteriori collaborative annotation

Page 15: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Datasets: TEST | INA


106 hours (172 videos)

only one French TV channel only one type (news)

no prior annotation

a posteriori collaborative annotation

http://dataset.ina.fr

Page 16: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Evaluation protocol


Page 17: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Information retrieval task

• Queries formatted as firstname_lastname

e.g. francois_hollande "return all shots where François Hollande is speaking and visible at the same time"

• Approximate search among submitted names e.g. francois_holande

• Select shots tagged with the most similar name
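A minimal sketch of that approximate matching step, assuming a plain normalized edit distance as the similarity ratio (the official scorer's exact Levenshtein-ratio definition may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def best_match(query: str, hypothesized_names: list[str]) -> tuple[str, float]:
    """Pick the submitted name closest to the query (n_q and its ratio rho_q)."""
    return max(((n, ratio(query, n)) for n in hypothesized_names),
               key=lambda pair: pair[1])

print(best_match("francois_hollande", ["francois_holande", "nicolas_sarkozy"]))
# ('francois_holande', 0.94...)
```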


Page 18: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Evidence-weighted MAP


To ensure that participants followed this strict "no biometric supervision" constraint, each hypothesized name $n \in N$ had to be backed up by a carefully selected and unique shot proving that the person actually holds this name $n$: we call this an evidence and denote it by $E : N \mapsto S$. In real-world conditions, this evidence would help a human annotator double-check the automatically generated index, even for people they did not know beforehand.

Two types of evidence were allowed: an image evidence is a shot during which a person is visible, and their name is written on screen; an audio evidence is a shot during which a person is visible, and their name is pronounced at least once during a [shot start time - 5s, shot end time + 5s] neighborhood. For instance, in Figure 1, shot #1 is an image evidence for Mr A (because his name and his face are visible simultaneously on screen) while shot #3 is an audio evidence for Mrs B (because her name is pronounced less than 5 seconds before or after her face is visible on screen).

3. DATASETS

The REPERE corpus – distributed by ELDA – served as the development set. It is composed of various TV shows (around news, politics and people) from two French TV channels, for a total of 137 hours. A subset of 50 hours is manually annotated. Audio annotations are dense and provide speech transcripts and identity-labeled speech turns. Video annotations are sparse (one image every 10 seconds) and provide overlaid text transcripts and identity-labeled face segmentation. Both speech and overlaid text transcripts are tagged with named entities. The test set – distributed by INA – contains 106 hours of video, corresponding to 172 editions of the evening broadcast news "Le 20 heures" of the French public channel "France 2", from January 1st 2007 to June 30th 2007.

As the test set came completely free of any annotation, it was annotated a posteriori based on participants' submissions. In the following, task groundtruths are denoted by the function $L : S \mapsto \mathcal{P}(N)$ that maps each shot $s$ to the set of names of every speaking face it contains, and the function $\mathcal{E} : S \mapsto \mathcal{P}(N)$ that maps each shot $s$ to the set of person names for which it actually is an evidence.

4. BASELINE AND METADATA

This task targeted researchers from several communities including multimedia, computer vision, speech and natural language processing. Though the task was multimodal by design and necessitated expertise in various domains, the technological barrier to entry was lowered by the provision of a baseline system described in Figure 2 and available as open-source software¹. For instance, a researcher from the speech processing community could focus their research efforts on improving speaker diarization and automatic speech transcription, while still being able to rely on the provided face detection and tracking results to participate in the task.

The audio stream was segmented into speech turns, while faces were detected and tracked in the visual stream. Speech turns (resp. face tracks) were then compared and clustered based on MFCC and the Bayesian Information Criterion [10] (resp. HOG [11] and Logistic Discriminant Metric Learning [17] on facial landmarks [31]). The approach proposed in [27] was also used to compute a probabilistic mapping between co-occurring faces and speech turns. Written (resp. pronounced) person names were automatically extracted from the visual stream (resp. the audio stream) using the open source LOOV Optical Character Recognition [24] (resp. Automatic Speech Recognition [21, 12]) followed by Named Entity detection (NE). The fusion module was a two-step algorithm: propagation of written names onto speaker clusters [26] followed by propagation of speaker names onto co-occurring speaking faces.

¹ http://github.com/MediaEvalPersonDiscovery

Figure 2: Multimodal baseline pipeline. Output of greyed-out modules is provided to the participants.
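For readers less familiar with this kind of name propagation, here is a deliberately simplified sketch of the two-step idea (written names onto speaker clusters, then speaker names onto co-occurring speaking faces); the dictionaries and the majority vote below are illustrative placeholders, not the actual baseline implementation.

```python
from collections import Counter, defaultdict

def propagate_names(written_names, speaker_of_segment, speaking_face_of_segment):
    """Toy two-step fusion.
    written_names: segment id -> name read on screen (OCR + named entity detection)
    speaker_of_segment: segment id -> speaker cluster id (diarization)
    speaking_face_of_segment: segment id -> speaking face track id
    """
    # step 1: assign to each speaker cluster the written name it co-occurs with most often
    votes = defaultdict(Counter)
    for segment, name in written_names.items():
        speaker = speaker_of_segment.get(segment)
        if speaker is not None:
            votes[speaker][name] += 1
    speaker_name = {spk: counts.most_common(1)[0][0] for spk, counts in votes.items()}

    # step 2: propagate each speaker name onto the speaking face co-occurring in the same segment
    face_name = {}
    for segment, face in speaking_face_of_segment.items():
        spk = speaker_of_segment.get(segment)
        if spk in speaker_name:
            face_name[face] = speaker_name[spk]
    return speaker_name, face_name
```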

5. EVALUATION METRIC

This information retrieval task was evaluated using a variant of Mean Average Precision (MAP) that took the quality of evidences into account. For each query $q \in Q \subset N$ (firstname_lastname), the hypothesized person name $n_q$ with the highest Levenshtein ratio $\rho$ to the query $q$ is selected ($\rho : N \times N \mapsto [0, 1]$) – allowing approximate name transcription:

$$n_q = \underset{n \in N}{\arg\max}\ \rho(q, n) \qquad \text{and} \qquad \rho_q = \rho(q, n_q)$$

Average precision AP(q) is then computed classically based on relevant and returned shots:

$$\text{relevant}(q) = \{s \in S \mid q \in L(s)\} \qquad \text{returned}(q) = \{s \in S \mid n_q \in L(s)\}\ \text{sorted by confidence}$$

The proposed evidence is correct if name $n_q$ is close enough to the query $q$ and if shot $E(n_q)$ actually is an evidence for $q$:

$$C(q) = \begin{cases} 1 & \text{if } \rho_q > 0.95 \text{ and } q \in \mathcal{E}(E(n_q)) \\ 0 & \text{otherwise} \end{cases}$$

To ensure participants do provide correct evidences for every hypothesized name $n \in N$, standard MAP is altered into EwMAP (Evidence-weighted Mean Average Precision), the official metric for the task:

$$\text{EwMAP} = \frac{1}{|Q|} \sum_{q \in Q} C(q) \cdot \text{AP}(q)$$

Acknowledgment. This work was supported by the French National Agency for Research under grant ANR-12-CHRI-0006-01. The open source CAMOMILE collaborative annotation platform² was used extensively throughout the progress of the task: from the run submission script to the automated leaderboard, including a posteriori collaborative annotation of the test corpus. We thank ELDA and INA for supporting the task by distributing the development and test datasets.

² http://github.com/camomile-project

$$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$$

C(q) measures the correctness of the provided evidence for query q.
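A compact sketch of how EwMAP could be computed from these definitions; AP is computed in the standard way over confidence-ranked shots, and the 0/1 evidence-correctness term C(q) is assumed to be precomputed (the official scorer may differ in implementation details).

```python
def average_precision(returned, relevant):
    """returned: shot ids sorted by decreasing confidence; relevant: set of shot ids."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, shot in enumerate(returned, 1):
        if shot in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant)

def ewmap(queries, returned, relevant, evidence_correct):
    """EwMAP = (1/|Q|) * sum over q of C(q) * AP(q).
    returned[q]: ranked shots for query q; relevant[q]: ground-truth shots for q;
    evidence_correct[q]: the 0/1 C(q) term (rho_q > 0.95 and valid evidence shot)."""
    return sum(evidence_correct[q] * average_precision(returned[q], relevant[q])
               for q in queries) / len(queries)
```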

Page 19: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Baseline & Metadata


Page 20: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

the "multi" in multimediaTask necessitates expertise in various domains:• multimedia• computer vision• speech processing• natural language processing


The technological barrier to entry is lowered by the provision of a baseline system.

github.com/MediaevalPersonDiscoveryTask

Page 21: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Baseline fusion


Page 22: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Was the baseline useful?

[Table: for each of the 9 participating teams and each module (face tracking, face clustering, speaker diarization, optical character recognition, automatic speech recognition, speaking face detection, fusion), whether the team relied on the baseline module, developed their own module, or tried both]

Page 23: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

[Same table, annotated with three labels: "focus on monomodal components", "baseline", "focus on multimodal fusion"]

Page 24: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Organization


Page 25: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Schedule
01.05  development set release
01.06  test set release
01.07  "out-domain" submission deadline
01.07 – 08.07  leaderboard updated every 6 hours
08.07  "in-domain" submission deadline
09.07 – 28.07  collaborative annotation
28.07  test set annotation release
28.07 – 28.08  adjudication


Page 26: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Leaderboard


computed on a secret subset of the test set

updated every 6 hours

private leaderboard:
• participants know how they rank
• participants know their score but do not know the score of others

Page 27: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Collaborative annotation


Page 28: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Thanks!


around 50 hours in total

Page 29: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Thanks!


half of the test set ✔

Page 30: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

CAMOMILE collaborative annotation platform
github.com/camomile-project

objectives
• provide an architecture for effective creation and sharing of annotations of multimedia, multimodal and multilingual data
• enable novel collaborative and interactive annotation tools
• be generic enough to support as many use cases as possible
• open source

[Architecture diagram: one server and several clients exchanging JSON; "bring your own client"]

data model
• corpus: homogeneous collection of multimedia documents
• medium: multimedia document (audio, video, image, text)
• layer: homogeneous collection of annotations
• annotation: medium fragment (e.g. a time span) with attached metadata (categorical value, numerical value, free text, raw content)

REST API
• GET /corpus/:idcorpus (obtain information about a corpus)
• POST /corpus/:idcorpus/medium (add a medium to a corpus)
• PUT /layer/:idlayer/annotation (update an annotation)
• DEL /annotation/:idannotation (delete an annotation)
• and more...

server features: user authentication, permissions, annotation history, annotation queue

online documentation: http://camomile-project.github.io/camomile-server
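To make the REST API above concrete, here is a hedged sketch of how a client might call those four routes using the Python requests library; the server URL, identifiers, payload fields and the skipped authentication step are illustrative assumptions rather than the documented CAMOMILE API (see the online documentation for the real interface).

```python
import requests

BASE = "https://camomile.example.org"  # hypothetical server URL (placeholder)
session = requests.Session()           # the real server requires user authentication (not shown)

corpus_id, layer_id, annotation_id = "corpusA", "layer1", "anno42"  # placeholder ids

# GET /corpus/:idcorpus -- obtain information about a corpus
info = session.get(f"{BASE}/corpus/{corpus_id}").json()

# POST /corpus/:idcorpus/medium -- add a medium (e.g. one video) to a corpus
medium = session.post(f"{BASE}/corpus/{corpus_id}/medium",
                      json={"name": "le20heures_2007-01-01"}).json()

# PUT /layer/:idlayer/annotation -- update an annotation in a layer
session.put(f"{BASE}/layer/{layer_id}/annotation",
            json={"fragment": {"start": 12.3, "end": 45.6},
                  "data": "francois_hollande"})

# DEL /annotation/:idannotation -- delete an annotation
session.delete(f"{BASE}/annotation/{annotation_id}")
```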

Page 31: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Results


Page 32: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Stats

• 14 (+ organizers) registered participants received dev and test data

• 8 (+ organizers) participants submitted 70 runs

• 7 (+ organizers) submitted a working note paper

• 7 (+ organizers) attended the workshop


Page 33: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV


out-domain results

28k shots, 2070 queries (1642 vs. 428)

EwMAP (%) for PRIMARY runs

Page 34: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV


out-domain results

28k shots, 2070 queries (1642 vs. 428)

EwMAP (%) for PRIMARY runs

Page 35: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

in-domain results


EwMAP (%) for PRIMARY runs

Page 36: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

in-domain results


EwMAP (%) for PRIMARY runs

Page 37: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV


out- vs. in-domain

Page 38: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

# people that no other team discovered

[Bar chart: number of such people per primary run; annotations: "speech transcripts?", "anchors"]

EwMAP (%) for PRIMARY runs

Page 39: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Conclusion

• Had a great (and exhausting) time organizing this task

• The winning submission DID NOT make any use of the face modality.

• (Almost) nobody used ASR
  Most people are introduced by overlaid names in the test set

• No cross-show approaches

• Poster session later today at 16:00
  Technical retreat tomorrow morning at 9:15


Page 40: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Next (oral)

• Mateusz Budnik / LIG at MediaEval 2015 Multimodal Person Discovery in Broadcast TV Task

• Meriem Bendris / PERCOLATTE: a multimodal person discovery system in TV broadcast for the MediaEval 2015 evaluation campaign

• Rosalia Barros / GTM-UVigo Systems for Person Discovery Task at MediaEval 2015


Page 41: MediaEval 2015 - Multimodal Person Discovery in Broadcast TV

Next (poster)

• Claude Barras / Multimodal Person Discovery in Broadcast TV at MediaEval 2015

• Johann Poignant / LIMSI at MediaEval 2015: Person Discovery in Broadcast TV Task

• Paula Lopez Otero / GTM-UVigo Systems for Person Discovery Task at MediaEval 2015

• Javier Hernando / UPC System for the 2015 MediaEval Multimodal Person Discovery in Broadcast TV task

• Guillaume Gravier / SSIG and IRISA at Multimodal Person Discovery

• Nam Le / EUMSSI team at the MediaEval Person Discovery Challenge
