MediaEval 2015 - Multimodal Person Discovery in Broadcast TV


Multimodal Person Discovery in Broadcast TV

Johann Poignant / Hervé Bredin / Claude Barras

2015

bredin@limsi.fr herve.niderb.fr @hbredin

Outline

• Motivations

• Definition of the task

• Datasets

• Baseline & Metadata

• Evaluation protocol

• Organization

• Results

• Conclusion

2

Motivations

3

"the usual suspects"

• Huge TV archives are useless if not searchable

• People ❤ people

• Need for person-based indexes

4

REPERE

• Evaluation campaigns in 2012, 2013 and 2014
  Three French consortia funded by ANR

• Multimodal people recognition in TV documents: "who speaks when?" and "who appears when?"

• Led to significant progress in both supervised and unsupervised multimodal person recognition

5

From REPERE to Person Discovery

• Speaking faces
  Focus on "person of interest"

• Unsupervised approaches
  People may not be "famous" at indexing time

• Evidence
  Archivist/journalist use case

6

Definition of the task

7

Input

8

Broadcast TV pre-segmented into shots.

Speaking face

9

Tag each shot with the names of people both speaking and appearing at the same time

Person discovery

• Prior biometric models are not allowed.

• Person names must be discovered automatically in text overlay or speech utterances.

unsupervised approaches only

10

Evidence

11

Associate each name with a unique shot proving that the person actually holds this name

Evidence (cont.)

12

an image evidence is a shot during which a person is visible and their name is written on screen

an audio evidence is a shot during which a person is visible and their name is pronounced at least once during a [shot start time - 5s, shot end time + 5s] neighborhood

[Figure: shot #1 is an image evidence for A; shot #3 is an audio evidence for B]
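To make the two rules above concrete, here is a minimal sketch in Python (illustrative only: the shot representation, timestamps and function names are assumptions, not the official task tooling):

```python
# Minimal sketch of the two evidence rules (illustrative; data structures are assumptions).

def is_image_evidence(person_visible, name_written_on_screen):
    # Image evidence: the person is visible while their name is written on screen.
    return person_visible and name_written_on_screen

def is_audio_evidence(person_visible, shot_start, shot_end, name_utterance_times, margin=5.0):
    # Audio evidence: the person is visible and their name is pronounced at least
    # once within the [shot_start - 5s, shot_end + 5s] neighborhood.
    if not person_visible:
        return False
    return any(shot_start - margin <= t <= shot_end + margin
               for t in name_utterance_times)

# Example: a name pronounced 3 seconds after the shot ends still counts.
print(is_audio_evidence(True, shot_start=10.0, shot_end=20.0, name_utterance_times=[23.0]))  # True
```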

Datasets

13

Datasets: DEV (REPERE) | TEST (INA)

14

DEV | REPERE
• 137 hours
• two French TV channels, eight different types of shows
• dense audio annotations: speaker diarization, speech transcription
• sparse video annotations: face detection & recognition, optical character recognition

TEST | INA
• 106 hours (172 videos)
• only one French TV channel, only one type (news)
• no prior annotation
• a posteriori collaborative annotation

Datasets: TEST | INA

15

• 106 hours (172 videos)
• only one French TV channel, only one type (news)
• no prior annotation
• a posteriori collaborative annotation

http://dataset.ina.fr

Evaluation protocol

16

Information retrieval task

• Queries formatted as firstname_lastname

e.g. francois_hollande "return all shots where François Hollande is speaking and visible at the same time"

• Approximate search among submitted names, e.g. francois_holande

• Select shots tagged with the most similar name

17
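The approximate matching can be sketched as follows (a minimal illustration; the exact Levenshtein-ratio normalization used by the official scoring tool is not reproduced here, so the normalization below is an assumption):

```python
# Sketch of approximate query-to-name matching with a Levenshtein-style ratio
# (the normalization is an assumption, not the official implementation).

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def ratio(a, b):
    # Similarity in [0, 1]: 1.0 means identical strings.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def best_match(query, submitted_names):
    # Select the hypothesized name n_q with the highest ratio to the query.
    return max(((n, ratio(query, n)) for n in submitted_names), key=lambda x: x[1])

print(best_match("francois_hollande", ["francois_holande", "nicolas_sarkozy"]))
```

With this normalization, francois_holande matches the query francois_hollande with a ratio of about 0.94; the official tool may normalize differently.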

Evidence-weighted MAP

18

To ensure that participants followed this strict "no biometric supervision" constraint, each hypothesized name $n \in N$ had to be backed up by a carefully selected and unique shot proving that the person actually holds this name $n$: we call this an evidence and denote it by $E : N \to S$. In real-world conditions, this evidence would help a human annotator double-check the automatically generated index, even for people they did not know beforehand.

Two types of evidence were allowed: an image evidence is a shot during which a person is visible, and their name is written on screen; an audio evidence is a shot during which a person is visible, and their name is pronounced at least once during a [shot start time - 5s, shot end time + 5s] neighborhood. For instance, in Figure 1, shot #1 is an image evidence for Mr A (because his name and his face are visible simultaneously on screen) while shot #3 is an audio evidence for Mrs B (because her name is pronounced less than 5 seconds before or after her face is visible on screen).

3. DATASETS

The REPERE corpus – distributed by ELDA – served as development set. It is composed of various TV shows (around news, politics and people) from two French TV channels, for a total of 137 hours. A subset of 50 hours is manually annotated. Audio annotations are dense and provide speech transcripts and identity-labeled speech turns. Video annotations are sparse (one image every 10 seconds) and provide overlaid text transcripts and identity-labeled face segmentation. Both speech and overlaid text transcripts are tagged with named entities. The test set – distributed by INA – contains 106 hours of video, corresponding to 172 editions of the evening broadcast news "Le 20 heures" of French public channel "France 2", from January 1st 2007 to June 30th 2007.

As the test set came completely free of any annotation, it was annotated a posteriori based on participants' submissions. In the following, task groundtruths are denoted by function $L : S \to \mathcal{P}(N)$, which maps each shot $s$ to the set of names of every speaking face it contains, and function $E : S \to \mathcal{P}(N)$, which maps each shot $s$ to the set of person names for which it actually is an evidence.

4. BASELINE AND METADATA

This task targeted researchers from several communities including multimedia, computer vision, speech and natural language processing. Though the task was multimodal by design and necessitated expertise in various domains, the technological barrier to entry was lowered by the provision of a baseline system, described in Figure 2 and available as open-source software [1]. For instance, a researcher from the speech processing community could focus their research efforts on improving speaker diarization and automatic speech transcription, while still relying on the provided face detection and tracking results to participate in the task.

[1] http://github.com/MediaEvalPersonDiscovery

Figure 2: Multimodal baseline pipeline. Output of greyed-out modules is provided to the participants.

The audio stream was segmented into speech turns, while faces were detected and tracked in the visual stream. Speech turns (resp. face tracks) were then compared and clustered based on MFCC and the Bayesian Information Criterion [10] (resp. HOG [11] and Logistic Discriminant Metric Learning [17] on facial landmarks [31]). The approach proposed in [27] was also used to compute a probabilistic mapping between co-occurring faces and speech turns. Written (resp. pronounced) person names were automatically extracted from the visual stream (resp. the audio stream) using the open source LOOV Optical Character Recognition system [24] (resp. Automatic Speech Recognition [21, 12]), followed by Named Entity detection (NE). The fusion module was a two-step algorithm: propagation of written names onto speaker clusters [26], followed by propagation of speaker names onto co-occurring speaking faces.
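To illustrate the two-step fusion just described, here is a toy sketch (the data structures, helper names and the simple voting scheme are assumptions for illustration; the actual baseline relies on the propagation methods cited above):

```python
# Toy sketch of the two-step fusion: (1) propagate written names onto speaker
# clusters, then (2) propagate speaker names onto co-occurring speaking faces.
# Data structures and the voting scheme are illustrative assumptions.
from collections import Counter

def name_speaker_clusters(speech_turns):
    """speech_turns: list of (cluster_id, [names written on screen during the turn]).
    Returns {cluster_id: most frequently co-occurring written name}."""
    votes = {}
    for cluster_id, written_names in speech_turns:
        votes.setdefault(cluster_id, Counter()).update(written_names)
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items() if counts}

def tag_speaking_faces(shots, cluster_name):
    """shots: {shot_id: [speaker cluster of each speaking face in the shot]}.
    Returns {shot_id: set of propagated person names}."""
    return {shot: {cluster_name[c] for c in clusters if c in cluster_name}
            for shot, clusters in shots.items()}

turns = [("spk_0", ["a_person"]), ("spk_0", []), ("spk_1", ["b_person"])]
shots = {"shot_1": ["spk_0"], "shot_2": ["spk_0", "spk_1"], "shot_3": []}
print(tag_speaking_faces(shots, name_speaker_clusters(turns)))
# e.g. {'shot_1': {'a_person'}, 'shot_2': {'a_person', 'b_person'}, 'shot_3': set()}
```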

5. EVALUATION METRIC

This information retrieval task was evaluated using a variant of Mean Average Precision (MAP) that took the quality of evidences into account. For each query $q \in Q \subset N$ (firstname_lastname), the hypothesized person name $n_q$ with the highest Levenshtein ratio $\rho$ to the query $q$ is selected ($\rho : N \times N \to [0, 1]$), allowing approximate name transcription:

$$n_q = \underset{n \in N}{\arg\max}\ \rho(q, n) \qquad \text{and} \qquad \rho_q = \rho(q, n_q)$$

Average precision AP(q) is then computed classically based on relevant and returned shots:

$$\text{relevant}(q) = \{s \in S \mid q \in L(s)\}$$
$$\text{returned}(q) = \{s \in S \mid n_q \in L(s)\}, \text{ sorted by confidence}$$

The proposed evidence is correct if name $n_q$ is close enough to the query $q$ and if shot $E(n_q)$ actually is an evidence for $q$:

$$C(q) = \begin{cases} 1 & \text{if } \rho_q > 0.95 \text{ and } q \in E(E(n_q)) \\ 0 & \text{otherwise} \end{cases}$$

To ensure participants do provide correct evidences for every hypothesized name $n \in N$, standard MAP is altered into EwMAP (Evidence-weighted Mean Average Precision), the official metric for the task:

$$\text{EwMAP} = \frac{1}{|Q|} \sum_{q \in Q} C(q) \cdot \text{AP}(q)$$

Acknowledgment. This work was supported by the French National Agency for Research under grant ANR-12-CHRI-0006-01. The open source CAMOMILE collaborative annotation platform [2] was used extensively throughout the progress of the task: from the run submission script to the automated leaderboard, including a posteriori collaborative annotation of the test corpus. We thank ELDA and INA for supporting the task by distributing the development and test datasets.

[2] http://github.com/camomile-project

$$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$$

C(q) measures the correctness of provided evidence for query q
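Putting these definitions together, here is a minimal sketch of the metric (assuming hypotheses and groundtruth are already available as plain Python mappings; the official scorer may organize things differently):

```python
# Sketch of Evidence-weighted MAP (illustrative; the official scoring tool may differ).

def average_precision(returned, relevant):
    """returned: shot ids sorted by decreasing confidence; relevant: set of correct shot ids."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(returned, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def ewmap(queries, rho_q, evidence_shot, evidence_truth, returned, relevant):
    """queries: list of query names q.
    rho_q[q]: Levenshtein ratio between q and the selected hypothesis name n_q.
    evidence_shot[q]: shot E(n_q) submitted as evidence for the selected name.
    evidence_truth[s]: set of names that shot s truly evidences (groundtruth E).
    returned[q] / relevant[q]: shots as used by average_precision."""
    total = 0.0
    for q in queries:
        correct = rho_q[q] > 0.95 and q in evidence_truth.get(evidence_shot[q], set())
        total += (1.0 if correct else 0.0) * average_precision(returned[q], relevant[q])
    return total / len(queries)
```

Dropping the C(q) factor recovers standard MAP, so a run that provides correct evidence for every query scores the same under both metrics.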

Baseline & Metadata

19

the "multi" in multimediaTask necessitates expertise in various domains:• multimedia• computer vision• speech processing• natural language processing

20

The technological barrier to entry is lowered by the provision of a baseline system.

github.com/MediaevalPersonDiscoveryTask

Baseline fusion

21

22

[Table: per-team use of baseline modules (face tracking, face clustering, speaker diarization, optical character recognition, automatic speech recognition, speaking face detection, fusion)]

Was the baseline useful?

• team relied on the baseline module

• team developed their own module

• team tried both

9 participants

23

[Diagram: the same per-team module table, grouping teams that focused on monomodal components (building on the baseline) versus teams that focused on multimodal fusion]

Organization

24

Schedule

• 01.05 development set release
• 01.06 test set release
• 01.07 "out-domain" submission deadline
• 01.07 – 08.07 leaderboard updated every 6 hours
• 08.07 "in-domain" submission deadline
• 09.07 – 28.07 collaborative annotation
• 28.07 test set annotation release
• 28.07 – 28.08 adjudication

25

Leaderboard

26

computed on a secret subset of the test set

updated every 6 hours

private leaderboard: participants know how they rank and know their own score, but do not know the scores of others

Collaborative annotation

27

Thanks!

28

around 50 hours in total

Thanks!

29

half of the test set ✔

CAMOMILE collaborative annotation platform
github.com/camomile-project

objectives
• open source
• provide an architecture for effective creation and sharing of annotations of multimedia, multimodal and multilingual data
• enable novel collaborative and interactive annotation tools
• be generic enough to support as many use cases as possible

architecture
• a server exchanging JSON with several clients ("bring your own client")
• user authentication, permissions, annotation history, annotation queue

data model
• corpus: homogeneous collection of multimedia documents
• medium: multimedia document (audio, video, image, text)
• layer: homogeneous collection of annotations
• annotation: medium fragment (e.g. a time span) + attached metadata (categorical value, numerical value, free text, raw content)

REST API (a client sketch follows below)
• GET /corpus/:idcorpus : obtain information about a corpus
• POST /corpus/:idcorpus/medium : add a medium to a corpus
• PUT /layer/:idlayer/annotation : update an annotation
• DEL /annotation/:idannotation : delete an annotation
• and more...

online documentation
http://camomile-project.github.io/camomile-server
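As an illustration of the API style, here is a hedged client sketch (the four endpoints follow the list above, but the base URL, placeholder identifiers, authentication handling and payload fields are assumptions, not documented behaviour):

```python
# Sketch of a CAMOMILE-style JSON REST client (endpoints from the slide;
# base URL, ids, authentication and payload fields are illustrative assumptions).
import requests

BASE_URL = "http://localhost:3000"        # assumed server address
CORPUS_ID = "corpus-id-placeholder"       # placeholder identifiers
LAYER_ID = "layer-id-placeholder"
ANNOTATION_ID = "annotation-id-placeholder"

session = requests.Session()              # authentication not shown here

# GET /corpus/:idcorpus : obtain information about a corpus
corpus = session.get(f"{BASE_URL}/corpus/{CORPUS_ID}").json()

# POST /corpus/:idcorpus/medium : add a medium to a corpus
session.post(f"{BASE_URL}/corpus/{CORPUS_ID}/medium", json={"name": "example medium"})

# PUT /layer/:idlayer/annotation : update an annotation (payload fields are assumptions)
session.put(f"{BASE_URL}/layer/{LAYER_ID}/annotation",
            json={"fragment": {"start": 12.3, "end": 15.8}, "data": "firstname_lastname"})

# DEL /annotation/:idannotation : delete an annotation
session.delete(f"{BASE_URL}/annotation/{ANNOTATION_ID}")
```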


Results

31

Stats

• 14 (+ organizers) registered participants received dev and test data

• 8 (+ organizers) participants submitted 70 runs

• 7 (+ organizers) submitted a working note paper

• 7 (+ organizers) attended the workshop

32

33

out-domain results

28k shots 2070 queries 1642 # vs. 428 $

EwMAP (%) for PRIMARY runs

34

out-domain results

28k shots 2070 queries 1642 # vs. 428 $

EwMAP (%) for PRIMARY runs

in-domain results

35

EwMAP (%) for PRIMARY runs

in-domain results

36

EwMAP (%) for PRIMARY runs

37

out- vs. in-domain

# people that no other team discovered

38

[Chart: per-team counts of people that no other team discovered; annotations point to speech transcripts and anchors]

EwMAP (%) for PRIMARY runs

Conclusion

• Had a great (and exhausting) time organizing this task

• The winning submission DID NOT make any use of the face modality.

• (Almost) nobody used ASR: most people are introduced by overlaid names in the test set

• No cross-show approaches

• Poster session later today at 16:00
• Technical retreat tomorrow morning at 9:15

39

Next (oral)

• Mateusz Budnik / LIG at MediaEval 2015 Multimodal Person Discovery in Broadcast TV Task

• Meriem Bendris / PERCOLATTE: a multimodal person discovery system in TV broadcast for the MediaEval 2015 evaluation campaign

• Rosalia Barros / GTM-UVigo Systems for Person Discovery Task at MediaEval 2015

40

Next (poster)

• Claude Barras / Multimodal Person Discovery in Broadcast TV at MediaEval 2015

• Johann Poignant / LIMSI at MediaEval 2015: Person Discovery in Broadcast TV Task

• Paula Lopez Otero / GTM-UVigo Systems for Person Discovery Task at MediaEval 2015

• Javier Hernando / UPC System for the 2015 MediaEval Multimodal Person Discovery in Broadcast TV task

• Guillaume Gravier / SSIG and IRISA at Multimodal Person Discovery

• Nam Le / EUMSSI team at the MediaEval Person Discovery Challenge

41
