BenG update on automatic labelling
TRANSCRIPT
MM P05 automatic labeling / term extraction
Victor de Boer
Josefien Schuurman
Roeland Ordelman
• Input:
  – TT888 subtitles
• Output:
  – GTAA terms
    • Onderwerpen (subjects)
    • Persoonsnamen (person names)
    • Namen (names)
    • Geografische namen (geographic names)
  – For the entire video (corresponds to documentalist tasks)
Term extraction from TT888
• version 0.1 – 'naive baseline' – Test input and output
• version 0.2 – Multiple GTAA axes – Improve statistics – Discussion with metadata management
• version 0.3 – More improvements – Evaluation
• version 1.0 – To be re-implemented
Planning
http://www.recensiekoning.nl/2011/09/48928/ondertiteling
• Java to make integration easier
• XML and CSV outputs (illustrative record below)
  – URI of GTAA term
  – pref-label
  – Confidence value
  – Axis
• Input comes from the Immix OAI API, where segmentation should already have taken place
  – Algorithm expects one OAI identifier (Expressie or Selectie)
• Matching with GTAA using an Elasticsearch instance
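For illustration, one output record might look roughly like this; the element names, the CSV column order and the confidence value are placeholders (only the four fields themselves, and the example term gtaa:002151 "theater", come from these slides):

```xml
<term>
  <uri>gtaa:002151</uri>            <!-- URI of the matched GTAA concept (prefixed form) -->
  <prefLabel>theater</prefLabel>    <!-- pref-label of the concept -->
  <confidence>7.3</confidence>      <!-- e.g. the Elasticsearch match score (made-up value) -->
  <axis>Onderwerpen</axis>          <!-- GTAA axis the concept belongs to -->
</term>
```

The same record as a CSV line would simply be: gtaa:002151,theater,7.3,Onderwerpen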
Implementation details
For every item:
1. Get TT888 words in a frequency list
2. Discard stop words ('de', 'het', 'op', 'naar', ...)
3. Take all words with frequency > n
4. Match with GTAA "Onderwerpen" with Elasticsearch score > m
   – Pref-label + alt-label
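A minimal Java sketch of these four steps; the stop-word list, the thresholds n and m, and the gtaaScore() lookup are illustrative placeholders, not the actual implementation (the real matching runs against a GTAA index in Elasticsearch):

```java
import java.util.*;
import java.util.stream.*;

public class NaiveTermExtractor {

    // Illustrative stop-word list; the real list is much longer.
    private static final Set<String> STOP_WORDS = Set.of("de", "het", "op", "naar", "een", "en");

    /** Steps 1-3: word frequency list from the TT888 text, stop words removed, frequency > n. */
    static Map<String, Long> candidateTerms(String tt888Text, int n) {
        return Arrays.stream(tt888Text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isBlank() && !STOP_WORDS.contains(w))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()))
                .entrySet().stream()
                .filter(e -> e.getValue() > n)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    /** Step 4: keep candidates whose best GTAA "Onderwerpen" match (pref- or alt-label) scores above m. */
    static List<String> extract(String tt888Text, int n, double m) {
        return candidateTerms(tt888Text, n).keySet().stream()
                .filter(term -> gtaaScore(term) > m)
                .collect(Collectors.toList());
    }

    /** Placeholder for the Elasticsearch query against the GTAA "Onderwerpen" axis. */
    static double gtaaScore(String term) {
        return 0.0;
    }
}
```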
version 0.1
Algorithm [diagram with: OAI, stop words, GTAA (e.g. gtaa:002151 "theater")]
Informal Evaluation:
Compare to existing (historical) "Onderwerpen" labels
Works a bit (< 20% correct). Input for version 0.2
version 0.1
• Intermediate version, uses a Named Entity Recognizer. Results discussed with Lisette and Vincent -> version 0.3
version 0.2
Algorithm [diagram with: OAI, stop words, word frequency NL, Named Entity Recognition, GTAA (e.g. "theater", "Jos Brink", "Amsterdam")]
• Web service CLTL @ VU
• Input:
– “Hallo, mijn naam is Victor de Boer en ik woon in de mooie stad Haarlem. Ik werk nu bij het Nederlands Instituut voor Beeld en Geluid in Hilversum. Hiervoor was ik werkzaam bij de Vrije Universiteit. “
• Output:[ Victor de Boer | PERSON ], [ Haarlem | LOCATION ], [ Nederlands | MISC ], [ Instituut voor Beeld en Geluid | ORGANIZATION ], [ Hilversum | LOCATION ], [ Vrije Universiteit | ORGANIZATION ]
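A generic sketch of how the pipeline could call such an HTTP NER endpoint from Java; the URL, the plain-text request format and the response handling below are assumptions, not the actual CLTL interface:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NerClient {

    private static final String ENDPOINT = "https://example.org/ner"; // placeholder URL

    static String recognize(String text) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ENDPOINT))
                .header("Content-Type", "text/plain; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(text))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Expected shape (per the example above): "[ Victor de Boer | PERSON ], [ Haarlem | LOCATION ], ..."
        return response.body();
    }
}
```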
Named Entity Recognition
For every item:
1. Track 1
   1. Get TT888 words in a frequency list
   2. Discard stop words ('de', 'het', 'op', 'naar', ...)
   3. Take all n-grams with normalized frequency > n
   4. Match with GTAA "Onderwerpen" with score > m
2. Track 2
   1. Present TT888 to the Named Entity Recognizer (VU web service)
   2. Match the results (with frequency > L) with GTAA "Persoonsnamen", "Geografische namen", "Onderwerpen", "Namen"
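Schematically, the two tracks could be combined as below; every helper and threshold value here is a placeholder for the building blocks named in the list above, not the actual implementation:

```java
import java.util.*;

public class TwoTrackLabeler {

    record GtaaTerm(String uri, String prefLabel, String axis, double score) {}
    record NamedEntity(String text, String type, int frequency) {}

    double n = 0.01; // minimum normalized n-gram frequency (illustrative value)
    int L = 1;       // minimum entity frequency (illustrative value)
    double m = 5.0;  // minimum match score (illustrative value)

    List<GtaaTerm> labelItem(String tt888Text) {
        List<GtaaTerm> result = new ArrayList<>();
        // Track 1: frequent n-grams (stop words removed) matched against "Onderwerpen"
        for (String ngram : frequentNgrams(tt888Text, n)) {
            matchGtaa(ngram, m, "Onderwerpen").ifPresent(result::add);
        }
        // Track 2: named entities from the NER web service, matched against four axes
        for (NamedEntity ne : recognizeEntities(tt888Text)) {
            if (ne.frequency() > L) {
                matchGtaa(ne.text(), m, "Persoonsnamen", "Geografische namen", "Onderwerpen", "Namen")
                        .ifPresent(result::add);
            }
        }
        return result;
    }

    // Placeholders for the real building blocks (frequency list, CLTL NER call, Elasticsearch match).
    List<String> frequentNgrams(String text, double minNormFreq) { return List.of(); }
    List<NamedEntity> recognizeEntities(String text) { return List.of(); }
    Optional<GtaaTerm> matchGtaa(String label, double minScore, String... axes) { return Optional.empty(); }
}
```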
version 0.3
version 0.3 > Example output
• Setup
  – 4 evaluators (Vincent, Lisette, Alma, Tim)
    • 3 in one 50-minute session
    • 1 in another session
  – ~8 minutes per item
  – Video + extracted terms
    • Open videos in IE browser
    • GTAA URIs + pref-labels
    • Any other info allowed
  – Five-point Likert scale
    • Only precision, no recall
Evaluation
The evaluation scale used. 0 means really wrong (e.g. a wrong homonym) or really not relevant (wrong person). Since these interact, this cannot be split out much further.
0: Term is not relevant at all
1: Term is not relevant
2: Term is somewhat relevant
3: Term is relevant
4: Term is very relevant
Evaluation
• Total of 70 terms for 13 videos (5.4 terms per video)
  – Some videos did not start -> discarded
  – 38 terms with three evaluations
  – 32 with one
Results
Per-evaluator mean scores:
  eval_1: 2.59   eval_2: 1.35   eval_3: 2.00   eval_4: 2.37   overall: 2.08

Sample of evaluated terms (three evaluations each):
  item     F  Term               scores   avg
  item 1   6  licht              0, 0, 0  0.00
  item 1   2  Friesland          0, 0, 2  0.67
  item 3   2  soul               0, 1, 1  0.67
  item 3   3  Romme, Gianni      3, 4, 4  3.67
  item 3   2  Somerville, Jimmy  4, 2, 2  2.67
  item 3   3  Harrison, George   4, 4, 3  3.67
  item 3   4  Clapton, Eric      4, 4, 2  3.33
  item 3   2  Milwaukee          3, 1, 1  1.67
• Term “Milwaukee”
– Top2000 a gogo
Example of disagreement
Eval 1 -> score = 3: "The term in itself is not very relevant, but in combination with Romme, Gianni it is still valuable. Again: NER gains strength if the user also gets the time code and can play it back to check whether the fragment is relevant for their search/re-use."
Eval 3 -> score = 1: "mentioned twice, not relevant"
Eval 2 -> score = 1: "…"
Pearson  eval1  eval2  eval3
eval1    1
eval2    0.52   1
eval3    0.67   0.58   1
eval4    0.78   x      0.92
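The pairwise values above are plain Pearson correlations over the terms that both evaluators scored; for reference, a generic computation sketch (not the authors' analysis script):

```java
public class Agreement {

    /** Pearson correlation between two evaluators' scores for the same terms. */
    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n;
        meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov  += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        return cov / Math.sqrt(varA * varB);
    }
}
```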
Inter-annotator agreement
Agreement between 3 and 4 is large; between 1 and 4 it is substantial; between 1 and 2, 1 and 3, and 2 and 3 it is lower but OK.
The task is fairly objective, but somewhat subjective. We look mainly at averages for the rest.
• Total average of 2.15 ("somewhat relevant" or better)
Results: average scores
At a threshold of 2: precision = 0.61
At a threshold of 3: precision = 0.36
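Precision at a threshold here presumably means the fraction of extracted terms whose average rating reaches that threshold; a small sketch under that assumption (avgScores holds one average rating per extracted term):

```java
import java.util.Arrays;

public class PrecisionAtThreshold {

    // Assumed reading: a term counts as correct when its average rating is >= the threshold.
    static double precisionAt(double[] avgScores, double threshold) {
        long correct = Arrays.stream(avgScores).filter(s -> s >= threshold).count();
        return (double) correct / avgScores.length;
    }
}
```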
Results per video
item     average
item 1   0.33
item 3   2.61
item 5   2.44
item 6   1.75
item 8   1.40
item 9   3.67
item 10  2.45
item 13  2.38
item 14  0.00 (!)
item 15  1.33
item 17  2.08
item 19  4.00 (!)
item 20  1.67
• For some videos we shouldn't do this
  – Nederland in Beweging
  – Metadata at Reeks (series) level
"Advice: exclude Level 1 programmes from keyword extraction, probably also from NER"
• Correlation between frequency of term in text and average score
– No correlation (?)
Results correlation freq/score
• For some videos this shouldn't be done
  – Game shows, drama, ...
  – Annotate at Reeks (series) level
• Some axes seem to work better than others
  – Persoonsnamen, Namen, Geografische namen
• More abstraction or combination would be helpful
  – Semantic clustering?
• Subtitles with * are song lyrics
• Still a need for time-coded terms
Evaluator remarks
• Limited evaluation
• But it works (precision 0.61)
  – With some tweaks, to 0.7-0.8
    • Lower threshold for NEs, higher for Subjects
    • Better Elasticsearch matching
  – With semantic clustering, to 0.8-0.9?
• Currently being re-implemented by Arjen as a proper service
• Re-use for annotating program guides
Conclusion and current steps
A huge thanks to the annotators for their valuable effort!!
Questions?
antwoordnu.nl