Overview of the TDT-2003 Evaluation and Results
Jonathan Fiscus
NIST
Gaithersburg, Maryland
November 17-18, 2002
Outline
TDT Evaluation Overview
TDT-2003 Evaluation Result Summaries
New Event Detection
Topic Detection
Topic Tracking
Link Detection
Other Investigations
TDT 101: “Applications for organizing text”
5 TDT Applications:
Story Segmentation
Topic Tracking
Topic Detection
New Event Detection
Link Detection
Terabytes of Unorganized data
TDT’s Research Domain
Technology challenge: Develop applications that organize and locate relevant stories from a continuous feed of news stories
Research driven by evaluation tasks
Composite applications built from:
Automatic Speech Recognition
Story Segmentation
Document Retrieval
Definitions
An event is … a specific thing that happens at a specific time and place, along with all necessary preconditions and unavoidable consequences.
A topic is … an event or activity, along with all directly related events and activities.
A broadcast news story is … a section of transcribed text with substantive information content and a unified topical focus.
TDT-2003 Evaluation Corpus: TDT4 Corpus
TDT4 Corpus used for last year’s evaluation
October 1, 2000 to January 31, 2001
20 sources:
• 8 English, 5 Arabic, 7 Mandarin Chinese
90,735 news stories, 7,513 non-news stories
80 annotated topics
• 40 topics from 2002
• 40 new topics
See LDC’s presentation for more details
What was new in 2003
40 new topics
Same number of “On-Topic” stories
20, 10, and 10 seed stories for Arabic, English, and Mandarin respectively
Many more Arabic “On-Topic” stories
Large influence on scores
[Bar chart: Number of On-Topic Stories (0–3500) by language (Arabic, English, Mandarin) for 2002 topics, 2003 topics, and 2002+2003 topics]
Participants
Carnegie Mellon Univ. (CMU)
Royal Melbourne Institute of Technology (RMIT)
Stottler Henke Associates, Inc. (SHAI)
Univ. Massachusetts (UMass)

Site   New Event  Topic Detection  Topic Tracking  Link Detection
CMU    2          2                6               11
RMIT   –          1                2               –
SHAI   10         –                –               –
UMass  8          3                18              17
TDT Evaluation Methodology
Evaluation tasks are cast as detection tasks: YES there is a target, or NO there is not
Performance is measured in terms of detection cost: “a weighted sum of missed detection and false alarm probabilities”
C_Det = C_Miss × P_Miss × P_target + C_FA × P_FA × (1 − P_target)
C_Miss = 1 and C_FA = 0.1 are preset costs
P_target = 0.02 is the a priori probability of a target
TDT Evaluation Methodology (cont’d)
Detection Cost is normalized to generally lie between 0 and 1:
(C_Det)_Norm = C_Det / min{C_Miss × P_target, C_FA × (1 − P_target)}
When based on the YES/NO decisions, it is referred to as the actual decision cost
Detection Error Tradeoff (DET) curves graphically depict the performance tradeoff between P_Miss and P_FA
Makes use of likelihood scores attached to the YES/NO decisions
Minimum DET point is the best score a system could achieve with proper thresholds
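As a concrete illustration, the cost and normalization formulas above can be computed directly. The constants are the preset values from the slides; the function names are mine, not part of any official scoring kit:

```python
# Sketch of the TDT detection cost. Constants are the preset values
# from the slides; function names are illustrative, not NIST's tools.

C_MISS, C_FA, P_TARGET = 1.0, 0.1, 0.02

def detection_cost(p_miss, p_fa):
    """C_Det: weighted sum of miss and false-alarm probabilities."""
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1 - P_TARGET)

def normalized_cost(p_miss, p_fa):
    """(C_Det)_Norm: scaled so the better trivial system (here, always
    answering NO, since C_Miss*P_target < C_FA*(1-P_target)) scores 1.0."""
    return detection_cost(p_miss, p_fa) / min(C_MISS * P_TARGET,
                                              C_FA * (1 - P_TARGET))

# e.g. a system missing 20% of targets with a 1% false-alarm rate:
cost = normalized_cost(0.20, 0.01)  # roughly 0.249
```

Note how the low P_target makes false alarms cheap per instance but numerous in practice, which is why both terms matter.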
TDT: Experimental Control
Good research requires experimental controls
Conditions that affect performance in TDT:
Newswire vs. Broadcast news
Manual vs. automatic transcription of Broadcast News
Manual vs. automatic story segmentation
Mono- vs. multilingual language material
Topic training amounts and languages
Default automatic English translation vs. native orthography
Decision deferral periods
Outline
TDT Evaluation Overview
TDT-2003 Evaluation Result Summaries
New Event Detection (NED)
Topic Detection
Topic Tracking
Link Detection
Other Investigations
New Event Detection Task
System Goal:
To detect the first story that discusses each topic
• Evaluating “part” of a Topic Detection system, i.e., when to start a new cluster
[Diagram: story stream with stories from Topic 1 and Topic 2; the first story on each topic is a new event, subsequent stories are not first stories]
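The decision rule sketched above can be illustrated with a toy first-story detector. The bag-of-words cosine measure and the threshold are illustrative assumptions, not the evaluated systems:

```python
import math
from collections import Counter

THRESHOLD = 0.2  # assumed decision threshold; real systems tune this

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def new_event_detection(stories):
    """Emit YES (True) for a story that is not sufficiently similar to
    any earlier story, i.e. it appears to start a new topic."""
    seen, decisions = [], []
    for text in stories:
        vec = Counter(text.lower().split())
        decisions.append(all(cosine(vec, old) < THRESHOLD for old in seen))
        seen.append(vec)
    return decisions
```

The max-similarity score over all prior stories is exactly the kind of likelihood score the DET analysis attaches to each YES/NO decision.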
TDT-03 Primary NED Results
SR=nwt+bnasr TE=eng,nat boundary DEF=10
[Bar chart: normalized detection cost (0–1) for CMU1, SHAI1, and UMass1; Actual vs. Minimum]
Primary NED Results: 2002 vs. 2003 Topics
[Bar chart: normalized detection cost (0–0.8) for CMU1, SHAI1, and UMass1 on the 2002 topics vs. the 2003 topics]
Topic Detection Task
System Goal:
To detect topics in terms of the (clusters of) stories that discuss them.
• “Unsupervised” topic training
• New topics must be detected as the incoming stories are processed
• Input stories are then associated with one of the topics
[Diagram: a story stream being partitioned into clusters for Topic 1 and Topic 2]
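A minimal sketch of the unsupervised, single-pass clustering the task description implies. The tokenizer, similarity measure, and threshold are my assumptions for illustration:

```python
import math
from collections import Counter

THRESHOLD = 0.2  # assumed; controls when a new cluster (topic) is opened

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def detect_topics(stories):
    """Assign each story, in stream order, to the most similar existing
    cluster, or open a new cluster when nothing is similar enough."""
    centroids, labels = [], []
    for text in stories:
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for i, c in enumerate(centroids):
            sim = cosine(vec, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < THRESHOLD:
            centroids.append(vec)          # new topic detected
            labels.append(len(centroids) - 1)
        else:
            centroids[best].update(vec)    # fold the story into the cluster
            labels.append(best)
    return labels
```

The "open a new cluster" branch is precisely the New Event Detection sub-problem from the previous task.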
TDT-03 Topic Detection Results
Multilingual Sources, English Translations, Reference Boundaries, 10 File Deferral Period
[Bar chart: normalized detection cost (0–0.7) for RMIT1, CMU1, and UMass3; Newswire+BNews ASR vs. Newswire+BNews Manual Transcription]
Not a primary system
Topic Tracking Task
System Goal:
To detect stories that discuss the target topic, in multiple source streams
• Supervised Training: Given Nt sample stories that discuss a given target topic
• Testing: Find all subsequent stories that discuss the target topic
[Diagram: story stream divided into training data (on-topic sample stories) and test data (unknown stories)]
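The supervised setup above can be sketched as a centroid tracker: the Nt training stories define the topic model, and each test story is scored against it. The helper names and threshold are illustrative assumptions:

```python
import math
from collections import Counter

def _vec(text):
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def track_topic(training_stories, test_stories, threshold=0.2):
    """Build a centroid from the Nt on-topic training stories, then flag
    each test story whose similarity to the centroid clears the threshold."""
    centroid = Counter()
    for text in training_stories:
        centroid.update(text.lower().split())
    return [_cosine(_vec(t), centroid) >= threshold for t in test_stories]
```

With Nt=1 (the primary condition below), the centroid is just the single seed story, which is part of why that condition is hard.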
TDT-03 Primary TRK Results
Newswire+BNews Human Trans., Multilingual Sources, English Translations, Reference Boundaries, 1 Training Story, 0 Negative Training Stories
[Bar chart: normalized detection cost (0–1), Minimum and Actual, for RMIT1, UMass01, and CMU1 under two conditions: Newswire+BNews Human Trans. (Nt=1, Nn=0) and Newswire+BNews ASR (Nt=1, Nn=0)]
Primary Topic Tracking Results: 2002 vs. 2003 Topics
[Bar chart: minimum DET cost (0–0.35) for RMIT1, UMass01, and CMU1 on the 2002 topics vs. the 2003 topics]
Link Detection Task
System Goal:
To detect whether a pair of stories discuss the same topic
(Can be thought of as a “primitive operator” to build a variety of applications)
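The pairwise "primitive operator" can be sketched in a few lines; the similarity measure and threshold are assumptions for illustration, not the evaluated systems:

```python
import math
from collections import Counter

def linked(story_a, story_b, threshold=0.2):
    """YES/NO: do the two stories appear to discuss the same topic?
    Cosine similarity over term-frequency vectors; threshold is assumed."""
    a = Counter(story_a.lower().split())
    b = Counter(story_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return (dot / (na * nb) if na and nb else 0.0) >= threshold
```

Both clustering (topic detection) and tracking can be built by composing this one pairwise decision, which is why it is evaluated as its own task.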
TDT-03 Primary LNK Results
Newswire+BNews ASR, Multilingual Sources, English or Native Translations, Reference Boundaries, 10 File Deferral Period
[Bar chart: normalized detection cost (0–0.4), Actual and Minimum, for CMU1 (English translations), CMU1 (native translations), and UMass01 (English translations)]
TDT-03 Primary LNK Results: 2002 vs. 2003 Topics
[Two bar charts: topic-weighted minimum DET cost (0–0.45) by language pair (Eng2Eng, Man2Man, Arb2Arb, Eng2Arb, Eng2Man, Man2Arb), comparing 2002 vs. 2003 topics for CMU1 and for UMass01]
Outline
TDT Evaluation Overview
TDT-2003 Evaluation Result Summaries
New Event Detection (NED)
Topic Detection
Topic Tracking
Link Detection
Other Investigations
Other Investigations
History of performance
Evaluation Performance History
Link Detection
Year  Condition                                            Site     Score
1999  SR=nwt+bnasr TE=eng,nat DEF=10                       CMU1     1.0943
2000  SR=nwt+bnasr TE=eng+man,eng boundary DEF=10          UMass1   .3134
2001  (same as 2000)                                       CMU1     .2421
2002  SR=nwt+bnasr TE=eng+man+arb,eng boundary DEF=10      PARC1    .1947
2003  SR=nwt+bnasr TE=eng+man+arb,eng boundary DEF=10      UMass01  .1839*
* 0.1798 on 2002 Topics
Evaluation Performance History
Tracking
Year  Condition                                                      Site    Score
1999  SR=nwt+bnasr TR=eng TE=eng+man,eng boundary Nt=4               BBN1    .0922
2000  SR=nwt+bnman TR=eng TE=eng+man,eng boundary Nt=1 Nn=0          IBM1    .1248
2001  (same as 2000)                                                 LIMSI1  .1213
2002  SR=nwt+bnman TR=eng TE=eng+man+arb,eng boundary Nt=1 Nn=0      UMass1  .1647
2003  SR=nwt+bnman TR=eng TE=eng+man+arb,eng boundary Nt=1 Nn=0      UMass1  .1949*
* 0.1618 on 2002 Topics
Evaluation Performance History
Topic Detection
Year  Condition                                            Site         Score
1999  SR=nwt+bnasr TE=eng+man,eng boundary DEF=10          IBM1         .2645
2000  SR=nwt+bnasr TE=eng+man,eng noboundary DEF=10        Dragon1      .3326
2001  (same as 2000)                                       TNO1 (late)  .3551
2002  SR=nwt+bnasr TE=eng+man+arb,eng boundary DEF=10      UMass1       .2021
2003  (same as 2002)                                       CMU1         .3035*
* 0.3007 on 2002 Topics
Evaluation Performance History
New Event Detection
Year  Condition                                    Site    Score
1999  SR=nwt+bnasr TE=eng,nat boundary DEF=10      UMass1  .8110
2000  SR=nwt+bnasr TE=eng,nat noboundary DEF=10    UMass1  .7581
2001  (same as 2000)                               UMass1  .7729
2002  SR=nwt+bnasr TE=eng,nat boundary DEF=10      CMU1    .4449
2003  (same as 2002)                               CMU1    .5971*
* 0.4283 on 2002 Topics
Summary and Issues to Discuss
TDT Evaluation Overview
2003 TDT Evaluation Results
2002 vs. 2003 topic sets are very different
2003 set was weighted more towards Arabic
Dramatic increase in error rates with new topics: link detection, topic tracking, and new event detection
Need to calculate the effect of topic set on topic detection
TDT 2004
Release 2003 topics and TDT4 corpus?
Ensure 2004 evaluation will support Go/No-Go decisions
What tasks will 2004 include?