
Page 1: Overview of the TDT-2003 Evaluation and Results

Overview of the TDT-2003 Evaluation and Results

Jonathan Fiscus

NIST

Gaithersburg, Maryland

November 17-18, 2003

Page 2: Overview of the TDT-2003 Evaluation and Results

Outline

TDT Evaluation Overview

TDT-2003 Evaluation Result Summaries

New Event Detection

Topic Detection

Topic Tracking

Link Detection

Other Investigations

Page 3: Overview of the TDT-2003 Evaluation and Results

TDT 101: “Applications for organizing text”

5 TDT Applications:

Story Segmentation

Topic Tracking

Topic Detection

New Event Detection

Link Detection

Terabytes of Unorganized data

Page 4: Overview of the TDT-2003 Evaluation and Results

TDT’s Research Domain

Technology challenge: develop applications that organize and locate relevant stories from a continuous feed of news stories

Research driven by evaluation tasks

Composite applications built from:

Automatic Speech Recognition

Story Segmentation

Document Retrieval

Page 5: Overview of the TDT-2003 Evaluation and Results

Definitions

An event is … a specific thing that happens at a specific time and place, along with all necessary preconditions and unavoidable consequences.

A topic is … an event or activity, along with all directly related events and activities.

A broadcast news story is … a section of transcribed text with substantive information content and a unified topical focus.

Page 6: Overview of the TDT-2003 Evaluation and Results

TDT-03 Evaluation Corpus: TDT4 Corpus

TDT4 Corpus, used for last year’s evaluation:

October 1, 2000 to January 31, 2001

20 sources:

• 8 English, 5 Arabic, 7 Mandarin Chinese

90,735 news stories, 7,513 non-news stories

80 annotated topics:

• 40 topics from 2002

• 40 new topics

See LDC’s presentation for more details

Page 7: Overview of the TDT-2003 Evaluation and Results

What was new in 2003

40 new topics

Same number of “On-Topic” stories

20, 10, 10 seed stories for Arabic, English, and Mandarin, respectively

Many more Arabic “On-Topic” stories

Large influence on scores

[Chart: number of “On-Topic” stories by language (Arabic, English, Mandarin) for the 2002 topics, the 2003 topics, and the combined 2002+2003 topic set]

Page 8: Overview of the TDT-2003 Evaluation and Results

Participants

Carnegie Mellon Univ. (CMU)

Royal Melbourne Institute of Technology (RMIT)

Stottler Henke Associates, Inc. (SHAI)

Univ. Massachusetts (UMass)

Site    New Event   Topic Detection   Topic Tracking   Link Detection
CMU         2              2                 6               11
RMIT        -              1                 2                -
SHAI       10              -                 -                -
UMass       8              3                18               17

Page 9: Overview of the TDT-2003 Evaluation and Results

TDT Evaluation Methodology

Evaluation tasks are cast as detection tasks: YES, there is a target, or NO, there is not

Performance is measured in terms of detection cost: “a weighted sum of missed detection and false alarm probabilities”

CDet = CMiss * PMiss * Ptarget + CFA * PFA * (1 - Ptarget)

CMiss = 1 and CFA = 0.1 are preset costs

Ptarget = 0.02 is the a priori probability of a target
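As a rough worked example (not NIST's scoring software), plugging the preset constants into this formula looks like the Python sketch below; the PMiss and PFA values are hypothetical.

```python
# Minimal sketch of the detection-cost formula above (not NIST's scorer).
C_MISS = 1.0      # preset cost of a missed detection
C_FA = 0.1        # preset cost of a false alarm
P_TARGET = 0.02   # preset a priori probability of a target

def detection_cost(p_miss, p_fa):
    """CDet = CMiss * PMiss * Ptarget + CFA * PFA * (1 - Ptarget)."""
    return C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1.0 - P_TARGET)

# Hypothetical system: misses 20% of targets, false-alarms on 1% of non-targets.
print(detection_cost(0.20, 0.01))   # 0.004 + 0.00098 = 0.00498
```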

Page 10: Overview of the TDT-2003 Evaluation and Results

TDT Evaluation Methodology(cont’d)

Detection Cost is normalized to generally lie between 0 and 1:

(CDet)Norm = CDet / min{CMiss * Ptarget, CFA * (1 - Ptarget)}

When based on the YES/NO decisions, it is referred to as the actual decision cost

Detection Error Tradeoff (DET) curves graphically depict the performance tradeoff between PMiss and PFA

Makes use of likelihood scores attached to the YES/NO decisions

The minimum DET point is the best score a system could achieve with proper thresholds
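A minimal sketch of these two steps, using made-up likelihood scores and labels (this is not the official TDT scoring or DET software): divide the detection cost by the cost of the better trivial system (always YES or always NO), and sweep a threshold over the scores to locate the minimum DET point.

```python
# Minimal sketch of normalized cost and the minimum DET point (not the official scorer).
C_MISS, C_FA, P_TARGET = 1.0, 0.1, 0.02
NORM = min(C_MISS * P_TARGET, C_FA * (1.0 - P_TARGET))   # cost of the better trivial system

def normalized_cost(p_miss, p_fa):
    c_det = C_MISS * p_miss * P_TARGET + C_FA * p_fa * (1.0 - P_TARGET)
    return c_det / NORM

def minimum_det_point(scores, is_target):
    """Best normalized cost achievable with an ideal decision threshold."""
    n_tgt = sum(is_target)
    n_non = len(is_target) - n_tgt
    best = float("inf")
    for thr in sorted(set(scores)) + [float("inf")]:       # candidate thresholds
        p_miss = sum(1 for s, t in zip(scores, is_target) if t and s < thr) / n_tgt
        p_fa = sum(1 for s, t in zip(scores, is_target) if not t and s >= thr) / n_non
        best = min(best, normalized_cost(p_miss, p_fa))
    return best

scores    = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]            # hypothetical likelihood scores
is_target = [True, True, False, True, False, False]   # hypothetical reference labels
print(minimum_det_point(scores, is_target))           # ~0.33 for this toy data
```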

Page 11: Overview of the TDT-2003 Evaluation and Results

TDT: Experimental Control

Good research requires experimental controls

Conditions that affect performance in TDT:

Newswire vs. broadcast news

Manual vs. automatic transcription of broadcast news

Manual vs. automatic story segmentation

Mono- vs. multilingual language material

Topic training amounts and languages

Default automatic English translation vs. native orthography

Decision deferral periods

Page 12: Overview of the TDT-2003 Evaluation and Results

Outline

TDT Evaluation Overview

TDT-2003 Evaluation Result Summaries

New Event Detection (NED)

Topic Detection

Topic Tracking

Link Detection

Other Investigations

Page 13: Overview of the TDT-2003 Evaluation and Results

New Event Detection Task

System Goal: to detect the first story that discusses each topic

• Evaluating “part” of a Topic Detection system, i.e., when to start a new cluster

[Diagram: story stream showing the first story on each of two topics (Topic 1, Topic 2); later stories on those topics are not first stories]
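For concreteness, a toy baseline for this task (not any evaluated site's system; the 0.2 threshold is arbitrary) declares a story a first story when it is sufficiently dissimilar from every story seen so far:

```python
# Toy first-story (new event) detector: flag a story as new when its best
# cosine match against all earlier stories is below a threshold.
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def new_event_decisions(stories, threshold=0.2):
    """YES/NO first-story decision for each story, in arrival order."""
    seen, decisions = [], []
    for text in stories:
        vec = Counter(text.lower().split())                # toy term-frequency vector
        max_sim = max((cosine(vec, s) for s in seen), default=0.0)
        decisions.append(max_sim < threshold)              # YES = new event
        seen.append(vec)
    return decisions

print(new_event_decisions([
    "earthquake hits coastal town",
    "rescue teams reach earthquake town",
    "parliament passes new budget",
]))   # [True, False, True] with this toy threshold
```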

Page 14: Overview of the TDT-2003 Evaluation and Results

TDT-03 Primary NED Results
SR=nwt+bnasr TE=eng,nat boundary DEF=10

[Chart: actual and minimum normalized detection cost for the primary NED systems CMU1, SHAI1, and UMass1]

Page 15: Overview of the TDT-2003 Evaluation and Results

Primary NED Results: 2002 vs. 2003 Topics

[Chart: NED detection cost for CMU1, SHAI1, and UMass1 on the 2002 topics vs. the 2003 topics]

Page 16: Overview of the TDT-2003 Evaluation and Results

Topic Detection Task

System Goal: to detect topics in terms of the (clusters of) stories that discuss them.

• “Unsupervised” topic training

• New topics must be detected as the incoming stories are processed

• Input stories are then associated with one of the topics

[Diagram: incoming story stream being grouped into clusters for Topic 1 and Topic 2]
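One simple way to realize this, sketched below purely for illustration (a toy incremental clusterer, not any evaluated system; the threshold is arbitrary): each incoming story joins its most similar existing cluster, or starts a new one.

```python
# Toy incremental topic detection: cluster stories as they arrive.
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def detect_topics(stories, threshold=0.2):
    """Return a cluster id for each story, processed in arrival order."""
    centroids, assignments = [], []
    for text in stories:
        vec = Counter(text.lower().split())
        sims = [cosine(vec, c) for c in centroids]
        if sims and max(sims) >= threshold:
            best = sims.index(max(sims))       # join the closest existing topic
        else:
            centroids.append(Counter())        # no close topic: start a new cluster
            best = len(centroids) - 1
        centroids[best].update(vec)            # crude running centroid
        assignments.append(best)
    return assignments
```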

Page 17: Overview of the TDT-2003 Evaluation and Results

TDT-03 Topic Detection Results
Multilingual Sources, English Translations, Reference Boundaries, 10 File Deferral Period

[Charts: normalized detection cost for RMIT1, CMU1, and UMass3, comparing Newswire+BNews ASR with Newswire+BNews manual transcripts; one plotted system is annotated “Not a primary system”]

Page 18: Overview of the TDT-2003 Evaluation and Results

Topic Tracking Task

System Goal:

To detect stories that discuss the target topic, in multiple source streams

• Supervised training: given Nt sample stories that discuss a given target topic

• Testing: find all subsequent stories that discuss the target topic

[Diagram: chronological story stream split into training data (on-topic stories) and test data (stories of unknown relevance)]
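A toy sketch of this supervised setup (a centroid-plus-threshold tracker invented for illustration, not any evaluated site's system): the Nt on-topic training stories define a topic model, and each test story gets a YES/NO decision plus a likelihood score.

```python
# Toy topic tracker: YES/NO decision and score for each test story.
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def track_topic(training_stories, test_stories, threshold=0.2):
    """Return (decision, score) pairs for the test stream."""
    centroid = Counter()
    for text in training_stories:              # Nt on-topic training stories (Nt=1 here)
        centroid.update(text.lower().split())
    results = []
    for text in test_stories:
        score = cosine(Counter(text.lower().split()), centroid)
        results.append((score >= threshold, score))
    return results
```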

Page 19: Overview of the TDT-2003 Evaluation and Results

TDT-03 Primary TRK Results
Newswire+BNews Human Trans., Multilingual Sources, English Translations, Reference Boundaries, 1 Training Story, 0 Negative Training Stories

[Chart: minimum and actual normalized detection cost for RMIT1, UMass01, and CMU1 under two conditions: Newswire+BNews human transcripts (Nt=1, Nn=0) and Newswire+BNews ASR (Nt=1, Nn=0)]

Page 20: Overview of the TDT-2003 Evaluation and Results

Primary Topic Tracking Results: 2002 vs. 2003 Topics

[Chart: minimum DET cost for RMIT1, UMass01, and CMU1 on the 2002 topics vs. the 2003 topics]

Page 21: Overview of the TDT-2003 Evaluation and Results

Link Detection Task

System Goal: to detect whether a pair of stories discuss the same topic. (Can be thought of as a “primitive operator” with which to build a variety of applications.)

[Diagram: two stories joined by a “?”, asking whether they discuss the same topic]
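As a toy illustration of that primitive operator (not any evaluated system; the threshold and scoring are invented for the example), a link decision can be as small as thresholding a pairwise similarity, with the raw score doubling as the likelihood used for DET analysis:

```python
# Toy link detector: do these two stories discuss the same topic?
def link_decision(story_a, story_b, threshold=0.2):
    a = set(story_a.lower().split())
    b = set(story_b.lower().split())
    score = len(a & b) / len(a | b) if (a | b) else 0.0   # Jaccard word overlap as a toy score
    return score >= threshold, score                      # (YES/NO decision, likelihood score)

print(link_decision("earthquake hits coastal town",
                    "rescue teams reach earthquake town"))   # (True, ~0.29)
```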

Page 22: Overview of the TDT-2003 Evaluation and Results

TDT-03 Primary LNK Results
Newswire+BNews ASR, Multilingual Sources, English or Native Translations, Reference Boundaries, 10 File Deferral Period

[Chart: actual and minimum normalized detection cost for CMU1 (Eng. Trans.), CMU1 (Native Trans.), and UMass01 (Eng. Trans.)]

Page 23: Overview of the TDT-2003 Evaluation and Results

TDT-03 Primary LNK Results: 2002 vs. 2003 Topics

[Charts: topic-weighted minimum DET cost for CMU1 and UMass01 by source-language pair (Eng2Eng, Man2Man, Arb2Arb, Eng2Arb, Eng2Man, Man2Arb), comparing the 2002 topics with the 2003 topics]

Page 24: Overview of the TDT-2003 Evaluation and Results

Outline

TDT Evaluation Overview

TDT-2003 Evaluation Result Summaries

New Event Detection (NED)

Topic Detection

Topic Tracking

Link Detection

Other Investigations

Page 25: Overview of the TDT-2003 Evaluation and Results

Other Investigations

History of performance

Page 26: Overview of the TDT-2003 Evaluation and Results

Evaluation Performance History

Link Detection

Year  Condition                                            Site     Score
1999  SR=nwt+bnasr TE=eng,nat DEF=10                       CMU1     1.0943
2000  SR=nwt+bnasr TE=eng+man,eng boundary DEF=10          UMass1   .3134
2001  (same as 2000)                                       CMU1     .2421
2002  SR=nwt+bnasr TE=eng+man+arb,eng boundary DEF=10      PARC1    .1947
2003  SR=nwt+bnasr TE=eng+man+arb,eng boundary DEF=10      UMass01  .1839*

* 0.1798 on 2002 Topics

Page 27: Overview of the TDT-2003 Evaluation and Results

Evaluation Performance History

Tracking

Year  Condition                                                       Site    Score
1999  SR=nwt+bnasr TR=eng TE=eng+man,eng boundary Nt=4                BBN1    .0922
2000  SR=nwt+bnman TR=eng TE=eng+man,eng boundary Nt=1 Nn=0           IBM1    .1248
2001  (same as 2000)                                                  LIMSI1  .1213
2002  SR=nwt+bnman TR=eng TE=eng+man+arb,eng boundary Nt=1 Nn=0       UMass1  .1647
2003  SR=nwt+bnman TR=eng TE=eng+man+arb,eng boundary Nt=1 Nn=0       UMass1  .1949*

* 0.1618 on 2002 Topics

Page 28: Overview of the TDT-2003 Evaluation and Results

Evaluation Performance History

Topic Detection

Year  Condition                                            Site         Score
1999  SR=nwt+bnasr TE=eng+man,eng boundary DEF=10          IBM1         .2645
2000  SR=nwt+bnasr TE=eng+man,eng noboundary DEF=10        Dragon1      .3326
2001  (same as 2000)                                       TNO1 (late)  .3551
2002  SR=nwt+bnasr TE=eng+man+arb,eng boundary DEF=10      UMass1       .2021
2003  (same as 2002)                                       CMU1         .3035*

* 0.3007 on 2002 Topics

Page 29: Overview of the TDT-2003 Evaluation and Results

Evaluation Performance History

New Event Detection

Year  Condition                                    Site    Score
1999  SR=nwt+bnasr TE=eng,nat boundary DEF=10      UMass1  .8110
2000  SR=nwt+bnasr TE=eng,nat noboundary DEF=10    UMass1  .7581
2001  (same as 2000)                               UMass1  .7729
2002  SR=nwt+bnasr TE=eng,nat boundary DEF=10      CMU1    .4449
2003  (same as 2002)                               CMU1    .5971*

* 0.4283 on 2002 Topics

Page 30: Overview of the TDT-2003 Evaluation and Results

Summary and Issues to Discuss

TDT Evaluation Overview

2003 TDT Evaluation Results

2002 vs. 2003 topic sets are very different:

The 2003 set was weighted more towards Arabic

Dramatic increase in error rates with the new topics for link detection, topic tracking, and new event detection

Need to calculate the effect of the topic set on topic detection

TDT 2004:

Release 2003 topics and TDT4 corpus?

Ensure the 2004 evaluation will support Go/No Go decisions

What tasks will 2004 include?