
The Evolution of Shared-Task Evaluation
Douglas W. Oard
College of Information Studies and UMIACS
University of Maryland, College Park, USA
December 4, 2013 FIRE


Page 1:

The Evolution of Shared-Task Evaluation

Douglas W. Oard
College of Information Studies and UMIACS

University of Maryland, College Park, USA

December 4, 2013 FIRE

Page 2:

The Story

• Evaluation-guided research

• The three C’s

• Five examples

• Thinking forward

Page 3:

Evaluation-Guided Research

• Information Retrieval
• Text classification
• Automatic Speech Recognition
• Optical Character Recognition
• Named Entity Recognition
• Machine Translation
• Extractive summarization
• …

Page 4:

Key Elements

• Task model

• Single-valued evaluation measure (see the sketch below)

• Affordable evaluation process
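For concreteness, here is a minimal sketch of one widely used single-valued measure, mean average precision (MAP), of the kind TREC-style ad hoc evaluations report. Function and variable names are illustrative, not from the talk.

```python
def average_precision(ranked, relevant):
    """Average precision for one topic: the mean of the precision
    values at each rank where a relevant document appears, with
    unretrieved relevant documents contributing zero."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """One number for the whole run: the mean over all topics.
    run: topic -> ranked list of doc ids
    qrels: topic -> set of relevant doc ids"""
    return sum(average_precision(run.get(t, []), qrels[t])
               for t in qrels) / len(qrels)
```

Collapsing a whole run to one number is what makes cross-site comparison possible; it is also exactly the design choice the critiques on the next slide push back against.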

Page 5:

Critiques

• Early convergence

• Duplicative ($)

• Incrementalism

• Privileging the measurable

Page 6:

The Big Four

• TREC

• NTCIR

• CLEF

• FIRE

Page 7:

10 More
• TDT
• Amaryllis
• INEX
• TRECVid
• TAC
• MediaEval
• STD
• OAEI
• CoNLL
• WePS

Page 8:

What We Create

• Collections

• Comparison points
– Baseline results

• Communities

• Competition?

Page 9:

Elsewhere in the Ecosystem …
• Capacity

– From universities, industry, individuals, and funding agencies

• Completed work
– Often requires working outside our year-long innovation cycles with rigid timelines

• Culling
– Conferences and journals are the guardians of community standards

Page 10:

A Typical Task Life Cycle

• Year 1:
– Task definition
– Evaluation design
– Community building

• Year 2:
– Creating training data

• Year 3:
– Reusable test collection
– Establishing strong baselines

Page 11:

Some Sea Stories

• TDT

• CLIR

• Speech Retrieval

• E-Discovery

Page 12:

Topic Detection and Tracking

• Cultures
– Speech, sponsor

• Event-based relevance

• Document boundary discovery

• Complexity
– 5 tasks, 3 languages, 2 modalities

• Lasting influence
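TDT's detection and tracking tasks were scored not with ranked-retrieval measures but with a NIST-style detection cost combining miss and false-alarm rates. A minimal sketch follows; the default constants shown are the values conventionally cited for TDT, and should be treated as an assumption rather than read off the slide.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """NIST-style detection cost: a weighted combination of the miss
    rate and the false-alarm rate at a fixed target prior."""
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Normalize so the better of the two trivial systems (always
    # "yes" or always "no") scores 1.0; lower is better.
    return cost / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

This style of scoring recurs in other NIST evaluations (speaker recognition, for example), which is part of the lasting influence the slide points to.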

Page 13:

Cross-Language IR

• TREC CLIR (Arabic)
– Standard resources
– Light stemming (see the sketch below)
– Problematic task model

• CLEF Interactive CLIR
– Controlled user studies
– Problematic evaluation design
– Qualitative vs. quantitative
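To make "light stemming" concrete: for Arabic CLIR, light stemmers strip a small set of frequent prefixes and suffixes rather than doing full morphological analysis. The toy sketch below uses an illustrative affix subset, not the exact inventory of any published stemmer.

```python
# Illustrative affix lists (a small subset of what published Arabic
# light stemmers such as light10 remove).
PREFIXES = ("وال", "ال", "و")      # wa+al-, al-, wa-
SUFFIXES = ("ها", "ات", "ة", "ي")  # -ha, -at, -a(t), -i

def light_stem(token: str) -> str:
    """Strip at most one prefix and one suffix, keeping a stem of
    at least three characters."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= 3:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= 3:
            token = token[len(s):]
            break
    return token
```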

Page 14:

Speech Retrieval

• TREC Spoken Document Retrieval
– The “solved problem”

• CLEF Cross-Language Speech Retrieval
– Grounded queries
– Start time error evaluation measure (see the sketch below)

• FIRE QA for the Spoken Web
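The "start time error" measure arose because conversational speech has no document boundaries: systems return a playback start time, and credit decreases with the distance between the system's start time and the assessor's. The penalty shape below is a hypothetical linear decay for illustration; the actual CLEF CL-SR measure (a generalized average precision) differed in detail.

```python
def start_time_credit(predicted_s, judged_s, window_s=30.0):
    """Partial credit that decays linearly with start-time error (in
    seconds) and reaches zero once the error exceeds the window.
    Hypothetical penalty shape, for illustration only."""
    error = abs(predicted_s - judged_s)
    return max(0.0, 1.0 - error / window_s)
```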

Page 15:

TREC Legal Track

• Iterative task design

• Sampling (see the recall-estimation sketch below)

• Measurement error

• Families

• Cultures
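"Sampling" and "measurement error" go together here: e-discovery collections were far too large to judge exhaustively, so the Legal track estimated recall from samples of judged documents and reported the resulting uncertainty. A minimal sketch, assuming simple random sampling (the track actually used more efficient stratified designs):

```python
import math

def estimated_recall(produced, sample, z=1.96):
    """Estimate recall of a production set from a uniform sample of
    judged documents; returns (estimate, lower, upper) using a
    normal-approximation 95% confidence interval.
    sample: list of (doc_id, is_responsive) pairs."""
    responsive = [doc for doc, is_responsive in sample if is_responsive]
    if not responsive:
        return None  # no responsive documents sampled; cannot estimate
    p = sum(1 for doc in responsive if doc in produced) / len(responsive)
    half = z * math.sqrt(p * (1.0 - p) / len(responsive))
    return p, max(0.0, p - half), min(1.0, p + half)
```

The width of that interval is the measurement error: with few responsive documents in the sample, the recall estimate can be too loose to distinguish systems.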

Page 16:

What’s in a Test Collection?

• Queries

• Documents

• Relevance judgments
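In TREC practice these three parts are concrete files: topic statements, a document collection, and relevance judgments ("qrels") with one judgment per line in the format `topic iteration docno relevance`. A minimal loader:

```python
from collections import defaultdict

def load_qrels(path):
    """Read TREC-format qrels into {topic: set of relevant docnos},
    treating any positive judgment as relevant."""
    qrels = defaultdict(set)
    with open(path) as f:
        for line in f:
            topic, _iteration, docno, relevance = line.split()
            if int(relevance) > 0:
                qrels[topic].add(docno)
    return qrels
```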

Page 17:

What’s in a Test Collection?

• Queries

• Content

• Units of judgment

• Relevance judgments

• Evaluation measure(s)
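The revised list replaces "documents" with "content" and "units of judgment" because, as the sea stories show, the judged unit need not be a document: it may be a passage, a story in an unsegmented stream, or a document family. Measures then score ranked units against the judgments; the sketch below shows precision at k under the usual convention for pooled collections that unjudged units count as nonrelevant (a convention, not a law).

```python
def precision_at_k(ranked_units, relevant_units, k=10):
    """Fraction of the top k returned units that were judged
    relevant; unjudged units are scored as nonrelevant."""
    top = ranked_units[:k]
    return sum(1 for unit in top if unit in relevant_units) / k
```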

Page 18:

Personality Types

• Innovators

• Organizers

• Optimizers

• Deployers

• Resourcers

Page 19:

Some Takeaways

• Progressive invalidation

• Social engineering

• Innovation from outside

Page 20:

A Final Thought

It isn’t what you don’t know that limits your thinking.

Rather, it is what you know that isn’t true.