TRANSCRIPT
The Evolution of Shared-Task Evaluation
Douglas W. Oard
College of Information Studies and UMIACS
University of Maryland, College Park, USA
FIRE, December 4, 2013
The Story
• Evaluation-guided research
• The three C’s
• Five examples
• Thinking forward
Evaluation-Guided Research
• Information Retrieval
• Text classification
• Automatic Speech Recognition
• Optical Character Recognition
• Named Entity Recognition
• Machine Translation
• Extractive summarization
• …
Key Elements
• Task model
• Single-valued evaluation measure
• Affordable evaluation process
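To make the "single-valued evaluation measure" concrete, here is a minimal sketch (my own illustration, not from the talk) of one such measure from information retrieval: average precision, which collapses an entire ranked list into one number. The function and variable names are my own.

```python
# A single-valued evaluation measure: average precision for one ranked
# list, given the set of document IDs judged relevant for the query.
def average_precision(ranking, relevant):
    """Mean of precision@k over each rank k where a relevant doc appears."""
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

# Relevant documents appear at ranks 2 and 4, so AP = (1/2 + 2/4) / 2.
ranking = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2"}
print(average_precision(ranking, relevant))  # prints 0.5
```

Averaging this score over all queries gives a single number per system, which is what makes leaderboard-style comparison (and its critiques, below) possible.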
Critiques
• Early convergence
• Duplicative ($)
• Incrementalism
• Privileging the measurable
The Big Four
• TREC
• NTCIR
• CLEF
• FIRE
10 More
• TDT
• Amaryllis
• INEX
• TRECVid
• TAC
• MediaEval
• STD
• OAEI
• CoNLL
• WePS
What We Create
• Collections
• Comparison points
  – Baseline results
• Communities
• Competition?
Elsewhere in the Ecosystem …
• Capacity
  – From universities, industry, individuals, and funding agencies
• Completed work
  – Often requires working outside our year-long innovation cycles with rigid timelines
• Culling
  – Conferences and journals are the guardians of community standards
A Typical Task Life Cycle
• Year 1:
  – Task definition
  – Evaluation design
  – Community building
• Year 2:
  – Creating training data
• Year 3:
  – Reusable test collection
  – Establishing strong baselines
Some Sea Stories
• TDT
• CLIR
• Speech Retrieval
• E-Discovery
Topic Detection and Tracking
• Cultures
  – Speech, sponsor
• Event-based relevance
• Document boundary discovery
• Complexity
  – 5 tasks, 3 languages, 2 modalities
• Lasting influence
Cross-Language IR
• TREC CLIR (Arabic)
  – Standard resources
  – Light stemming
  – Problematic task model
• CLEF Interactive CLIR
  – Controlled user studies
  – Problematic evaluation design
  – Qualitative vs. quantitative
Speech Retrieval
• TREC Spoken Document Retrieval
  – The “solved problem”
• CLEF Cross-Language Speech Retrieval
  – Grounded queries
  – Start-time error evaluation measure
• FIRE QA for the Spoken Web
TREC Legal Track
• Iterative task design
• Sampling
• Measurement error
• Families
• Cultures
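The sampling point above can be illustrated with a sketch: when a collection is too large to judge exhaustively (as in e-discovery), recall can only be estimated by judging a random sample of each stratum and scaling up. This is my own hedged illustration of the general idea, not the TREC Legal track's exact protocol; the two-stratum design and all names are illustrative.

```python
# Estimate recall by judging random samples, not the whole collection.
# Strata here: documents the system retrieved vs. documents it did not.
import random

def estimate_relevant(stratum, judge, sample_size, rng):
    """Estimate how many documents in a stratum are relevant by judging
    a simple random sample and scaling by the stratum's size."""
    sample = rng.sample(stratum, min(sample_size, len(stratum)))
    p = sum(judge(doc) for doc in sample) / len(sample)
    return p * len(stratum)

def estimated_recall(retrieved, not_retrieved, judge, sample_size=100, seed=0):
    """Recall = (est. relevant retrieved) / (est. relevant anywhere)."""
    rng = random.Random(seed)
    rel_ret = estimate_relevant(retrieved, judge, sample_size, rng)
    rel_unret = estimate_relevant(not_retrieved, judge, sample_size, rng)
    total = rel_ret + rel_unret
    return rel_ret / total if total else 0.0

# Illustrative use: a synthetic collection where every third document
# happens to be relevant.
retrieved = list(range(0, 1000))
not_retrieved = list(range(1000, 10000))
recall = estimated_recall(retrieved, not_retrieved, lambda d: d % 3 == 0)
```

The measurement-error point follows directly: the estimate's variance depends on the sample sizes, so small samples from a huge unretrieved stratum can dominate the uncertainty in the recall figure.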
What’s in a Test Collection?
• Queries
• Documents
• Relevance judgments
What’s in a Test Collection?
• Queries
• Content
• Units of judgment
• Relevance judgments
• Evaluation measure(s)
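One way to make the expanded list concrete (my own sketch, not the speaker's) is to model a test collection as a data structure in which judgments attach to "units of judgment" drawn from the content, such as passages or time spans, rather than necessarily to whole documents. All class and field names here are illustrative.

```python
# A minimal model of the expanded test-collection view: queries,
# content, judgable units, graded relevance judgments.
from dataclasses import dataclass, field

@dataclass
class TestCollection:
    queries: dict     # query_id -> query text
    content: dict     # unit_id -> a judgable unit (passage, time span, ...)
    judgments: dict = field(default_factory=dict)  # (query_id, unit_id) -> grade

    def relevant_units(self, query_id, min_grade=1):
        """Units judged relevant (at or above min_grade) for a query."""
        return {u for (q, u), g in self.judgments.items()
                if q == query_id and g >= min_grade}

tc = TestCollection(
    queries={"q1": "jaguar habitat"},
    content={"u1": "passage one", "u2": "passage two", "u3": "passage three"},
    judgments={("q1", "u1"): 2, ("q1", "u2"): 0, ("q2", "u3"): 1},
)
```

Keeping the evaluation measure outside this structure, as a separate function applied to system output plus judgments, is what lets one collection support several measures.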
Personality Types
• Innovators
• Organizers
• Optimizers
• Deployers
• Resourcers
Some Takeaways
• Progressive invalidation
• Social engineering
• Innovation from outside
A Final Thought
It isn’t what you don’t know that limits your thinking.
Rather, it is what you know that isn’t true.