The 2003 TIDES Surprise Language Exercise

22 August 2003 CLEF 2003 The 2003 TIDES Surprise Language Exercise Douglas W. Oard University of Maryland

Page 1: The 2003 TIDES  Surprise Language Exercise

22 August 2003 CLEF 2003

The 2003 TIDES Surprise Language Exercise

Douglas W. Oard

University of Maryland

Page 2: The 2003 TIDES  Surprise Language Exercise

Outline

• Thinking out of the box

• Some results

• Lessons Learned

Page 3: The 2003 TIDES  Surprise Language Exercise

Surprise Language Framework

• Zero-resource start (treasure hunt)

• Time constrained (10 or 29 days)

• English Users / Documents in language X

• Character-coded text

• Research-oriented

• Intensely collaborative (team-based)

Page 4: The 2003 TIDES  Surprise Language Exercise

Schedule

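Cebuano
• Announce: Mar 5
• Test Data: (none listed)
• Stop Work: Mar 14
• Newsletter: April
• Talks: May 30 (HLT)
• Papers: (none listed)

Hindi
• Announce: Jun 1
• Test Data: Jun 27
• Stop Work: Jun 30
• Newsletter: August
• Talks: Aug 5 (TIDES PI)
• Papers: Aug 15 (TALIP)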

Page 5: The 2003 TIDES  Surprise Language Exercise

16 Participating Teams

Cebuano and Hindi

ISI

Maryland

NYU

Johns Hopkins

Sheffield

LDC

CMU

UC Berkeley

MITRE

Hindi Only

U Mass

Alias-i

BBN

IBM

CUNY

KAT

SPAWAR

Page 6: The 2003 TIDES  Surprise Language Exercise

• Five evaluated tasks
– Automatic CLIR (English queries)
– Topic tracking (English examples, event-based)
– Machine translation into English
– English “Headline” generation
– Entity tagging (five MUC types)

• Several useful components
– POS tags, morphology, time expressions, parsing

• Several demonstration systems
– Interactive CLIR (two systems)
– Cross-language QA (English Q, Translated A)
– Machine translation (+ Translation elicitation)
– Cross-document entity tracking

Page 7: The 2003 TIDES  Surprise Language Exercise

Hindi Participants

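[Figure: matrix of participants against activity areas; cell contents not recoverable.]

Participants: Alias-i, UC Berkeley, BBN, CMU, CUNY, Johns Hopkins, IBM, ISI, LDC, MITRE, NYU, SPAWAR, U. Sheffield, U. Massachusetts, U. Maryland

Activity areas: Resource Generation, Detection, Extraction, Summarization, Translation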

Page 8: The 2003 TIDES  Surprise Language Exercise

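[Diagram; layout not recoverable. Labels, grouped as they appear:]

• Systems: Translation, Detection, Extraction, Summarization
• Resource Harvesting: Books, Web, People
• Resources: Lexicons, Corpora
• Innovation Cycle: Systems, Research Results, Capture Process Knowledge
• Coordination: Strategy, Push, Organize, Talk
• Axis: Time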

Page 9: The 2003 TIDES  Surprise Language Exercise
Page 10: The 2003 TIDES  Surprise Language Exercise

The Synchronization Challenge

Page 11: The 2003 TIDES  Surprise Language Exercise

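Cebuano MT Results

[Bar chart: BLEU (%), axis 0 to 12, one bar per resource combination; bar values not recoverable.]

Legend: Bible, Cebuano book, Dict, Melamed, News

Bar labels (apparently combinations of the legend initials, as best they can be separated): D, DC, DCN, DCNB, DBN, DB, D5CN5BM, DCNBM, DCM, DBC, D5CN10BM, DCNM, DBCM, DBNM, DN, DBM, DNM, DM, DCN5BM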

Page 12: The 2003 TIDES  Surprise Language Exercise

Cebuano Interactive CLIR

• Starting Point: iCLEF 2002 system (German)
– Interface: “synonyms”/examples (parallel)/MT
– Back end: InQuery/Pirkola’s method

• 3-day porting effort
– Cebuano indexing (no stemming)
– One-best gloss translation (bilingual term list)

• Informal Evaluation
– 2 Cebuano native speakers (at ISI)
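
The back end listed above relies on Pirkola's method, which treats all known translations of one query term as a single synonym set: term frequency is summed over the translations, and document frequency counts documents containing any of them. A minimal sketch of that bookkeeping follows, assuming a toy tokenized collection. The function name `pirkola_stats`, the tokens `x1`/`x2`, and the term list are made up for illustration and are not Cebuano data; InQuery's actual scoring is not reproduced.

```python
from collections import Counter


def pirkola_stats(translations, docs):
    """Return (tf per document, df) for translations pooled as one term."""
    translations = set(translations)
    tfs = []
    for doc in docs:
        counts = Counter(doc)
        # Synonym-set tf: sum the counts of every translation in this document.
        tfs.append(sum(counts[t] for t in translations))
    # Synonym-set df: documents that contain at least one translation.
    df = sum(1 for tf in tfs if tf > 0)
    return tfs, df


# Hypothetical collection: each document is a list of tokens.
docs = [
    ["x1", "y", "x2", "x1"],
    ["y", "z"],
    ["x2", "z"],
]
# Hypothetical bilingual term list: one English term, two translations.
term_list = {"water": ["x1", "x2"]}

tfs, df = pirkola_stats(term_list["water"], docs)
print(tfs, df)
```

The pooled statistics can then be passed to any tf-idf style ranking function in place of single-term counts, which keeps a rare translation from dominating the score the way it might if each translation were weighted separately.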

Page 13: The 2003 TIDES  Surprise Language Exercise

Hindi syntax is generally very “regular”

• Subject – Object – Verb is the preferred order
– John saw Mary. = जॉन ने मेरी को देखा।

• Presence of (occasionally deleted) case markers often permits reordering
– John saw Mary. = मेरी को जॉन ने देखा।

• English (or western) punctuation is pervasive in many modern texts
– John said, “I am here” = जॉन ने कहा, “मैं यहाँ हूँ”

• The subject may be omitted in some contexts
– A: Where is John? B: [He] went home.
– अ: जॉन कहाँ है? ब: [वह] घर चला गया।

Page 14: The 2003 TIDES  Surprise Language Exercise

Hindi Encoding

• Text encoding for storage and transmission and text rendering for display and printing are separated

• Which syllable constituents get their own code-points?
– Several 8-bit encodings: after assigning a code point to each stand-alone vowel and full consonant, and to half-consonants and vowels within a syllable, spare code-points get used for assorted/frequent CC clusters.
– Unicode UTF-16: Only stand-alone vowels, full consonants and vowels within syllables have their own code-points. All half consonants are realized by a ‘full consonant + halant’ sequence

• Choice of the “grammar” for syllable construction and rendering?
– Several 8-bit encodings write the code-points in display order, simplifying the rendering program
– Unicode writes them in pronunciation order, making for a considerably more complex display program
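
The two Unicode points above can be checked directly in Python. This is a small sketch using only the standard `unicodedata` module; the helper name `code_points` and the choice of example syllables are illustrative. It shows a conjunct stored as full consonant + halant (Unicode names the halant VIRAMA) + full consonant, and a syllable whose vowel sign is stored after the consonant even though it is drawn to its left.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA (full consonant)
SSA = "\u0937"     # DEVANAGARI LETTER SSA (full consonant)
HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA (the halant)
SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I (vowel within a syllable)


def code_points(text):
    """List each code point in storage order with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Half KA + SSA: no code point exists for the half consonant itself.
conjunct = KA + HALANT + SSA
# "ki": consonant first, vowel sign second (pronunciation order),
# although the vowel sign is rendered to the left of the consonant.
syllable = KA + SIGN_I

for line in code_points(conjunct) + code_points(syllable):
    print(line)
```

A renderer therefore has to reorder and combine code points before display, which is the extra complexity the slide attributes to Unicode.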

Page 15: The 2003 TIDES  Surprise Language Exercise

Hindi Week 1: Porting

• Monday
– 2,973 BBC documents (UTF-8)
– Batch CLIR (no stem, 2/3 known items rank 1)

• Tuesday
– MIRACLE (“ITRANS”, gloss)
– Stemmer (implemented from a paper)

• Wednesday
– BBC CLIR collection (19 topics, known item)

• Friday
– Parallel text (Bible: 900k words, Web: 4k words)
– Devanagari OCR system

Page 16: The 2003 TIDES  Surprise Language Exercise

Hindi Weeks 2/3/4: Exploration

• N-grams (trigrams best for UTF-8)
• Relative Average Term Frequency (Kwok)
• Scanned bilingual dictionary (Oxford)
• More topics for test collection (29)
• Weighted structured queries (IBM lexicon)
• Alternative stemmers (U Mass, Berkeley)
• Blind relevance feedback
• Transliteration
• Noun phrase translation
• MIRACLE integration (ISI MT, BBN headlines)
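
As an illustration of the n-gram item, here is a minimal sketch of n-gram indexing terms. The slide does not say which unit was used, so this assumes overlapping character n-grams over Unicode code points within each whitespace-separated word; the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams within each whitespace-separated word."""
    grams = []
    for word in text.split():
        if len(word) <= n:
            # Words no longer than n are kept whole.
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(char_ngrams("hindi text"))
```

Python strings index by code point, so the same function applies unchanged to Devanagari text; whether code points, bytes, or rendered syllables make the better unit is exactly the kind of question such an exploration would test.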

Page 17: The 2003 TIDES  Surprise Language Exercise
Page 18: The 2003 TIDES  Surprise Language Exercise
Page 19: The 2003 TIDES  Surprise Language Exercise

Formative Evaluation

[Line chart: Mean Reciprocal Rank (y-axis, 0 to 0.8) by Day (= Date - 1) (x-axis, 0 to 30); data points not recoverable.]
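
The measure on this chart, mean reciprocal rank, averages 1/rank of the single known relevant item over all topics. A minimal sketch follows; the function name and the example ranks are made up for illustration and are not results from the exercise. Counting a missed item as zero is a common convention, assumed here.

```python
def mean_reciprocal_rank(ranks):
    """ranks: rank of the known item per topic, or None if it was not retrieved."""
    if not ranks:
        return 0.0
    # A topic whose known item was not retrieved contributes 0 to the sum.
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for four topics (not data from the exercise).
ranks = [1, 1, 2, None]
mrr = mean_reciprocal_rank(ranks)
print(mrr)
```

Because the known-item collection has one target document per topic, a single number per day is enough to track progress as components are swapped in.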

Page 20: The 2003 TIDES  Surprise Language Exercise

Transliteration

• Importance: Names, loan words
– दक्षिण कोरिया (Dakshin Korea)

• Pronunciation crosswalk English->Hindi
– English pronunciation (Festival)
– Overgenerate Hindi characters (hand-built rules)
    • Doctor => d aa k t ax r OR d ao k t ax r
– Rank n-best using bigrams (Hindi name list)

• Treat as alternate translations for CLIR
– Pirkola’s method
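
The ranking step above can be sketched as a character-bigram model trained on a name list and used to score over-generated candidates. This is a minimal sketch under stated assumptions: the slide does not specify the smoothing, so add-one smoothing is assumed; all function names are hypothetical; and the romanized strings are made-up stand-ins for Hindi character sequences, not the actual name list.

```python
import math
from collections import Counter


def train_bigrams(names):
    """Count character bigrams, with ^ and $ marking word boundaries."""
    bigrams, unigrams, alphabet = Counter(), Counter(), set()
    for name in names:
        padded = "^" + name + "$"
        alphabet.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    return bigrams, unigrams, len(alphabet)


def log_prob(candidate, model):
    """Add-one smoothed bigram log-probability of a candidate string."""
    bigrams, unigrams, vocab = model
    padded = "^" + candidate + "$"
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(padded, padded[1:])
    )


def rank_candidates(candidates, model, n=5):
    """Return the n best candidates, most probable first."""
    return sorted(candidates, key=lambda c: log_prob(c, model), reverse=True)[:n]


# Stand-in name list and candidates (illustrative only).
names = ["daktar", "dilip", "kamal", "ramta"]
model = train_bigrams(names)
candidates = ["dqktxr", "daktar"]
best = rank_candidates(candidates, model)
print(best)
```

The surviving n-best strings would then be handed to the retrieval system as alternate translations of the English name, as the last bullet describes.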

Page 21: The 2003 TIDES  Surprise Language Exercise

Some Challenges

• Formative evaluation

• Synchronize variable-rate efforts
– Soccer, not football

• Integration

• Capturing lessons learned
– See the forest, not just the trees

Page 22: The 2003 TIDES  Surprise Language Exercise

For More Information

• TIDES Newsletter
– Cebuano: April
– Hindi: August

• Papers
– NAACL/HLT Short paper
– MT Summit (late Sep)
– ACM TALIP Special Issue

• Demonstration systems
– Contact individual sites