Translingual Topic Tracking with PRISE
Gina-Anne Levow and Douglas W. Oard
University of Maryland
February 28, 2000
Roadmap
• The signal to noise perspective
• Our topic tracking system
• Boosting signal
• Reducing noise
• Future directions
Translingual Tracking Challenges
• Segmentation of text adds noise
– Unknown words
• Transcription of speech adds noise
– Unknown words
– Easily confused words (e.g., homophones)
• Translation adds noise
– Vocabulary mismatch with ASR / segmentation
– Incorrect translation selection
Improving the Signal to Noise Ratio
• Translation coverage
– Enrich the term list using large dictionaries
• Translation selection
– Statistical evidence from comparable corpora
• Enriching indexing vocabulary
– Add related terms from comparable corpora
• Score normalization
– Learn source dependence from dry-run collection
Preview
• Focusing on noise alone is not enough
– Signal boosting is a big win
• Baseline: Systran
– Goal: choose the best single translation
• Two signal-boosting strategies beat Systran
– Choose the best two translations
– Add related terms for indexing (found in related documents)
Improvements Since TDT-2
• Weight selection
– PRISE “bm25idf”
• Query representation
– Vector of 180 most selective terms by χ² test
• Two-pass normalization
– Source-specific, 5 source classes: NYT, APW, Eng. Speech, Man. Text, Man. Speech
– Topic-specific: average of example story scores
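The χ² term selection named above could be sketched roughly as follows. This is only an illustration, assuming a 2×2 contingency table of term presence against on-topic and off-topic stories; the slides do not say which variant of the test was used, and the names `chi_square` and `select_terms` are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a = on-topic stories containing the term
    b = off-topic stories containing the term
    c = on-topic stories without the term
    d = off-topic stories without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_terms(on_topic, off_topic, k=180):
    """Return up to k terms most strongly associated with the on-topic stories.

    Each story is a list of tokens. Only terms that are relatively more
    common on-topic than off-topic are kept (an assumption of this sketch).
    """
    on_df = Counter(t for story in on_topic for t in set(story))
    off_df = Counter(t for story in off_topic for t in set(story))
    n_on, n_off = len(on_topic), len(off_topic)
    scored = []
    for term, a in on_df.items():
        b = off_df.get(term, 0)
        if a * n_off <= b * n_on:  # not over-represented on-topic
            continue
        scored.append((chi_square(a, b, n_on - a, n_off - b), term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]
```

With `k=180` the returned list would correspond to the size of the query vector mentioned on the slide; how the selected terms were weighted is not shown here.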
[Figure: two panels, Mandarin (All Sources) and English (All Sources), each comparing source-independent with source-dependent normalization.]
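One way to picture two-pass normalization is the sketch below. It rests on assumptions the slides do not confirm: the source-specific pass is modeled as a z-score using per-source statistics learned from dry-run scores, and the topic-specific pass divides by the mean of the topic's example story scores (taken to be already source-normalized and positive). The function names are hypothetical.

```python
from statistics import mean, pstdev


def learn_source_stats(dry_run_scores):
    """Map each source class to (mean, std) of its dry-run raw scores."""
    stats = {}
    for source, scores in dry_run_scores.items():
        sd = pstdev(scores)
        stats[source] = (mean(scores), sd if sd > 0 else 1.0)
    return stats


def normalize(raw_score, source, source_stats, example_scores):
    """Pass 1: shift and scale by source class (assumed z-score).
    Pass 2: scale by the average score of the topic's example stories."""
    mu, sd = source_stats[source]
    source_norm = (raw_score - mu) / sd
    topic_ref = mean(example_scores)
    return source_norm / topic_ref if topic_ref else source_norm
```

The point of the first pass is that a raw score from, say, Mandarin speech and one from NYT newswire are not directly comparable; the second pass puts topics with strong and weak example stories on a common scale.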
Translingual Approaches
• Indexing strategies (boosting signal)
– Post-translation document expansion
– n-best translation
• Translation tweaks (reducing noise)
– Enriched bilingual term list
– Corpus-based translation selection
– Pre-translation Mandarin stopword removal
Translingual Runs
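| Run | Term List | Side Corpus | Mandarin Stopwords | Document Expansion | n-Best |
|---|---|---|---|---|---|
| 1 | LDC | Brown | | | 1 |
| 2* | Combined | Brown | | | 1 |
| 3 | Combined | TDT | | | 1 |
| 4* | Combined | TDT | Removed | | 1 |
| 5 | Combined | TDT | Removed | | 2 |
| 6* | Combined | TDT | Removed | Yes | 1 |
| 7 | Systran | | | | 1 |

(* = official run scored by NIST)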
Document Expansion
[Figure: document expansion flow diagram. Box labels: Mandarin BN and NWT; ASR Transcript; NMSU Segmenter; Word-to-Word Translation; Single Document; PRISE; Comp. English Corpus; Top 5; Term Selection; Documents to Index; English BN and NWT; Query Vector; PRISE; Results.]
| Run | Term List | Side Corpus | Mandarin Stopwords | Document Expansion | n-Best |
|---|---|---|---|---|---|
| 4 | Combined | TDT | Removed | | 1 |
| 6 | Combined | TDT | Removed | Applied | 1 |

[Figure: Mandarin Newswire Text]
| Run | Term List | Side Corpus | Mandarin Stopwords | Document Expansion | n-Best |
|---|---|---|---|---|---|
| 4 | Combined | TDT | Removed | | 1 |
| 6 | Combined | TDT | Removed | Applied | 1 |

[Figure: Mandarin Broadcast News]
Why Document Expansion Works
• Story-length objects provide useful context
• Ranked retrieval finds signal amid the noise
• Selective terms discriminate among documents
– Enrich index with high IDF terms from top documents
• Similar strategies work well in other applications
– TREC-7 SDR [Singhal et al., 1998]
– CLIR query translation [Ballesteros & Croft, 1997]
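The expansion idea above (use the translated story as a query, take the top-ranked comparable documents, and add their high-IDF terms to the index representation) might be sketched like this. The ranking here is a plain TF-IDF dot product standing in for PRISE, and `idf_table`, `expand`, `top_docs`, and `add_terms` are hypothetical names and parameters, not the authors' settings beyond the "top 5" shown in the diagram.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Inverse document frequency for every term in a tokenized corpus."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(n / df_t) for t, df_t in df.items()}


def expand(translated_doc, corpus, top_docs=5, add_terms=10):
    """Append high-IDF terms from the best-matching comparable documents."""
    idf = idf_table(corpus)
    query = Counter(translated_doc)

    def score(doc):
        tf = Counter(doc)
        return sum(query[t] * tf[t] * idf.get(t, 0.0) ** 2 for t in query)

    # Rank the comparable corpus against the translated story.
    ranked = sorted(range(len(corpus)), key=lambda i: (-score(corpus[i]), i))
    ranked = [i for i in ranked[:top_docs] if score(corpus[i]) > 0]

    # Candidate terms: anything in the top documents not already in the story.
    candidates = {t for i in ranked for t in corpus[i]} - set(translated_doc)
    best = sorted(candidates, key=lambda t: (-idf[t], t))[:add_terms]
    return list(translated_doc) + best
```

Because the added terms come from English documents that were never translated, they can restore vocabulary that segmentation, ASR, or word-to-word translation lost.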
n-best Translation
• We generally used 1-best translation
– Highest unigram frequency in comparable corpus
• Tried 2-best: two highest-ranked translations
– Duplicating unique translations where necessary
• Should reduce miss rate
– But at what cost in false alarms?
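A minimal sketch of n-best selection with duplication, assuming a term list that maps each source word to its candidate translations and a unigram frequency table from the comparable corpus. The names `n_best` and `translate` are hypothetical, and dropping words with no term-list entry is an assumption of this sketch.

```python
def n_best(translations, unigram_freq, n=2):
    """Pick the n candidates with the highest unigram frequency.

    A word with a single candidate gets that candidate repeated n times,
    so every translated word contributes the same number of tokens.
    """
    ranked = sorted(set(translations), key=lambda t: (-unigram_freq.get(t, 0), t))
    chosen = ranked[:n]
    if len(chosen) == 1:
        chosen = chosen * n
    return chosen


def translate(words, term_list, unigram_freq, n=2):
    """Word-to-word translation; words missing from the term list are dropped."""
    out = []
    for w in words:
        out.extend(n_best(term_list.get(w, []), unigram_freq, n))
    return out
```

Setting `n=1` gives the 1-best behavior described in the first bullet; the duplication step matters only for `n > 1`, where it keeps words with a unique translation from being underweighted.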
| Run | Term List | Side Corpus | Mandarin Stopwords | Document Expansion | n-Best |
|---|---|---|---|---|---|
| 4 | Combined | TDT | Removed | | 1 |
| 5 | Combined | TDT | Removed | | 2 |

[Figure: Mandarin Newswire Text]
| Run | Term List | Side Corpus | Mandarin Stopwords | Document Expansion | n-Best |
|---|---|---|---|---|---|
| 4 | Combined | TDT | Removed | | 1 |
| 5 | Combined | TDT | Removed | | 2 |

[Figure: Mandarin Broadcast News]
Comparison With Systran
• Used baseline translations provided by LDC
– Untranslated words not used
– No document expansion
• Systran produces 1-best translations
– Natural comparison is with our 2-best run
| Run | Term List | Side Corpus | Mandarin Stopwords | Document Expansion | n-Best |
|---|---|---|---|---|---|
| 7,7 | Systran | | | | 1 |
| 5,5 | Combined | TDT | Removed | | 2 |

[Figure: Mandarin Newswire Text]
| Run | Term List | Side Corpus | Mandarin Stopwords | Document Expansion | n-Best |
|---|---|---|---|---|---|
| 7,7 | Systran | | | | 1 |
| 5,5 | Combined | TDT | Removed | | 2 |

[Figure: Mandarin Broadcast News]
Bilingual Term List Enrichment
• Two sources of candidate translations
– LDC Chinese-English term list (version 2)
– CETA (Optilex) dictionary: >250K entries, hand-built from >250 sources
• Merging strategy
– Used only general-purpose sources in CETA
– Filtered out definitions
– Removed parenthetical clauses
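The merging steps above could look something like the following sketch. The slides do not say how definitions were detected, so the word-count cutoff `max_words` is purely a stand-in heuristic; lowercasing and the names `clean` and `merge` are likewise assumptions.

```python
import re
from collections import defaultdict

# Matches a parenthetical clause together with any whitespace before it.
PAREN = re.compile(r"\s*\([^)]*\)")


def clean(translation, max_words=4):
    """Strip parentheticals; return None for entries that look like definitions."""
    text = PAREN.sub("", translation).strip()
    text = re.sub(r"\s+", " ", text)
    if not text or len(text.split()) > max_words:
        return None  # empty after cleaning, or too long to be a term (assumed)
    return text.lower()


def merge(*term_lists):
    """Union several {headword: [translations]} lists, keeping first-seen order."""
    merged = defaultdict(list)
    for term_list in term_lists:
        for head, translations in term_list.items():
            for tr in translations:
                c = clean(tr)
                if c and c not in merged[head]:
                    merged[head].append(c)
    return {head: trs for head, trs in merged.items() if trs}
```

Restricting CETA to its general-purpose sources would happen before this step, when choosing which entries to pass in.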
Term List Statistics
| Term List | Mandarin Headwords | Mandarin Entries |
|---|---|---|
| Combined | 195,078 | 341,187 |
| CETA | 91,602 | 169,067 |
| LDC | 127,924 | 187,130 |
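| Run | Term List | Side Corpus | Mandarin Stopwords | Document Expansion | n-Best |
|---|---|---|---|---|---|
| 1 | LDC | Brown | | | 1 |
| 2 | Combined | Brown | | | 1 |

[Figure: two panels, Broadcast News and Newswire Text]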
Translation Preference
• Unigram statistics guided translation selection
– Minimize effect of rare translations, misspellings, …
• Based on dry run stories and rolling update
– Backoff to balanced corpus for unknown words (Brown corpus: variety of genres)
• Compared with use of balanced corpus alone
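A rough sketch of corpus-based translation preference with rolling update and backoff follows. It assumes relative frequencies are compared and that the backoff corpus is consulted only for words never seen in the primary (TDT-style) counts; the class and method names are hypothetical.

```python
from collections import Counter


class TranslationPreference:
    """Prefer the candidate translation that is most frequent in English text."""

    def __init__(self, backoff_counts):
        self.primary = Counter()                 # rolling counts from processed stories
        self.backoff = Counter(backoff_counts)   # balanced corpus, e.g. Brown

    def update(self, story_tokens):
        """Rolling update: fold a newly processed English story into the counts."""
        self.primary.update(story_tokens)

    def freq(self, word):
        """Relative frequency, backing off when the word is unseen in primary."""
        if self.primary[word] > 0:
            return self.primary[word] / sum(self.primary.values())
        total = sum(self.backoff.values())
        return self.backoff[word] / total if total else 0.0

    def best(self, candidates):
        """Return the preferred candidate, or None if there are none."""
        if not candidates:
            return None
        return max(candidates, key=lambda w: (self.freq(w), w))
```

Running it without ever calling `update` corresponds to the balanced-corpus-alone condition in the last bullet.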
| Run | Term List | Side Corpus | Mandarin Stopwords | Document Expansion | n-Best |
|---|---|---|---|---|---|
| 2 | Combined | Brown | | | 1 |
| 3 | Combined | TDT | | | 1 |

[Figure: Mandarin Newswire Text]
Pre-Translation Stopword Removal
• Common words don’t help retrieval much
– But mistranslations might hurt
• We built a Mandarin stopword list
– Processed dictionary to identify function words
– Added the top 300 words in LDC frequency list
– Filtered by two speakers of Mandarin
• Suppressed translation of stopwords
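The stopword procedure above can be illustrated with the sketch below. The manual review by Mandarin speakers is modeled simply as a set of rejected candidates, and `translate_word` is a hypothetical callback standing in for the term-list lookup; neither detail comes from the slides.

```python
def build_stopwords(function_words, freq_ranked_words, rejected, top_k=300):
    """Union of dictionary function words and the top_k most frequent words,
    minus any candidates the human reviewers rejected."""
    candidates = set(function_words) | set(freq_ranked_words[:top_k])
    return candidates - set(rejected)


def translate_without_stopwords(words, stopwords, translate_word):
    """Skip stopwords before translation so they cannot be mistranslated."""
    out = []
    for w in words:
        if w in stopwords:
            continue
        out.extend(translate_word(w))
    return out
```

Removing the words before translation, rather than filtering English stopwords afterward, avoids the case where a Mandarin function word is mapped to a content-bearing English term.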
| Run | Term List | Side Corpus | Mandarin Stopwords | Document Expansion | n-Best |
|---|---|---|---|---|---|
| 3 | Combined | TDT | | | 1 |
| 4 | Combined | TDT | Removed | | 1 |

[Figure: Mandarin Newswire Text]
Summary
• 3 techniques produced improvements:
– Source-dependent normalization
– Post-translation document expansion
– n-best translation
• 3 techniques had little effect:
– Bilingual term list enrichment
– Comparable-corpus-based translation preference
– Pre-translation stopword removal
Future Directions
• Statistical significance
– Can this be added to the scoring software?
• Pre-translation document expansion
– An effective approach in CLIR query translation
• Further experiments with n-best translation
– Probably using a weighted strategy
• Structured translation [Pirkola, 1998]
– Some concern about efficiency, though
Where is the Perfect TDT System?
Run TDT-4 in Nova Scotia!
[Figure labels: Maryland, Penn, BBN]