An Effective Approach for Searching Closest Sentence Translations from The Web
Ju Fan, Guoliang Li, and Lizhu Zhou
Database Research Group, Tsinghua University
DASFAA 2011 – Apr. 23, Hong Kong
Outline
• Introduction
• Overview of Our Approach
• Phrase-Based Similarity Model
• Phrase Selection
• Experiments
• Conclusion
04/21/23 2SCST@DASFAA 2011
Background
• Parallel sentences on the Web
▪ Sentences with a well-translated counterpart
▪ An English-to-Chinese example:
  Obama said he hopes to get Congress to approve it next year
  奥巴马总统说他争取让希望国会明年批准该协议。 -- blog.hjenglish.com
• A rich source for translation
• Commercial Systems
▪ Parallel sentences, e.g., The result is good / 结果很好
Background
• Sentence-Level Translation Aid
▪ Web → Parallel Sentence Discovery and Extraction → Parallel Sentence Database (Sen 1 (E-C), Sen 2 (E-C), Sen 3 (E-C), ..., Sen n (E-C))
▪ Query Sentence (English) → Sentence Matching → Closest Sentences with Translation
• Research Issue
▪ An effective similarity model between sentences in the source language (e.g., English sentences)
Motivation
• Existing approaches:
▪ Word-based, e.g., translation model, edit distance, ...: cannot capture the order of words
▪ Gram-based, e.g., N-gram, V-gram: do not consider syntactic information
▪ All subsequences of a sentence: too expensive
• We propose a phrase-based similarity model that considers:
1. Syntactic information
2. Frequency information
3. Lengths of phrases
Problem Definition
• Data: a database of parallel sentences
▪ Sentence 1: English - Chinese
▪ Sentence 2: English - Chinese
▪ Sentence 3: English - Chinese
▪ ...
• Query: a query sentence (English)
• Translator: matches the query against the data and returns the answer
• Answer: sentences with their translations
Phrase-Based Sentence Matching
• Offline: Parallel Sentences → Phrase Selection → Phrase Database
• Online: the Similarity Model compares the phrases f1, f2, ..., fn of the query q with the phrases f'1, f'2, ..., f'n of a sentence s
Phrase-Based Similarity Model
• Online component of the framework: the Similarity Model compares the phrases of the query q with the phrases of a sentence s
Similarity Model
• Query sentence q with phrase set F_q = {f1, f2, f3, ..., fm}
• A sentence in the DB, s, with phrase set F_s = {f'1, f'2, f'3, ..., f'n}
• Shared phrases: f ∈ F_q ∩ F_s
• $\mathrm{sim}(q,s) = \sum_{f \in F_q \cap F_s} \phi(q,f)\,\phi(s,f)\,w(f)$
▪ φ(q,f): syntactic importance of f to q
▪ φ(s,f): syntactic importance of f to s
▪ w(f): weight of f (IDF)
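A minimal sketch of the similarity sum on this slide, assuming each sentence has already been reduced to a mapping from phrase to its syntactic importance φ and that w comes from an IDF table. The names `sim`, `phi_q`, `phi_s`, `idf` and all numeric values are hypothetical and chosen only for illustration.

```python
def sim(phi_q, phi_s, idf):
    """Sum phi(q,f) * phi(s,f) * w(f) over the phrases shared by q and s."""
    shared = phi_q.keys() & phi_s.keys()  # F_q ∩ F_s
    return sum(phi_q[f] * phi_s[f] * idf[f] for f in shared)

# Hypothetical values, for illustration only.
phi_q = {("he", "eat", "apple"): 0.5, ("he", "have"): 0.2}
phi_s = {("he", "eat", "apple"): 0.4, ("red", "apple"): 0.3}
idf = {("he", "eat", "apple"): 2.0, ("he", "have"): 0.5, ("red", "apple"): 1.5}

score = sim(phi_q, phi_s, idf)  # only ("he", "eat", "apple") is shared
```

Phrases that occur in only one of the two sentences contribute nothing, so the score depends only on the shared phrases and their weights.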
Syntactic Importance of Phrases
• $\phi(q,f) = \prod_{m} \alpha_m \prod_{g} \beta_g$
▪ α_m: syntactic weight of a matched term
▪ β_g: penalty for a term in the gap (constant)
• Example
▪ Sentence q: He has eaten an apple
▪ Phrase f: he eaten apple
▪ Gap: has, an
• Dependency tree of q
▪ eaten (root): α0
▪ he, apple, has (depth 1): d·α0 each
▪ an (depth 2): d²·α0
▪ d: a decay factor
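The example on this slide can be worked through in a short sketch. It assumes that α_m = α0·d^depth for a matched term at the given depth of the dependency tree and that every gap term costs the same constant β. The function name and the values α0 = 1.0, d = 0.5, β = 0.8 are illustrative assumptions, not values from the slides.

```python
def syntactic_importance(sentence, phrase, depth, alpha0=1.0, d=0.5, beta=0.8):
    """phi(q,f): product of alpha over matched terms and beta over gap terms.

    Assumes each phrase term occurs exactly once in the sentence, in order.
    Gap terms are the unmatched terms between the first and last matched term.
    """
    positions = [sentence.index(t) for t in phrase]
    matched = set(positions)
    phi = 1.0
    for i in range(positions[0], positions[-1] + 1):
        if i in matched:
            phi *= alpha0 * d ** depth[sentence[i]]  # alpha_m decays with depth
        else:
            phi *= beta  # constant penalty beta_g for a gap term
    return phi

sentence = ["he", "has", "eaten", "an", "apple"]
# Depths read off the dependency tree on the slide.
depth = {"eaten": 0, "he": 1, "has": 1, "apple": 1, "an": 2}
phi = syntactic_importance(sentence, ["he", "eaten", "apple"], depth)
```

With these assumed values the matched terms contribute α0 · dα0 · dα0 = 0.25 and the two gap terms contribute β² = 0.64, so φ = 0.16.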
Features of the Similarity Model
• More General
▪ Subsumes Jaccard, Cosine similarity, ...
• Syntactic Information
▪ Weight of matched terms
▪ Weight of terms in the gap
• Frequency Information
▪ Weight of phrases
High-Quality Phrase Selection
• Offline component of the framework: Phrase Selection extracts phrases from the Parallel Sentences into the Phrase Database
High-Quality Phrase
• Extend grams by allowing discontinuous terms
• A heuristic for selecting phrases
▪ Gap constraint: discontinuous terms must have a syntactic relationship
▪ Frequency constraint: the phrase is infrequent (large IDF); frequency = number of sentences in the DB that contain it
▪ Maximum constraint: 1) not a prefix of another phrase; 2) within the max. length
• Example
▪ Sentence q: He has eaten an apple
▪ Phrase: he eaten apple (discontinuous terms linked by syntactic relationships)
Phrase Selection
• Selecting phrases with gap and maximum constraints
• Sentence s: He ate a red apple
• Sentence graph over the terms he, eat, red, apple, with two kinds of edges:
1) Sequential relationship
2) Syntactic relationship
• Longest path from a node = a phrase satisfying
▪ Gap constraint
▪ Maximum constraint
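A rough sketch of the path idea on this slide, under several assumptions: the graph is stored as an adjacency map that follows word order, each term occurs once, and the one syntactic edge (eat → apple) is a guess, since the figure's edges cannot be recovered from the transcript. The function `maximal_paths` is hypothetical. It returns paths that either cannot be extended or have reached the length cap, which is one way to read "longest path from a node".

```python
def maximal_paths(edges, start, max_len):
    """Enumerate paths from `start` that cannot be extended or hit max_len."""
    paths = []

    def walk(path):
        successors = edges.get(path[-1], [])
        if len(path) == max_len or not successors:
            paths.append(tuple(path))  # maximal: not a prefix of a longer path
            return
        for nxt in successors:
            walk(path + [nxt])

    walk([start])
    return paths

# Sequential edges: he->eat, eat->red, red->apple.
# Assumed syntactic edge: eat->apple.
edges = {"he": ["eat"], "eat": ["red", "apple"], "red": ["apple"]}
phrases = maximal_paths(edges, "he", max_len=4)
```

Because every edge is either sequential or syntactic, any path skips terms only along a syntactic edge, which is how the gap constraint is respected in this sketch.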
Phrase Selection
• Select phrases with frequency constraint (Threshold = 2)
• Sentences in the DB
▪ He has an apple
▪ He ate a red apple
▪ He has a pencil
▪ He has
• Use a frequency trie
▪ [Figure: frequency trie with nodes N0(8), N1(4), N2(3), ... over the terms he, have, eat, red, apple, pencil; # ends a phrase]
• Prune frequent phrases
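A minimal sketch of a frequency trie, assuming that each node counts how many inserted phrases pass through it and that a phrase counts as frequent once its count reaches the threshold. Both the counting rule and the comparison direction are assumptions; the class and function names are hypothetical, and the counts are not meant to reproduce the node labels in the figure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # number of inserted phrases passing through this node

def insert(root, phrase):
    node = root
    node.count += 1
    for term in phrase:
        node = node.children.setdefault(term, TrieNode())
        node.count += 1

def frequency(root, phrase):
    """Count of inserted phrases that start with `phrase` (0 if absent)."""
    node = root
    for term in phrase:
        node = node.children.get(term)
        if node is None:
            return 0
    return node.count

def is_infrequent(root, phrase, threshold):
    return frequency(root, phrase) < threshold  # assumed reading of the threshold

root = TrieNode()
for p in [("he", "have", "apple"), ("he", "eat", "red", "apple"), ("he", "have", "pencil")]:
    insert(root, p)

kept = is_infrequent(root, ("he", "have", "apple"), threshold=2)  # count 1
pruned = not is_infrequent(root, ("he", "have"), threshold=2)     # count 2
```

Sharing prefixes in the trie means the frequency of every prefix is available from a single walk, so frequent phrases can be pruned without rescanning the sentences.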
Experiment Setup
• Data Sets
▪ DI: 520,899 parallel sentences from ICIBA
▪ DC: 800,000 parallel sentences from CNKI
• Baseline Methods
▪ Jaccard Coefficient, Edit Distance, Cosine Similarity
▪ Translation Model Methods (TM)
▪ Cosine Similarity with VGRAM
Experiment Setup
• Evaluation Metrics
▪ BLEU
◦ A well-known metric for machine translation
◦ Example:
  q: He has eaten an apple (reference translation: 他吃了一个苹果)
  s: He has a pencil (translation: 他有一支铅笔)
  BLEU compares the translation of s with the reference translation of q
▪ Precision
◦ A user study to label whether the translations are useful
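The core ingredient of BLEU is the clipped (modified) n-gram precision, sketched below on the two Chinese translations from this slide. This is not the full metric, which also combines several n-gram orders and applies a brevity penalty. Tokenizing by character is an assumption made here for simplicity, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, clipped by count."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = list("他吃了一个苹果")  # reference translation of q
candidate = list("他有一支铅笔")    # translation attached to s
p1 = clipped_precision(candidate, reference, 1)  # shared characters: 他, 一
p2 = clipped_precision(candidate, reference, 2)  # no shared bigrams
```

Here two of the six candidate characters appear in the reference and no bigram does, so this retrieved translation would score poorly against the reference.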
Effects of Phrase Selection
• [Figure: effect of max. length on DI]
• [Figure: effect of freq. threshold on DC]
Comparison with Similarity Models
• [Figure: comparison on the DI data set]
Comparison with Existing Methods
• [Figure: comparison on the DC data set]
User Studies
• Methods used in commercial systems
• [Figure: comparison on the DI data set]
Conclusion
• Searching closest sentence translations from the Web
• A phrase-based sentence similarity model
• High-quality phrase selection methods
• Extensive experiments and user studies
Thanks
My Homepage: http://dbgroup.cs.tsinghua.edu/fanju
Frequency Constraint
• Index structures
▪ Phrase → Sentence
• Frequent phrases → large inverted index
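A minimal sketch of a phrase-to-sentence inverted index, assuming sentences are identified by integer ids and phrases are tuples of terms. The function names and the sample data are hypothetical.

```python
from collections import defaultdict

def build_index(sentence_phrases):
    """Map each phrase to the set of sentence ids that contain it."""
    index = defaultdict(set)
    for sid, phrases in sentence_phrases.items():
        for phrase in phrases:
            index[phrase].add(sid)
    return index

def candidates(index, query_phrases):
    """Sentences sharing at least one phrase with the query."""
    result = set()
    for phrase in query_phrases:
        result |= index.get(phrase, set())
    return result

# Hypothetical data, for illustration only.
sentence_phrases = {
    1: [("he", "have", "apple")],
    2: [("he", "eat", "red", "apple"), ("he", "eat", "apple")],
    3: [("he", "have", "pencil")],
}
index = build_index(sentence_phrases)
hits = candidates(index, [("he", "eat", "apple"), ("he", "have", "pencil")])
```

A phrase that occurs in many sentences has a long posting set, which illustrates why frequent phrases make the inverted index large.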