
An Effective Approach for Searching Closest Sentence Translations from The Web

Ju Fan, Guoliang Li, and Lizhu Zhou

Database Research Group, Tsinghua University

DASFAA 2011 – Apr. 23, Hong Kong


Outline

• Introduction
• Overview of Our Approach
• Phrase-Based Similarity Model
• Phrase Selection
• Experiments
• Conclusion

04/21/23 2SCST@DASFAA 2011


Background

• Parallel sentences on the Web
▪ Sentences with a well-translated counterpart
▪ An English-to-Chinese example:
  Obama said he hopes to get Congress to approve it next year
  奥巴马总统说他争取让希望国会明年批准该协议。 -- blog.hjenglish.com
• A rich source for translation
• Commercial systems
▪ Parallel sentences, e.g., "The result is good" / 结果很好

Background

[Figure: sentence-level translation aid. Parallel Sentence Discovery and Extraction populates a Parallel Sentence Database (Sen 1 (E-C), Sen 2 (E-C), Sen 3 (E-C), ……, Sen n (E-C)) from the Web. Sentence Matching takes a Query Sentence (English) and returns the Closest Sentences with Translation.]

• Research issue: an effective similarity model between sentences in the source language (e.g., English sentences)

Motivation

• Existing approaches:
▪ Word-based, e.g., translation model, edit distance, …: cannot capture the order of words
▪ Gram-based, e.g., N-gram, V-gram: do not consider syntactic information
▪ All subsequences of a sentence: too expensive
• We propose a phrase-based similarity model:
1. Syntactic information
2. Frequency information
3. Lengths of phrases


Problem Definition

• Data: a database of parallel sentences
• Query: a query sentence (English), issued by a translator
• Answer: sentences with their translations
▪ Sentence 1: English - Chinese
▪ Sentence 2: English - Chinese
▪ Sentence 3: English - Chinese

Phrase-Based Sentence Matching

[Figure: framework overview. Components: query q with phrases f1, f2, ……, fn; sentence s with phrases f'1, f'2, ……, f'n; Similarity Model; Phrase Selection over the Parallel Sentences; Phrase Database. Stages are labeled Offline and Online.]


Phrase-Based Similarity Model

Similarity Model

$\mathrm{sim}(q,s) = \sum_{f \in F_q \cap F_s} \varphi(q,f)\,\varphi(s,f)\,w(f)$

• $q$: the query sentence, with phrase set $F_q$ = {f1, f2, f3, ……, fm}
• $s$: a sentence in the DB, with phrase set $F_s$ = {f'1, f'2, f'3, ……, f'n}
• Shared phrases: $f \in F_q \cap F_s$
• $\varphi(q,f)$: syntactic importance of f to q
• $\varphi(s,f)$: syntactic importance of f to s
• $w(f)$: weight of f (IDF)
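The score on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming each sentence has already been reduced to a mapping from its phrases to their syntactic importance φ, and that the phrase weights w(f) are available in a separate mapping. The names `similarity`, `phi_q`, `phi_s`, and `idf`, and all numeric values, are hypothetical and not taken from the talk.

```python
def similarity(phi_q, phi_s, idf):
    """sim(q, s): sum over shared phrases of phi(q, f) * phi(s, f) * w(f).

    phi_q, phi_s: dict mapping a phrase (tuple of terms) to its syntactic importance.
    idf: dict mapping a phrase to its weight w(f); a missing phrase gets
         weight 0.0 (an assumption made for this sketch).
    """
    shared = phi_q.keys() & phi_s.keys()  # F_q ∩ F_s
    return sum(phi_q[f] * phi_s[f] * idf.get(f, 0.0) for f in shared)


# Toy values chosen only to exercise the function.
phi_q = {("he", "eat", "apple"): 0.5, ("he", "have"): 0.25}
phi_s = {("he", "eat", "apple"): 1.0, ("red", "apple"): 0.5}
idf = {("he", "eat", "apple"): 2.0, ("he", "have"): 0.5, ("red", "apple"): 1.5}
score = similarity(phi_q, phi_s, idf)
```

Only the one shared phrase contributes to `score`; phrases that occur in just one of the two sentences are ignored.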

Syntactic Importance of Phrases

$\varphi(q,f) = \prod_{m} \alpha_m \prod_{g} \beta_g$

• $\alpha_m$: syntactic weight of matched term m
• $\beta_g$: penalty for gap term g (constant)
• $d$: a decay factor

Example: sentence q = "He has eaten an apple", phrase f = "he eaten apple", gap = "has", "an".

[Figure: dependency tree of q. The root "eaten" has weight α0; "he", "apple", and "has" have weight d·α0; "an" has weight d²·α0.]
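The example on this slide can be worked through in Python. This is a sketch under stated assumptions: the dependency tree is given by hand as a child-to-parent map, a term at depth k gets weight α0·d^k, and every gap term contributes the same constant penalty β. The values α0 = 1.0, d = 0.5, and β = 0.5 are illustrative placeholders, and the function names are hypothetical.

```python
def term_weights(parent, alpha0=1.0, d=0.5):
    """Weight each term by its depth in the dependency tree: alpha0 * d**depth."""
    weights = {}
    for term in parent:
        depth, node = 0, term
        while parent[node] is not None:  # walk up to the root
            node = parent[node]
            depth += 1
        weights[term] = alpha0 * d ** depth
    return weights


def syntactic_importance(sentence, phrase, parent, alpha0=1.0, d=0.5, beta=0.5):
    """phi(q, f): product of alpha_m over matched terms times beta per gap term."""
    alpha = term_weights(parent, alpha0, d)
    # Assumes every term occurs exactly once in the sentence.
    positions = [sentence.index(t) for t in phrase]
    phi = 1.0
    for t in phrase:
        phi *= alpha[t]
    # Gap terms lie between the first and last matched term but are not matched.
    for i in range(min(positions), max(positions) + 1):
        if sentence[i] not in phrase:
            phi *= beta
    return phi


q = ["he", "has", "eaten", "an", "apple"]
parent = {"eaten": None, "he": "eaten", "has": "eaten", "apple": "eaten", "an": "apple"}
phi = syntactic_importance(q, ["he", "eaten", "apple"], parent)
```

With these placeholder values the matched terms contribute 0.5 · 1.0 · 0.5 and the two gap terms contribute 0.5 each.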

Features of the Similarity Model

• More general
▪ Subsumes Jaccard, Cosine similarity, …
• Syntactic information
▪ Weight of matched terms
▪ Weight of terms in the gap
• Frequency information
▪ Weight of phrases
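One possible reading of the "subsumes Cosine similarity" claim, offered here as an interpretation rather than the authors' derivation: if every phrase of a sentence x gets the same importance 1/√|Fx| and every phrase weight is 1, the phrase-based score reduces to the cosine of the two binary phrase vectors, |Fq∩Fs| / √(|Fq|·|Fs|). The sketch below checks this numerically on made-up phrase sets; all names and data are hypothetical.

```python
import math


def phrase_sim(phi_q, phi_s, w):
    """Phrase-based score: sum over shared phrases of phi(q,f) * phi(s,f) * w(f)."""
    shared = phi_q.keys() & phi_s.keys()
    return sum(phi_q[f] * phi_s[f] * w(f) for f in shared)


def binary_cosine(fq, fs):
    """Cosine similarity of two sets viewed as 0/1 vectors."""
    return len(fq & fs) / math.sqrt(len(fq) * len(fs))


fq = {"he have", "have apple", "he apple"}
fs = {"he have", "have pencil", "he pencil", "he apple"}

# Uniform importance 1/sqrt(|F|) per phrase, unit phrase weight.
phi_q = {f: 1 / math.sqrt(len(fq)) for f in fq}
phi_s = {f: 1 / math.sqrt(len(fs)) for f in fs}

general = phrase_sim(phi_q, phi_s, lambda f: 1.0)
cosine = binary_cosine(fq, fs)
```

Both quantities agree up to floating-point rounding for these sets.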


High-Quality Phrase Selection

High-Quality Phrase

• Extend grams by allowing discontinuous terms
• A heuristic for selecting phrases
▪ Gap constraint: syntactic relationship of discontinuous terms
▪ Frequency constraint: infrequent (large IDF)
▪ Maximum constraint: 1) not a prefix of another phrase; 2) max. length

Example: sentence q = "He has eaten an apple", phrase "he eaten apple". The discontinuous terms are syntactically related. Frequency = # of sentences in the DB having the phrase.
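The frequency constraint can be illustrated with a small Python sketch. It assumes sentences are already lemmatized term lists, counts a sentence as "having" a phrase when the phrase terms appear in order with arbitrary gaps (which ignores the gap constraint for simplicity), and uses log(N / frequency) as the IDF. The exact IDF formula used in the talk is not given, so that choice, the function names, and the toy data are assumptions.

```python
import math


def is_subsequence(phrase, terms):
    """True if the phrase terms occur in order in the sentence (gaps allowed)."""
    it = iter(terms)
    return all(t in it for t in phrase)  # each 'in' consumes the iterator


def phrase_frequency(phrase, sentences):
    """Number of sentences in the DB having the phrase."""
    return sum(is_subsequence(phrase, s) for s in sentences)


def idf(phrase, sentences):
    """log(N / frequency); 0.0 for a phrase that never occurs (assumption)."""
    df = phrase_frequency(phrase, sentences)
    return math.log(len(sentences) / df) if df else 0.0


# Toy database of lemmatized sentences.
db = [
    ["he", "have", "apple"],
    ["he", "eat", "red", "apple"],
    ["he", "have", "pencil"],
    ["he", "have"],
]
freq_he_apple = phrase_frequency(("he", "apple"), db)
freq_he_have = phrase_frequency(("he", "have"), db)
```

A phrase that occurs in fewer sentences receives a larger IDF, which is why the heuristic prefers infrequent phrases.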

Phrase Selection

• Selecting phrases with gap and maximum constraints
▪ Example: sentence s = "He ate a red apple", with terms he, eat, red, apple
▪ Sentence graph over the terms: 1) sequential relationship; 2) syntactic relationship
• Longest path from a node = a phrase satisfying
▪ Gap constraint
▪ Maximum constraint
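A sketch of path-based phrase selection on the sentence graph, in Python. The interpretation used here is an assumption: edges run from an earlier term to a later one when the two are adjacent (sequential) or linked in the dependency parse (syntactic), and a phrase is any path from a start node that either reaches `max_len` terms or cannot be extended, so that no emitted phrase is a prefix of a longer path within the bound. The dependency pairs are hand-written for the example sentence, and `select_phrases` is a hypothetical name.

```python
def select_phrases(terms, syntactic_edges, max_len=4):
    """Enumerate maximal forward paths of at most max_len terms in the sentence graph."""
    n = len(terms)
    succ = {i: set() for i in range(n)}
    for i in range(n - 1):
        succ[i].add(i + 1)                  # 1) sequential relationship
    for a, b in syntactic_edges:
        succ[min(a, b)].add(max(a, b))      # 2) syntactic relationship

    phrases = []

    def extend(path):
        last = path[-1]
        # Stop when the length bound is hit or the path cannot grow further.
        if len(path) == max_len or not succ[last]:
            phrases.append(tuple(terms[k] for k in path))
            return
        for nxt in sorted(succ[last]):
            extend(path + [nxt])

    for start in range(n):
        extend([start])
    return phrases


terms = ["he", "eat", "red", "apple"]   # "He ate a red apple", lemmatized, stop word dropped
deps = [(0, 1), (1, 3), (2, 3)]         # he-eat, eat-apple, red-apple (assumed parse)
phrases = select_phrases(terms, deps)
```

Because consecutive terms on a path are always adjacent or syntactically related, a discontinuous phrase such as ("he", "eat", "apple") is produced while ("he", "apple") is not.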

Phrase Selection

• Select phrases with frequency constraint (threshold = 2)
• Sentences in the DB:
▪ He has an apple
▪ He ate a red apple
▪ He has a pencil
▪ He has
• Use a frequency trie
• Prune frequent phrases

[Figure: frequency trie over the phrases of these sentences. Each node Ni(c) carries a count c, e.g., N0(8) at the root, N1(4) for "he", N2(3) for "he have", and count-1 nodes for "pencil", "apple", "eat", and "red" below them; "#" nodes with count 0 mark the end of a phrase. Further branches start at "have", "eat", "red", ……]
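A frequency trie of this kind can be sketched in Python as follows. The sketch assumes each node counts how many inserted phrases pass through it, and that a phrase is pruned as "frequent" when its count exceeds the threshold; whether the boundary case counts as frequent is not stated on the slide. For brevity only one phrase per sentence is inserted, so the counts differ from those in the figure. Class and method names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.count = 0       # number of inserted phrases passing through this node
        self.children = {}   # term -> TrieNode


class FrequencyTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase):
        """Add a phrase, incrementing the count of every node on its path."""
        node = self.root
        node.count += 1
        for term in phrase:
            node = node.children.setdefault(term, TrieNode())
            node.count += 1

    def frequency(self, phrase):
        """Count stored at the node reached by the phrase, or 0 if absent."""
        node = self.root
        for term in phrase:
            node = node.children.get(term)
            if node is None:
                return 0
        return node.count

    def infrequent(self, phrases, threshold):
        """Keep phrases whose count does not exceed the threshold (assumed rule)."""
        return [p for p in phrases if self.frequency(p) <= threshold]


trie = FrequencyTrie()
for phrase in [("he", "have", "apple"), ("he", "eat", "red", "apple"),
               ("he", "have", "pencil"), ("he", "have")]:
    trie.insert(phrase)

kept = trie.infrequent([("he",), ("he", "have"), ("he", "have", "pencil")], threshold=2)
```

With threshold 2, the short and common phrases ("he",) and ("he", "have") are pruned and only the longer, rarer phrase survives.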


Experiment Setup

• Data sets
▪ DI: 520,899 parallel sentences from ICIBA
▪ DC: 800,000 parallel sentences from CNKI
• Baseline methods
▪ Jaccard Coefficient, Edit Distance, Cosine Similarity
▪ Translation Model Methods (TM)
▪ Cosine Similarity with VGRAM

Experiment Setup

• Evaluation metrics
▪ BLEU
◦ A well-known metric for machine translation
◦ Example: for the query q: "He has eaten an apple" (reference translation: 他吃了一个苹果) and the retrieved sentence s: "He has a pencil" (translation: 他有一支铅笔), BLEU is computed between the translation and the reference translation
▪ Precision
◦ A user study to label whether the translations are useful
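To make the BLEU example concrete, here is a deliberately simplified Python sketch: character-level BLEU-1, that is, clipped unigram precision multiplied by the brevity penalty. Standard BLEU combines clipped precisions for n-grams up to length 4, and the talk does not say which tokenization or n-gram order was used, so this is an illustration of the idea rather than the evaluation script behind the reported numbers. The function name `bleu1` is hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Character-level BLEU-1: clipped unigram precision times brevity penalty."""
    if not candidate:
        return 0.0
    cand, ref = Counter(candidate), Counter(reference)
    # Each candidate character is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[ch]) for ch, count in cand.items())
    precision = clipped / len(candidate)
    # Brevity penalty: penalize candidates that are not longer than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * precision


reference = "他吃了一个苹果"   # reference translation of q
candidate = "他有一支铅笔"     # translation attached to the retrieved sentence s
score = bleu1(candidate, reference)
```

Here only 他 and 一 are shared, so the clipped precision is 2/6, and the candidate is one character shorter than the reference, so the brevity penalty is exp(-1/6).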

Effects of Phrase Selection

[Figures: effect of max. length on DI; effect of freq. threshold on DC]

Comparison with Similarity Models

[Figure: comparison on the DI data set]

Comparison with Existing Methods

[Figure: comparison on the DC data set]

User Studies

• Methods used in commercial systems

[Figure: comparison on the DI data set]


Conclusion

• Searching closest sentence translations from the Web
• A phrase-based sentence similarity model
• High-quality phrase selection methods
• Extensive experiments and user studies


Thanks

My Homepage: http://dbgroup.cs.tsinghua.edu/fanju

Frequency Constraint

• Index structures
▪ Phrase → Sentence
• Frequent phrases → large inverted index
