Comparable Corpora

Kashyap Popat(113050023)

Rahul Sharnagat(11305R013)

Outline
- Motivation
- Introduction: Comparable Corpora
- Types of corpora
- Methods to extract information from comparable corpora
  - Bilingual dictionary
  - Parallel sentences
- Conclusion

Motivation
- Corpus: the most basic requirement in statistical NLP
- Large amount of bilingual text on the web
- Bilingual dictionary generation
  - One-to-one correspondence between words
- Parallel corpus generation
  - One-to-one correspondence between sentences
  - Very rare resource (e.g., Hindi-Chinese)

Comparable corpora[7]

“A comparable corpus is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora.” (Definition by EAGLES)

Characteristics of comparable corpora
- No parallel sentences
- No parallel paragraphs
- Fewer overlapping terms and words

Spectrum of Corpora

Unrelated corpora

Comparable corpora

Parallel corpora

Transcription

- sentence by sentence aligned

A comparable corpus

Application of comparable corpora
- Generating bilingual lexical entries (dictionary)
- Creating parallel corpora


Generating bilingual lexical entries

Basic postulates[1]
- Words with productive context in one language translate to words with productive context in the second language
  - e.g., table → मेज़
- Words with rigid context translate into words with rigid context
  - e.g., haemoglobin → रक्ताणु
- Correlation between co-occurrence patterns in different languages

Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus, Fung, 1995

Co-occurrence patterns[4]

If a term A co-occurs with another term B in some text T, then its translation A' also co-occurs with B' (the translation of B) in some other text T'.

[Figure: A and B co-occurring in text T; A' and B' co-occurring in text T']

Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp, 1999

Co-occurrence Histogram[2]

For the word ‘Debenture’

[Figure: histogram with co-occurring words on one axis and count on the other]

Finding terminology translations from non-parallel corpora, Fung and McKeown, 1997
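A co-occurrence histogram like the one on this slide can be sketched as a simple neighbour count. This is only an illustration: the fixed symmetric window, the function name `cooccurrence_histogram`, and the toy sentence are assumptions, not taken from the cited paper.

```python
from collections import Counter


def cooccurrence_histogram(tokens, target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # do not count the target occurrence itself
                counts[tokens[j]] += 1
    return counts


# Toy sentence, invented for illustration only.
tokens = "the debenture holder sold the debenture bond".split()
hist = cooccurrence_histogram(tokens, "debenture", window=1)
```

The resulting `Counter` maps each neighbouring word to its count, which is the data a histogram plot would display.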

Basic Approach[3]
- Calculate the co-occurrence matrix for all the words in the source language L1 and the target language L2
- The word order of the L1 matrix is permuted until the resulting pattern is most similar to that of the L2 matrix

Identifying word translations in nonparallel texts, R. Rapp, 1995
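The permutation idea can be sketched with an exhaustive search over word orders. This is a minimal sketch under assumptions: the sum of absolute differences as the mismatch measure, the function names, and the toy 3x3 matrices are all invented for illustration. The sketch re-indexes the second matrix rather than the first, which is equivalent for matching purposes.

```python
from itertools import permutations


def mismatch(a, b, perm):
    """Sum of absolute differences between a and b re-indexed by perm."""
    n = len(a)
    return sum(
        abs(a[i][j] - b[perm[i]][perm[j]])
        for i in range(n)
        for j in range(n)
    )


def best_permutation(a, b):
    """Try every ordering of b's words and keep the closest match to a."""
    n = len(a)
    return min(permutations(range(n)), key=lambda p: mismatch(a, b, p))


# Toy symmetric co-occurrence counts (invented numbers).
# Rows/columns of `english`: book, school, sky
english = [[0, 5, 1],
           [5, 0, 2],
           [1, 2, 0]]
# Rows/columns of `hindi`: आकाश (sky), किताब (book), पाठशाला (school)
hindi = [[0, 1, 2],
         [1, 0, 5],
         [2, 5, 0]]

perm = best_permutation(english, hindi)  # perm[i] = Hindi index for English word i
```

Because the search enumerates all n! orderings, it is only feasible for a handful of words.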

English co-occurrence matrix (L1 matrix)

Columns: 1 2 3 4 5 6

Row word | Row index
Book | 1
Garden | 2
Plant | 3
School | 4
Sky | 5
Teacher | 6

Hindi co-occurrence matrix (L2 matrix)

Columns: 1 2 3 4 5 6

Row word | Row index
आकाश | 1
पाठशाला | 2
शिक्षक | 3
बगीचा | 4
किताब | 5
पौधा | 6

Hindi co-occurrence matrix (L2 matrix): after permutations

Columns: 5 4 6 2 1 3

Row word | Row index
किताब | 5
बगीचा | 4
पौधा | 6
पाठशाला | 2
आकाश | 1
शिक्षक | 3

Result

Comparing the order of the words in the L1 matrix and the permuted L2 matrix

L1 index | Word in L1 | Word in L2 | L2 index
1 | Book | किताब | 5
2 | Garden | बगीचा | 4
3 | Plant | पौधा | 6
4 | School | पाठशाला | 2
5 | Sky | आकाश | 1
6 | Teacher | शिक्षक | 3

Problems
- Permuting the co-occurrence matrix is expensive
- Size of each vector = # of unique terms in the language

A New Method[2]

Dictionary entries are used as seed words to generate correlation matrices

Algorithm:
- A bilingual list of known translation pairs (seed words) is given
- Step 1: For every word ‘e’ in L1, find its correlation vector (M1) with every L1 word in the seed list
- Step 2: For every word ‘c’ in L2, find its correlation vector (M2) with every L2 word in the seed list
- Step 3: Compute correlation(M1, M2); if it is high, ‘e’ and ‘c’ are considered a translation pair

Finding terminology translations from non-parallel corpora, Fung and McKeown, 1997
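The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: raw sentence-level co-occurrence counts stand in for the correlation vectors, cosine similarity stands in for correlation(M1, M2), and the toy corpora, seed list, and function names are invented.

```python
import math
from collections import Counter


def seed_vector(sentences, word, seeds):
    """Count, per seed word, the sentences in which it co-occurs with `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(sentence)
        if word in tokens:
            for seed in seeds:
                if seed != word and seed in tokens:
                    counts[seed] += 1
    return [counts[seed] for seed in seeds]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def best_translation(src_sents, tgt_sents, word, candidates, seed_pairs):
    """Steps 1-3: build M1, build M2 for each candidate, pick the best score."""
    src_seeds = [s for s, _ in seed_pairs]
    tgt_seeds = [t for _, t in seed_pairs]
    m1 = seed_vector(src_sents, word, src_seeds)
    scores = {
        c: cosine(m1, seed_vector(tgt_sents, c, tgt_seeds)) for c in candidates
    }
    return max(scores, key=scores.get), scores


# Toy, non-parallel corpora (invented for illustration).
src_sents = [["flower", "garden"], ["flower", "plant"],
             ["flower", "garden", "plant"], ["cloud", "sky"]]
tgt_sents = [["फूल", "बगीचा"], ["फूल", "पौधा"],
             ["बादल", "आकाश"], ["बादल", "आकाश", "पौधा"]]
seed_pairs = [("garden", "बगीचा"), ("plant", "पौधा"), ("sky", "आकाश")]

best, scores = best_translation(
    src_sents, tgt_sents, "flower", ["फूल", "बादल"], seed_pairs
)
```

Both vectors have one dimension per seed pair, so they stay comparable across languages even though the corpora share no sentences.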

Co-occurrence

Seed word list

English | Hindi
Garden | बगीचा
… | …
Plant | पौधा
… | …
Sky | आकाश
… | …

Word pair: Flower, फूल

Crux of the Algorithm
- Two main steps:
  - Formation of the co-occurrence matrix
  - Measuring the similarity between vectors
- Different methods are possible for each of these two steps
- Advantage: vector size reduces to # of unique words in the seed list

Improvements
- Window size for co-occurrence calculation[2]
  - Should it be the same for all the words?

$$\text{window\_size} \propto \frac{1}{\text{frequency}(w_s)}$$

- Co-occurrence counts
- Similarity measure

Finding terminology translations from non-parallel corpora, Fung and McKeown, 1997

Co-occurrence count
- Mutual Information (Church & Hanks, 1989)
- Conditional Probability (Rapp, 1996)
- Chi-Square Test (Dunning, 1993)
- Log-likelihood Ratio (Dunning, 1993)
- TF-IDF (Fung et al., 1998)

Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp, 1999

Mutual Information[2] (1/2)

- k11 = # of segments where both w_s and w_t occur
- k12 = # of segments where only w_s occurs
- k21 = # of segments where only w_t occurs
- k22 = # of segments where neither word occurs

Segments: sentences, paragraphs, or string groups delimited by anchor points

$$\Pr(w_s = 1) = \frac{k_{11} + k_{12}}{k_{11} + k_{12} + k_{21} + k_{22}}$$

$$\Pr(w_t = 1) = \frac{k_{11} + k_{21}}{k_{11} + k_{12} + k_{21} + k_{22}}$$

$$\Pr(w_s = 1, w_t = 1) = \frac{k_{11}}{k_{11} + k_{12} + k_{21} + k_{22}}$$

Finding terminology translations from non-parallel corpora, Fung and McKeown, 1997

Mutual Information[2] (2/2)

Weighted mutual information

$$W(w_x, w_s) = \Pr(w_x = 1, w_s = 1)\,\log_2 \frac{\Pr(w_x = 1, w_s = 1)}{\Pr(w_x = 1)\,\Pr(w_s = 1)}$$
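As a worked illustration, the weighted mutual information can be computed directly from the four segment counts k11, k12, k21, and k22 defined on the previous slide. The function name and the convention of returning 0 when the words never co-occur are assumptions made for this sketch.

```python
import math


def weighted_mutual_information(k11, k12, k21, k22):
    """Weighted MI of two words from their segment contingency counts."""
    total = k11 + k12 + k21 + k22
    if total == 0 or k11 == 0:
        # Assumed convention: no joint occurrences contribute nothing.
        return 0.0
    p_joint = k11 / total           # Pr(both words occur)
    p_first = (k11 + k12) / total   # Pr(first word occurs)
    p_second = (k11 + k21) / total  # Pr(second word occurs)
    return p_joint * math.log2(p_joint / (p_first * p_second))


# Words that always occur together in 1 of 4 segments: 0.25 * log2(4) = 0.5
always_together = weighted_mutual_information(1, 0, 0, 3)
# Statistically independent words give a score of 0.
independent = weighted_mutual_information(2, 2, 2, 2)
```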

Similarity Measures (1/2)
- Cosine similarity (Fung and McKeown, 1997)
- Jaccard similarity (Grefenstette, 1994)
- Dice similarity (Rapp, 1999)
- L1 norm / City block distance (Jones & Furnas, 1987)
- L2 norm / Euclidean distance (Fung, 1997)

Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp, 1999

Similarity Measures (2/2)

- L1 norm / City block distance

$$d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$$

- L2 norm / Euclidean distance

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

- Cosine similarity

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$$

- Jaccard similarity

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
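The four measures can be sketched in a few lines of plain Python. The function names and toy inputs are assumptions for illustration. The Jaccard version operates on sets of co-occurring words, while the other three operate on count vectors of equal length.

```python
import math


def l1_distance(p, q):
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def l2_distance(p, q):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_similarity(p, q):
    """Cosine of the angle between p and q (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


p, q = [1, 0, 2], [0, 1, 2]  # toy co-occurrence vectors
d1 = l1_distance(p, q)       # |1-0| + |0-1| + |2-2| = 2
d2 = l2_distance(p, q)       # sqrt(1 + 1 + 0)
cos_pq = cosine_similarity(p, q)  # 4 / (sqrt(5) * sqrt(5))
jac = jaccard_similarity({"garden", "plant", "sky"},
                         {"plant", "sky", "school"})  # 2 shared of 4 total
```

Note that the two distances decrease as vectors become more alike, while cosine and Jaccard increase, so a ranking of candidates must sort in the matching direction.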

Problems with the approach[5]
- Coverage: only a few corpus words are covered by the dictionary
- Synonymy / polysemy: several entries have the same meaning (synonymy), or an entry has several meanings (polysemy)
- Similarities w.r.t. synonyms should not be independent
- Improvements in the form of geometric approaches
  - Project the co-occurrence vectors of the source and target words onto the dictionary entries
  - Measure the similarity between the projected vectors

A geometric view on bilingual lexicon extraction from comparable corpora, Gaussier et al., 2004

Results

Paper | Approach | Method | Corpus | Accuracy
Fung et al. 1996 | Word list based | Best candidate | English/Japanese | 29%
Fung et al. 1996 | Word list based | Top 20 candidate output | English/Japanese | 50.9%
Gaussier et al. 2004 | Geometric | Avg. precision | English/French | 44%
Rapp 1999 | Word list based | 100 test words | English/German | 72%


Generating parallel corpora

Generating Parallel Corpora
- Involves aligning the sentences in the comparable corpora to form a parallel corpus
- Ways to do this:
  - Dictionary matching
  - Statistical methods

Ways to do alignment
- Dictionary matching
  - If the words in two given sentences are translations of each other, it is most likely that the sentences are translations of each other
  - Process is very slow
  - Accuracy is high, but the method cannot be applied to a large corpus
- Statistical methods
  - To predict the alignment, these methods use the distribution of sentence lengths in the corpus, measured either in words (Brown et al., 1991) or in characters (Gale and Church, 1991)
  - Make no use of any lexical resources
  - Fast and accurate

Length-based statistical approach
- Preprocessing
  - Segment the text into tokens
  - Combine the tokens into groups (i.e., sentences)
- Find anchor points
  - Find points in the corpus where we are sure that the start and end points in one language align with the start and end points in the other language
  - Finding these points requires analysis of the corpus

Example
- Brown et al. already had anchors in their corpus
- Used Canadian parliament proceedings (‘Hansards’) as a parallel corpus
- Each proceeding starts with a comment, the time of the proceeding, who was giving the speech, etc.
- This information provides the anchor points

Sample text from Aligning Sentences in Parallel Corpora, P. Brown, Jennifer Lai and Robert Mercer, 1991

Aligning anchor points
- Anchor points are not always perfect
  - Some may be missing
  - Some may be garbled
- To find the alignment between these anchors, a dynamic programming technique is used
- We find an alignment of the major anchors in the two corpora with the least total cost
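A least-cost anchor alignment can be sketched with a standard dynamic programming table. This is not the cost function of Brown et al., which the slides do not give. The sketch assumes that identical anchors pair at zero cost, that a missing or garbled anchor is skipped at a fixed `skip_cost`, and that anchors are plain strings. The labels in the example are hypothetical.

```python
def align_anchors(src, tgt, skip_cost=1.0):
    """Return (total_cost, pairs) for the cheapest monotone alignment."""
    n, m = len(src), len(tgt)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * skip_cost
    for j in range(1, m + 1):
        cost[0][j] = j * skip_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1]) + skip_cost
            if src[i - 1] == tgt[j - 1]:
                best = min(best, cost[i - 1][j - 1])  # pair identical anchors
            cost[i][j] = best

    # Trace back from the end to recover which anchors were paired.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if src[i - 1] == tgt[j - 1] and cost[i][j] == cost[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + skip_cost:
            i -= 1  # source anchor has no partner
        else:
            j -= 1  # target anchor has no partner
    pairs.reverse()
    return cost[n][m], pairs


# Hypothetical anchor labels: "B" is missing on one side, "X" is a garbled anchor.
total, pairs = align_anchors(["A", "B", "C", "D"], ["A", "C", "X", "D"])
```

Sentences lying between two consecutive paired anchors can then be aligned independently of the rest of the corpus.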

Beads
- At a high level, a corpus can be viewed as a sequence of sentence lengths occasionally separated by paragraph markers
- Each of these groupings is called a bead
- A bead is a type of sentence grouping

Sample text from Aligning Sentences in Parallel Corpora, P. Brown, Jennifer Lai and Robert Mercer, 1991

Beads

Bead type | Content
e | Only one English sentence
f | Only one French sentence
ef | One English and one French sentence
eef | Two English sentences and one French sentence
eff | One English sentence and two French sentences
¶e | One English paragraph
¶f | One French paragraph
¶e¶f | One English and one French paragraph

Example of beads

Problem Formulation
- Sentences between the anchor points are generated by two random processes:
  1. Producing a sequence of beads
  2. Choosing the length of the sentence(s) in each bead
- Bead generation can be modeled using a two-state Markov model
- One sentence can align to zero, one or two sentences on the other side
- Allows any of the eight beads shown in the previous table
- Assumptions:

$$\Pr(e) = \Pr(f), \quad \Pr(eef) = \Pr(eff), \quad \Pr(\P_e) = \Pr(\P_f)$$

Modeling length of sentence
- Model the probability of the length of a sentence given its bead
- Assumptions are made:
  - e-beads and f-beads: the probability of le or lf is the same as the probability of le or lf in the whole corpus
  - ef-bead:
    - English sentence: length le with probability Pr(le)
    - French sentence: the log ratio of French to English sentence length is normally distributed with mean µ and variance σ²

$$\Pr(l_f \mid l_e) \propto \exp\left(-\frac{(r - \mu)^2}{2\sigma^2}\right)$$

where $r = \log(l_f / l_e)$
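The ef-bead length model can be sketched as an unnormalized score. The parameter values below (µ = 0, σ² = 0.25) are placeholders chosen for illustration, not the estimates reported by Brown et al., and the function name is hypothetical.

```python
import math


def ef_length_score(l_e, l_f, mu, sigma2):
    """Unnormalized Pr(l_f | l_e) for an ef-bead: exp(-(r - mu)^2 / (2 sigma^2))."""
    r = math.log(l_f / l_e)  # log ratio of French to English sentence length
    return math.exp(-((r - mu) ** 2) / (2.0 * sigma2))


# Placeholder parameters (assumed, not from the paper).
MU, SIGMA2 = 0.0, 0.25

similar_lengths = ef_length_score(20, 22, MU, SIGMA2)    # plausible pair
different_lengths = ef_length_score(20, 60, MU, SIGMA2)  # implausible pair
```

A pair of sentences with similar lengths scores close to 1, while a pair with very different lengths scores close to 0, which is what lets the aligner prefer one bead sequence over another without any lexical information.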

Contd.

eef-bead:
- English sentences: lengths drawn from Pr(le)
- French sentence: r is distributed according to the same normal distribution, with r as

$$r = \log\frac{l_f}{l_{e_1} + l_{e_2}}$$

eff-bead:
- English sentence: length drawn from Pr(le)
- French sentences: r is distributed according to the same normal distribution, with r as

$$r = \log\frac{l_{f_1} + l_{f_2}}{l_e}$$

- Given the sum of the lengths of the French sentences, the probability of a particular pair lf1 and lf2 is proportional to

$$\Pr(l_{f_1})\,\Pr(l_{f_2})$$

Parameter Estimation
- Using the EM algorithm, estimate the parameters of the Markov model
- The following results were obtained

Sample text from Aligning Sentences in Parallel Corpora, P. Brown, Jennifer Lai and Robert Mercer, 1991

Results
- In a random sample of 1000 sentences, only 6 were not translations of each other
- Brown et al. also studied the effect of anchor points. According to them:
  - with paragraph markers but no anchor points, a 2.0% error rate is expected
  - with anchor points but no paragraph markers, a 2.3% error rate is expected
  - with neither anchor points nor paragraph markers, a 3.2% error rate is expected

Conclusion
- Comparable corpora can be used to generate bilingual dictionaries and parallel corpora
- Generating bilingual dictionaries
  - Polysemy and sense disambiguation still remain a major challenge
- Generating parallel corpora
  - Given the aligned points, the aligner is likely to give good results
  - The experiments were very specific to the corpora used, so the accuracy is hard to generalize
  - Sentences whose lengths give them a high chance of being aligned, but which are completely wrong translations of each other, might confuse the aligner

References

1. Fung, P. (1995). Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. Proceedings of the 3rd Annual Workshop on Very Large Corpora, Boston, Massachusetts, 173-183.

2. Fung, P.; McKeown, K. (1997). Finding terminology translations from non-parallel corpora. Proceedings of the 5th Annual Workshop on Very Large Corpora, Hong Kong, 192-202.

3. Rapp, R. (1995). Identifying word translations in nonparallel texts. Proceedings of the 33rd Meeting of the Association for Computational Linguistics, Cambridge, Massachusetts, 320-322.

4. Rapp, R. (1999). Automatic Identification of Word Translations from Unrelated English and German Corpora. Proceedings of the ACL-99, pp. 1–17, College Park, USA.

5. Gaussier, E.; Renders, J.-M.; Matveeva, I.; Goutte, C.; Dejean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 527–534.

6. Brown, P. F.; Lai, J. C.; Mercer, R. L. (1991). Aligning sentences in parallel corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL '91).

7. http://www.ilc.cnr.it/EAGLES/corpustyp/node21.html