comparable corpora kashyap popat(113050023) rahul sharnagat(11305r013)

Comparable Corpora

Kashyap Popat(113050023)

Rahul Sharnagat(11305R013)

Outline
- Motivation
- Introduction: Comparable Corpora
- Types of corpora
- Methods to extract information from comparable corpora
  - Bilingual dictionary
  - Parallel sentences
- Conclusion

Motivation
- Corpus: the most basic requirement in statistical NLP
- Large amount of bilingual text on the web
- Bilingual dictionary generation
  - One-to-one correspondence between words
- Parallel corpus generation
  - One-to-one correspondence between sentences
  - Very rare resource (Hindi – Chinese)

Comparable corpora[7]

“A comparable corpus is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora.”

Characteristics of comparable corpora
- No parallel sentences
- No parallel paragraphs
- Fewer overlapping terms and words

Definition by EAGLES

Spectrum of Corpora

Unrelated corpora

Comparable corpora

Parallel corpora

Transcription

- sentence by sentence aligned

A comparable corpus

Applications of comparable corpora
- Generating bilingual lexical entries (dictionary)
- Creating parallel corpora

Generating bilingual lexical entries

Basic postulates[1]
- Words with a productive context in one language translate to words with a productive context in the second language
  - e.g., table and मेज़
- Words with a rigid context translate into words with a rigid context
  - e.g., Haemoglobin and रक्ताणु
- Correlation between co-occurrence patterns in different languages

Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus, Fung, 1995

Co-occurrence patterns[4]

If a term A co-occurs with another term B in some text T, then its translation A' also co-occurs with B' (the translation of B) in some other text T'

Automatic Identification of Word Translations from Unrelated English and German Corpora. R. Rapp, 1999

[Figure: terms A and B co-occur in text T; their translations A' and B' co-occur in text T']

Co-occurrence Histogram[2]
For the word 'Debenture'

[Figure: histogram with axes "Words" and "Count"]

Finding terminology translations from non-parallel corpora. Fung, 1997

Basic Approach[3]

Calculate the co-occurrence matrix for all the words in source language L1 and target language L2

Word order of the L1 matrix is permuted until the resulting pattern is most similar to that of the L2 matrix

Identifying word translations in nonparallel texts, Rapp, R., 1995

English co-occurrence matrix L1 matrix

1 2 3 4 5 6

Book 1

Garden 2

Plant 3

School 4

Sky 5

Teacher 6

Hindi co-occurrence matrix L2 matrix

1 2 3 4 5 6

आकाश 1

पाठशाला 2

शिक्षक 3

बगीचा 4

किताब 5

पौधा 6

Hindi co-occurrence matrix L2 matrix: after permutations

5 4 6 2 1 3

किताब 5

बगीचा 4

पौधा 6

पाठशाला 2

आकाश 1

शिक्षक 3

Result
Comparing the order of the words in the L1 matrix and the permuted L2 matrix

L1 index   Word in L1   Word in L2   L2 index
1          Book         किताब        5
2          Garden       बगीचा        4
3          Plant        पौधा         6
4          School       पाठशाला      2
5          Sky          आकाश         1
6          Teacher      शिक्षक       3
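The permutation idea can be illustrated with a minimal sketch. It assumes tiny, symmetric co-occurrence matrices with invented counts and uses a brute-force search over all orderings, which is only one possible way to do the matching; the function name `best_permutation` and the sum-of-absolute-differences cost are assumptions made for illustration, not the procedure of the cited paper.

```python
from itertools import permutations

def best_permutation(m1, m2):
    """Return the ordering of m2's rows/columns that best matches m1."""
    n = len(m1)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        # Sum of absolute differences between m1 and the permuted m2.
        cost = sum(
            abs(m1[i][j] - m2[perm[i]][perm[j]])
            for i in range(n)
            for j in range(n)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Toy symmetric co-occurrence counts (invented for illustration).
english = ["book", "school", "sky"]
m1 = [
    [0, 5, 1],
    [5, 0, 2],
    [1, 2, 0],
]
hindi = ["आकाश", "पाठशाला", "किताब"]
m2 = [
    [0, 2, 1],
    [2, 0, 5],
    [1, 5, 0],
]

perm, cost = best_permutation(m1, m2)
# Word i of L1 is paired with word perm[i] of L2.
pairs = [(english[i], hindi[perm[i]]) for i in range(len(english))]
```

The search visits n! orderings, so a sketch like this is only usable for a handful of words.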

Problems
- Permuting the co-occurrence matrix is expensive
- Size of each vector = # of unique terms in the language

A New Method[2]

Dictionary entries are used as seed words to generate correlation matrices

Algorithm: a bilingual list of known translation pairs (seed words) is given
- Step 1: For every word 'e' in L1, find its correlation vector (M1) with every L1 word in the seed list
- Step 2: For every word 'c' in L2, find its correlation vector (M2) with every L2 word in the seed list
- Step 3: Compute correlation(M1, M2); if it is high, 'e' and 'c' are considered a translation pair
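The three steps can be sketched on toy data. This is a minimal sketch under several assumptions: sentence-level co-occurrence counts stand in for the correlation vector, cosine similarity stands in for correlation(M1, M2), and the seed list, the corpora and the romanized Hindi tokens are invented placeholders.

```python
import math

def cooccurrence_vector(word, sentences, seeds):
    """For each seed word, count the sentences it shares with `word`."""
    return [
        sum(1 for s in sentences if word in s and seed in s)
        for seed in seeds
    ]

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical seed lexicon: position i in both lists is a known translation pair.
seeds_l1 = ["garden", "sky"]
seeds_l2 = ["bagicha", "aakash"]

# Toy, unaligned corpora; each sentence is a set of tokens.
corpus_l1 = [{"flower", "garden"}, {"flower", "garden", "plant"}, {"star", "sky"}]
corpus_l2 = [{"phool", "bagicha"}, {"tara", "aakash"}, {"tara", "aakash", "raat"}]

# Step 1: vector M1 for the L1 word 'flower'.
m1 = cooccurrence_vector("flower", corpus_l1, seeds_l1)

# Steps 2 and 3: vector M2 for each L2 candidate, then compare with M1.
scores = {
    c: cosine(m1, cooccurrence_vector(c, corpus_l2, seeds_l2))
    for c in ["phool", "tara"]
}
best = max(scores, key=scores.get)
```

Because both vectors are indexed by the same seed pairs, they can be compared directly even though the two corpora are not aligned.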


[Figure: co-occurrence of the word pair Flower / फूल with the entries of the seed word list]

Seed word list
English   Hindi
Garden    बगीचा
…         …
Plant     पौधा
…         …
Sky       आकाश
…         …

Crux of the Algorithm
Two main steps:
- Formation of the co-occurrence matrix
- Measuring the similarity between vectors

Different methods are possible for each of these two steps

Advantage: vector size reduces to # of unique words in the seed list

Improvements
- Window size for co-occurrence calculation[2]
  - Should it be the same for all words?
  - \( \text{window size} \propto \dfrac{1}{\text{frequency}(w_s)} \)
- Co-occurrence counts
- Similarity measure


Co-occurrence count
- Mutual Information (Church & Hanks, 1989)
- Conditional Probability (Rapp, 1996)
- Chi-Square Test (Dunning, 1993)
- Log-likelihood Ratio (Dunning, 1993)
- TF-IDF (Fung et al., 1998)

Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp, 1999

Mutual Information[2] (1/2)

k11 = # of segments where both ws and wt occur

k12 = # of segments where only ws occurs

k21 = # of segments where only wt occurs

k22 = # of segments where neither word occurs

Segments: sentences, paragraphs, or string groups delimited by anchor points

\[ \Pr(w_s = 1) = \frac{k_{11} + k_{12}}{k_{11} + k_{12} + k_{21} + k_{22}} \]

\[ \Pr(w_t = 1) = \frac{k_{11} + k_{21}}{k_{11} + k_{12} + k_{21} + k_{22}} \]

\[ \Pr(w_s = 1, w_t = 1) = \frac{k_{11}}{k_{11} + k_{12} + k_{21} + k_{22}} \]


Mutual Information[2] (2/2)
Weighted mutual information

\[ W(w_s, w_t) = \Pr(w_s = 1, w_t = 1) \, \log_2 \frac{\Pr(w_s = 1, w_t = 1)}{\Pr(w_s = 1) \, \Pr(w_t = 1)} \]
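As a worked example, the weighted mutual information can be computed directly from the four segment counts k11, k12, k21 and k22. The sketch below is illustrative only; the function name is hypothetical, and returning 0 when the two words never co-occur is an assumed convention (the limit of x log x as x goes to 0).

```python
import math

def weighted_mutual_information(k11, k12, k21, k22):
    """Weighted MI of ws and wt from segment counts.

    k11: segments with both words, k12: only ws,
    k21: only wt, k22: neither word.
    """
    total = k11 + k12 + k21 + k22
    p_s = (k11 + k12) / total   # Pr(ws = 1)
    p_t = (k11 + k21) / total   # Pr(wt = 1)
    p_st = k11 / total          # Pr(ws = 1, wt = 1)
    if p_st == 0:
        return 0.0              # assumed convention: no co-occurrence scores 0
    return p_st * math.log2(p_st / (p_s * p_t))

# Words that always appear together in half of the segments.
always_together = weighted_mutual_information(4, 0, 0, 4)

# Words that co-occur exactly as often as chance predicts.
independent = weighted_mutual_information(2, 2, 2, 2)
```

In the first case Pr(ws = 1, wt = 1) = 0.5 and the ratio inside the logarithm is 2, so the score is 0.5; in the second case the ratio is 1 and the score is 0.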

Similarity Measures (1/2)
- Cosine similarity (Fung and McKeown, 1997)
- Jaccard similarity (Grefenstette, 1994)
- Dice similarity (Rapp, 1999)
- L1 norm / City block distance (Jones & Furnas, 1987)
- L2 norm / Euclidean distance (Fung, 1997)

Automatic Identification of Word Translations from Unrelated English and German Corpora, R. Rapp, 1999

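Similarity Measures (2/2)

L1 norm / City block distance
\[ d(p, q) = \sum_{i=1}^{n} |p_i - q_i| \]

L2 norm / Euclidean distance
\[ d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \]

Cosine similarity
\[ \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} \]

Jaccard similarity
\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]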
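The four measures can be written in a few lines each. The sketch below is a plain-Python illustration under two assumptions: vectors are equal-length lists of numbers, and Jaccard similarity is taken over sets (for example, the sets of seed words with a non-zero count). The function names are illustrative only.

```python
import math

def l1_distance(p, q):
    """City block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def l2_distance(p, q):
    """Euclidean distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

p, q = [1, 2, 3], [4, 6, 3]
d1 = l1_distance(p, q)                      # 3 + 4 + 0
d2 = l2_distance(p, q)                      # sqrt(9 + 16 + 0)
cos_parallel = cosine_similarity([1, 2], [2, 4])
jac = jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})
```

Note that the two norms are distances (smaller means more similar), while cosine and Jaccard are similarities (larger means more similar), so candidate rankings have to be sorted accordingly.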

Problems with the approach[5]

Coverage: only a few corpus words are covered by the dictionary

Synonymy / Polysemy: several entries have the same meaning (synonymy), or an entry has several meanings (polysemy)

Similarities w.r.t. synonyms should not be treated as independent

Improvements in the form of geometric approaches
- Project the co-occurrence vectors of the source and target words onto the dictionary entries
- Measure the similarity between the projected vectors

A geometric view on bilingual lexicon extraction from comparable corpora. Gaussier, et al., 2004

Results

Paper                  Approach          Method                    Corpus             Accuracy
Fung et al. 1996       Word list based   Best candidate            English/Japanese   29%
Fung et al. 1996       Word list based   Top 20 candidate output   English/Japanese   50.9%
Gaussier et al. 2004   Geometric         Avg. Precision            English/French     44%
R. Rapp 1999           Word list based   100 test words            English/German     72%

Generating parallel corpora

Generating Parallel Corpora
- Involves aligning the sentences in the comparable corpora to form a parallel corpus
- Ways to do it:
  - Dictionary matching
  - Statistical methods

Ways to do alignment
- Dictionary matching
  - If the words in two given sentences are translations of each other, it is most likely that the sentences are translations of each other
  - The process is very slow
  - Accuracy is high, but the method cannot be applied to a large corpus
- Statistical methods
  - To predict the alignment, these methods use the distribution of sentence lengths in the corpus, measured either in words (Brown et al., 1991) or in characters (Gale and Church, 1991)
  - Make no use of any lexical resources
  - Fast and accurate
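The dictionary matching idea can be sketched as a simple overlap score. This is an illustrative assumption rather than a method taken from the cited papers: the lexicon is a hypothetical mapping from a source word to a set of possible translations, and the score is the fraction of source tokens that have at least one translation in the candidate target sentence.

```python
def dictionary_match_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with a dictionary translation in the target."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt)
    return hits / len(src_tokens)

# Hypothetical English-French lexicon, invented for illustration.
lexicon = {
    "the": {"le", "la"},
    "red": {"rouge"},
    "house": {"maison"},
}

src = ["the", "red", "house"]
good = dictionary_match_score(src, ["la", "maison", "rouge"], lexicon)
bad = dictionary_match_score(src, ["le", "chat", "noir"], lexicon)
```

Scoring every source sentence against every target sentence in this way takes time proportional to the product of the two corpus sizes, which is consistent with the slowness noted above.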

Length-based statistical approach
- Preprocessing
  - Segment the text into tokens
  - Combine the tokens into groups (that is, sentences)
- Find anchor points
  - Find points in the corpus where we are sure that the start and end points in one language align with the start and end points in the other language
  - Finding these points requires analysis of the corpus

Example
- Brown et al. already had anchors in their corpus
- They used the Canadian parliamentary proceedings ('Hansards') as a parallel corpus
- Each proceeding starts with a comment, the time of the proceeding, who was giving the speech, etc.
- This information provides the anchor points

Sample text from Aligning Sentences in Parallel Corpora, P. Brown, Jennifer Lai and Robert Mercer, 1991

Aligning anchor points
- Anchor points are not always perfect
  - Some may be missing
  - Some may be garbled
- To find the alignment between these anchors, a dynamic programming technique is used
- We find an alignment of the major anchors in the two corpora with the least total cost
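A least-cost alignment of two anchor sequences can be found with a standard edit-distance style dynamic program. The sketch below is a simplification under stated assumptions: anchors are plain strings, a missing anchor costs `skip_cost`, and pairing two different (garbled) anchors costs `mismatch_cost`. Both cost values and the function name are hypothetical and are not the costs used by Brown et al.

```python
def align_anchors(a, b, skip_cost=1.0, mismatch_cost=2.0):
    """Least-cost monotone alignment of two anchor sequences."""
    n, m = len(a), len(b)
    # cost[i][j]: cheapest alignment of a[:i] with b[:j].
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * skip_cost
    for j in range(1, m + 1):
        cost[0][j] = j * skip_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = 0.0 if a[i - 1] == b[j - 1] else mismatch_cost
            cost[i][j] = min(
                cost[i - 1][j - 1] + pair,   # pair the two anchors
                cost[i - 1][j] + skip_cost,  # anchor missing in b
                cost[i][j - 1] + skip_cost,  # anchor missing in a
            )
    # Trace back from the end to recover which anchors were paired.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        pair = 0.0 if a[i - 1] == b[j - 1] else mismatch_cost
        if cost[i][j] == cost[i - 1][j - 1] + pair:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + skip_cost:
            i -= 1
        else:
            j -= 1
    return cost[n][m], pairs[::-1]

# The 'time' anchor is missing on the French side.
english = ["speaker", "time", "speaker", "comment"]
french = ["speaker", "speaker", "comment"]
total_cost, matched = align_anchors(english, french)
```

Here the cheapest alignment skips the unmatched 'time' anchor and pairs the remaining three anchors in order.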

Beads
- At a high level, the corpus can be viewed as a sequence of sentence lengths occasionally separated by paragraph markers
- Sentences are grouped together, and each of these groupings is called a bead
- A bead is a type of sentence grouping


Beads

Bead type   Content
e           Only one English sentence
f           Only one French sentence
ef          One English and one French sentence
eef         Two English sentences and one French sentence
eff         One English sentence and two French sentences
¶e          One English paragraph
¶f          One French paragraph
¶e¶f        One English and one French paragraph

Example of beads

Problem Formulation
- Sentences between the anchor points are generated by two random processes:
  1. Producing a sequence of beads
  2. Choosing the length of the sentence(s) in each bead
- Bead generation can be modeled using a two-state Markov model
- One sentence can align to zero, one or two sentences on the other side
- Allows any of the eight beads shown in the previous table
- Assumptions:

\[ \Pr(e) = \Pr(f), \qquad \Pr(eef) = \Pr(eff), \qquad \Pr(\P_e) = \Pr(\P_f) \]

Modeling length of sentence
- Model the probability of the length of a sentence given its bead
- Assumptions are made:
  - e-beads and f-beads: the probability of le or lf is the same as the probability of le or lf in the whole corpus
  - ef-bead:
    - English sentence: length le with probability Pr(le)
    - French sentence: the log ratio of French to English sentence length is normally distributed with mean µ and variance σ²

\[ \Pr(l_f \mid l_e) \propto \exp\!\left( -\frac{(r - \mu)^2}{2\sigma^2} \right) \]

where r = log(lf / le)

Contd..
- eef-bead:
  - English sentences: each length drawn from Pr(le)
  - French sentence: r is distributed according to the same normal distribution, with

\[ r = \log\!\left( \frac{l_f}{l_{e_1} + l_{e_2}} \right) \]

- eff-bead:
  - English sentence: length drawn from Pr(le)
  - French sentences: r is distributed according to the same normal distribution, with

\[ r = \log\!\left( \frac{l_{f_1} + l_{f_2}}{l_e} \right) \]

  - Given the sum of the lengths of the French sentences, the probability of a particular pair lf1 and lf2 is proportional to

\[ \Pr(l_{f_1}) \, \Pr(l_{f_2}) \]
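The length model can be made concrete with a small sketch. It computes only the unnormalised Gaussian term exp(-(r - µ)² / (2σ²)) for the ef, eef and eff beads; the normalising constant, the Pr(le) terms and the Pr(lf1) Pr(lf2) factor are left out. The values of `mu` and `sigma2` below are placeholders chosen for illustration, not the parameters estimated by Brown et al., and the function names are hypothetical.

```python
import math

def ef_length_score(l_e, l_f, mu, sigma2):
    """Unnormalised Pr(lf | le) for an ef-bead, with r = log(lf / le)."""
    r = math.log(l_f / l_e)
    return math.exp(-((r - mu) ** 2) / (2.0 * sigma2))

def eef_length_score(l_e1, l_e2, l_f, mu, sigma2):
    """Same term for an eef-bead: r uses the summed English lengths."""
    return ef_length_score(l_e1 + l_e2, l_f, mu, sigma2)

def eff_length_score(l_e, l_f1, l_f2, mu, sigma2):
    """Same term for an eff-bead: r uses the summed French lengths."""
    return ef_length_score(l_e, l_f1 + l_f2, mu, sigma2)

# Placeholder parameters (assumed, not estimated from any corpus).
mu, sigma2 = 0.0, 0.25

same_length = ef_length_score(10, 10, mu, sigma2)    # r = 0
triple_length = ef_length_score(10, 30, mu, sigma2)  # r = log 3
two_to_one = eef_length_score(5, 5, 10, mu, sigma2)  # r = log(10 / 10)
```

With µ = 0, a French sentence of the same length as the English one gets the maximum score of 1, and the score falls off as the log ratio moves away from µ.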

Parameter Estimation
- Using the EM algorithm, estimate the parameters of the Markov model
- The following results were obtained


Results
- In a random sample of 1000 sentences, only 6 were not translations of each other
- Brown et al. have also studied the effect of anchor points
- According to them,
  - with paragraph markers but no anchor points, a 2.0% error rate is expected
  - with anchor points but no paragraph markers, a 2.3% error rate is expected
  - with neither anchor points nor paragraph markers, a 3.2% error rate is expected

Conclusion
- Comparable corpora can be used to generate bilingual dictionaries and parallel corpora
- Generating bilingual dictionary
  - Polysemy and sense disambiguation still remain a major challenge
- Generating parallel corpora
  - Given the aligned points, the aligner is likely to give good results
  - The experiments were very specific to the corpora, so the accuracy is hard to generalize
  - Sentence pairs whose lengths give them a high chance of being aligned, but which are completely wrong translations of each other, might confuse the aligner

References
1. Fung, P. (1995). Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. Proceedings of the 3rd Annual Workshop on Very Large Corpora, Boston, Massachusetts, 173-183.
2. Fung, P.; McKeown, K. (1997). Finding terminology translations from non-parallel corpora. Proceedings of the 5th Annual Workshop on Very Large Corpora, Hong Kong, 192-202.
3. Rapp, R. (1995). Identifying word translations in nonparallel texts. In: Proceedings of the 33rd Meeting of the Association for Computational Linguistics. Cambridge, Massachusetts, 320-322.
4. Rapp, R. (1999). Automatic Identification of Word Translations from Unrelated English and German Corpora. Proceedings of the ACL-99. pp. 1–17. College Park, USA.
5. Gaussier, E.; Renders, J.-M.; Matveeva, I.; Goutte, C.; Dejean, H. (2004). A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 527–534, Barcelona, Spain.
6. Brown, P. F.; Lai, J. C.; Mercer, R. L. (1991). Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL '91).
7. http://www.ilc.cnr.it/EAGLES/corpustyp/node21.html
