Bilingual terminology mining
DESCRIPTION
Material of the 4th Intensive Summer School and collaborative workshop on Natural Language Processing (NAIST Franco-Thai Workshop 2010), Bangkok, Thailand.

TRANSCRIPT
1
Bilingual Terminology Mining
Estelle Delpech
30th November, 2010
4th intensive summer school on Natural Language Processing
2
About me
● Estelle Delpech
● Research engineer at Lingua et Machina, France
  ● CAT tools provider
  ● ed(at)lingua-et-machina(dot)com
  ● www.lingua-et-machina.com
● Ph.D. candidate at LINA, France
  ● TALN team: specializes in NLP
  ● estelle.delpech(at)univ-nantes(dot)fr
3
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
4
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
5
What is a term?
● Classical definition:
  ● "unequivocal expression of a concept within a technical domain"
  ● Traces back to Eugen Wüster's 1930 "General Theory of Terminology"
  ● Specialized language is / should be unambiguous
[Figure: Ogden's semiotic triangle linking term, concept, and referent]
6
What is a term?
● Classical terminology was challenged in the 1990s by:
  ● sociolinguistics
  ● corpus-based linguistics
  ● computational terminology
● Observing terms in texts shows:
  ● there is variation and polysemy
  ● concepts evolve over time
  ● no clear-cut border between specialized and general languages
7
What is a term?
● The definition of "term" depends on the application / audience of the terminology
● Domain expert:
  ● unit of knowledge
● Information retrieval:
  ● descriptors for indexing
● Translation:
  ● word or phrase that:
    ● is not part of general language
    ● translates differently in a particular domain
  ● can be:
    ● noun, adjective, verb
    ● noun phrase, verb phrase, etc.
8
What is a terminology?
● Set of terms + terminological records
● Terminological record:
  ● part of speech
  ● frequency
  ● variants
  ● contexts
● Relations between terms / concepts:
  ● hypernymy: cat is a sort of animal
  ● meronymy: head is part of body
● Bilingual terminology:
  ● translation relations
9
[Screenshot: TERMIUM Plus, the Government of Canada's terminology bank]
http://www.termiumplus.gc.ca/
10
Where do you find terms?
● In specialized texts:
  ● research papers on breast cancer
  ● plane crash reports
● Corpus building:
  ● important to gather texts from a well-defined domain / theme
11
Bilingual terminology mining (1)
[Diagram: specialized texts → term extraction (data mining) → terms → term alignment → bilingual terminology → terminology management software]
12
Bilingual terminology mining (2)
[Diagram: specialized texts → synchronized term extraction and alignment → bilingual terminology → terminology management software]
13
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
Term extraction: a semi-supervised process
● The notion of term is "slippery"
● The same lexical unit may or may not be considered a term depending on:
  ● audience
  ● domain
  ● application
● Term extractors extract candidate terms that are:
  ● frequent in texts of a given domain
    ● HER2 gene
  ● term-like, i.e. well-formed phrases
    ● human cell lines
  ● groups of words that frequently occur together
    ● to compile a program
L'Homme, 2004
Term extraction: a semi-supervised, lexico-semantic process
[Diagram: specialized texts → term extractor → candidate terms → manual selection → terms → terminology; the terminology in turn feeds automatic indexing, linking texts, terms and concepts]
Termhood clues (1): frequency
● A term occurs frequently in specialized texts
  ● the higher, the better?
● Comparison with general language:
  ● Does the term occur more frequently than expected from general language?
  ● Compute significance tests:
    ● e.g., the χ² (chi-square) test
L'Homme, 2004
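As an illustration (not from the slides), here is a minimal sketch of such a test: a 2x2 chi-square comparing a candidate's frequency in the specialized corpus against a general-language corpus. All counts are assumed toy values.

```python
# Minimal sketch, assuming toy counts: chi-square test of whether a
# candidate term is over-represented in a specialized corpus relative
# to a general-language corpus.
def chi_square(term_spec, size_spec, term_gen, size_gen):
    """2x2 chi-square: term vs. other words, specialized vs. general corpus."""
    a, b = term_spec, size_spec - term_spec   # specialized corpus row
    c, d = term_gen, size_gen - term_gen      # general corpus row
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Assumed counts: "HER2" appears 120 times in 100k specialized words
# but only 3 times in 1M general-language words.
print(chi_square(120, 100_000, 3, 1_000_000))  # large value -> good candidate
```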
17
Termhood clues (2): form
● A term is a well-formed phrase
  ● ...HER2/neu oncogenes are members of...
● Match morpho-syntactic patterns
  ● e.g., NOUN + NOUN
● Many patterns:
  ● NOUN PREP DET NOUN
    ● alternation of the gene
  ● NOUN PREP NOUN COORD ADJ NOUN
    ● susceptibility to breast and ovarian cancer
  ● NOUN NOUN NOUN NOUN NOUN
    ● human breast cancer cell lines
Termhood clues (2): form
● Preprocessing:
  ● tokenization
  ● lemmatization
  ● POS tagging

Example: … HER-2/neu oncogenes are members of …
  tokens: HER-2/neu | oncogenes | are | members | of
  POS:    NOUN | NOUN | VERB | NOUN | PREP
  lemmas: HER-2/neu | oncogene | be | member | of
Identification of Syntactic Patterns
● Patterns expressed as regular expressions / finite-state automata
[Figure: finite-state automaton with states START, NOUN, PREP implementing the pattern NOUN (PREP? NOUN)?]
● NOUN: gene
● NOUN NOUN: HER2 gene
● NOUN PREP NOUN: member of family
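One simple way to implement this matching (a sketch, not the presenter's code) is to serialize the POS tags into a string and apply an ordinary regular expression. The input below is the lemmatized example from the preprocessing slide, extended with the extra noun family (an assumption) so that the NOUN PREP NOUN branch also fires.

```python
import re

# Lemmatized, POS-tagged toy sentence (the slide's example plus "family").
tagged = [("HER-2/neu", "NOUN"), ("oncogene", "NOUN"), ("be", "VERB"),
          ("member", "NOUN"), ("of", "PREP"), ("family", "NOUN")]

# Write the automaton NOUN (PREP? NOUN)? as a plain regular expression
# over the space-separated tag sequence.
tags = " ".join(tag for _, tag in tagged)
pattern = re.compile(r"NOUN( (PREP )?NOUN)?")

for m in pattern.finditer(tags):
    start = tags[:m.start()].count(" ")   # token index where the match begins
    length = m.group(0).count(" ") + 1    # number of tokens matched
    print([w for w, _ in tagged[start:start + length]])
# -> ['HER-2/neu', 'oncogene']
# -> ['member', 'of', 'family']
```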
Termhood clues (3): word association
● Significant cooccurrences are good clues for termhood:
  ● … breast cancer …
  ● … breast remains …
  ● … alternative cancer …
● Must take into account:
  ● the number of times the two words cooccur
  ● the number of times word A occurs
  ● the number of times word B occurs
Measuring cooccurrence significance
● Mutual information:

  MI(a,b) = \log_2 \frac{P(a,b)}{P(a) \cdot P(b)}

  P(a,b) = \mathrm{nbocc}(a,b) / N
  P(a) = \mathrm{nbocc}(a) / N
  N = total number of words in the corpus

● Remarkable attraction between invasive and carcinoma despite the relatively low number of cooccurrences
Church and Hanks, 1990
L'Homme, 2004
invasive carcinoma: 20 cooccurrences
invasive: 30 occurrences
carcinoma: 20 occurrences
MI = 9.7

cancer means: 50 cooccurrences
cancer: 800 occurrences
means: 800 occurrences
MI = 1.69
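A minimal sketch of this computation follows. The slide does not give the corpus size N, so the value below is an assumption; the exact MI values therefore differ from the slide's 9.7 and 1.69, but the high/low contrast is preserved.

```python
import math

def mutual_information(n_ab, n_a, n_b, n_total):
    """Pointwise mutual information: log2( P(a,b) / (P(a) * P(b)) )."""
    p_ab = n_ab / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return math.log2(p_ab / (p_a * p_b))

N = 1_000_000  # assumed corpus size (not given on the slide)
print(mutual_information(20, 30, 20, N))    # invasive / carcinoma: high MI
print(mutual_information(50, 800, 800, N))  # cancer / means: much lower MI
```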
22
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
23
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
  ● in parallel corpora
  ● in comparable corpora
24
Parallel and comparable corpora
● Parallel corpora
  ● source text and target texts are translations
  ● reduce the search space little by little:
    ● first sentences
    ● then terms
● Comparable corpora
  ● not translations, but very similar in topic
  ● good proportion of term translations
  ● search space: all terms of the target corpus
25
Sentence alignment (1)
● Gale and Church (1993)'s hypothesis:
  ● translated sentences have roughly the same length
  ● the probability P(S,T) that sentence S translates into T is based on the length difference
● Improvement: use a seed lexicon
  ● the probability P(S,T) is then based on the number of words in common
Gale and Church, 1993
26
Sentence alignment (2)
● Compute probabilities for all pairs (S,T)
● Build a matrix where M(i,j) contains the probability that sentence i translates to sentence j
Gale and Church, 1993

      0     1     2     ...   n
0     0.89  0.56  0.2   ...   ...
1     0.45  0.9   0.1   ...   ...
2     ...   0.23  0.9   0.3   ...
...   ...   ...   0.44  0.76  ...
m     ...   ...   ...   ...   0.88
27
Sentence alignment (3)
● Use dynamic programming to find the best "path", i.e. the best sequence of alignments
Gale and Church, 1993

      0     1     2     ...   n
0     0.89  0.56  0.2   ...   ...
1     0.45  0.9   0.1   ...   ...
2     ...   0.23  0.9   0.3   ...
...   ...   ...   0.44  0.76  ...
m     ...   ...   ...   ...   0.88
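A sketch of the dynamic-programming search, under simplifying assumptions: the score is a crude length-ratio penalty rather than Gale and Church's probabilistic cost, and only 1-1 matches and one-sided skips are allowed (the original also handles 2-1, 1-2 and 2-2 alignments).

```python
def length_score(ls, lt):
    # Crude stand-in for P(S,T): penalize the length difference.
    return -abs(ls - lt) / max(ls, lt, 1)

def align(src_lens, tgt_lens, skip_penalty=-1.0):
    """Best path through the score matrix via dynamic programming."""
    n, m = len(src_lens), len(tgt_lens)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # align sentence i-1 with sentence j-1
                s = best[i - 1][j - 1] + length_score(src_lens[i - 1], tgt_lens[j - 1])
                if s > best[i][j]:
                    best[i][j], back[i][j] = s, (i - 1, j - 1)
            if i > 0:             # leave a source sentence unaligned
                s = best[i - 1][j] + skip_penalty
                if s > best[i][j]:
                    best[i][j], back[i][j] = s, (i - 1, j)
            if j > 0:             # leave a target sentence unaligned
                s = best[i][j - 1] + skip_penalty
                if s > best[i][j]:
                    best[i][j], back[i][j] = s, (i, j - 1)
    pairs, ij = [], (n, m)        # trace the best path back to (0, 0)
    while ij != (0, 0):
        pi, pj = back[ij[0]][ij[1]]
        if (pi, pj) == (ij[0] - 1, ij[1] - 1):
            pairs.append((pi, pj))
        ij = (pi, pj)
    return pairs[::-1]

# Sentence lengths in words (toy values).
print(align([20, 35, 12], [22, 33, 14]))  # -> [(0, 0), (1, 1), (2, 2)]
```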
28
Sub-sentential alignment: AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● AnyMalign is a sub-sentential aligner
  ● aligns words and groups of words to build translation tables for MT
● Aligned groups of words:
  ● behave more or less like statistical collocations
  ● it is possible to find term patterns in these groups of words
29
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● The algorithm is based on "perfect alignments":
  ● words or groups of words that occur in exactly the same aligned sentences
Example corpus:
  a d ↔ A D
  a e ↔ A DD
  b ↔ B
  b ↔ C
Here a ↔ A is a perfect alignment.
30
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● How to get more "perfect alignments"?
  ● with smaller corpora
● How to get smaller corpora?
  ● randomly select subcorpora from your corpora
Example: two random subcorpora drawn from the toy corpus above yield new perfect alignments:
  subcorpus 1: b ↔ B
  subcorpus 2: a ↔ A
31
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● The complements of perfect alignments are likely to be good alignments too:
  ● perfect alignment: a ↔ A
  ● complements: d ↔ D, e ↔ DD
    (from the sentence pairs a d ↔ A D and a e ↔ A DD)
32
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● Process: iteratively extract random samples of random size from your corpora
● Extract "perfect alignments" and their complements
● The same alignment can occur several times
● Count, for each alignment, the number of times it occurs
33
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● Output:
  ● alignments sorted by descending number of occurrences
  ● alignment probability:

  P(S \mid T) = \frac{C(S,T)}{C(T)}

  S = source group of words
  T = target group of words
  C(S,T) = number of times S was aligned with T
  C(T) = number of times T appears in an alignment
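A toy sketch of the sampling loop, simplified to single words (real AnyMalign also extracts word groups and the complements of perfect alignments):

```python
import random
from collections import Counter, defaultdict

def perfect_alignments(bitext):
    """Word pairs (one per language) occurring in exactly the same sentence pairs."""
    where_src, where_tgt = defaultdict(set), defaultdict(set)
    for i, (src, tgt) in enumerate(bitext):
        for w in src.split():
            where_src[w].add(i)
        for w in tgt.split():
            where_tgt[w].add(i)
    return {(s, t) for s in where_src for t in where_tgt
            if where_src[s] == where_tgt[t]}

def anymalign_like(bitext, n_samples=1000):
    """Count perfect alignments over many random subcorpora of random size."""
    counts = Counter()
    for _ in range(n_samples):
        size = random.randint(1, len(bitext))
        counts.update(perfect_alignments(random.sample(bitext, size)))
    return counts.most_common()

# The toy corpus from the earlier slides.
bitext = [("a d", "A D"), ("a e", "A DD"), ("b", "B"), ("b", "C")]
print(anymalign_like(bitext)[:5])  # frequent pairs such as ('a', 'A') rank first
```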
34
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
Advantages:
● can perform alignment with more than 2 languages at the same time
  ● 1 language → statistical collocations
● extracts and aligns non-contiguous sequences of words
  ● to give something up
  ● to let someone down
● no a priori expectations on terms
  ● sometimes a term in the source language is not translated by a term
  ● terms = what you can align
35
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● Word groups are not grammatical phrases; the aligner outputs:
  ● that sample sentences and
  ● exchange format fitted for the
  but not:
  ● sample sentences
  ● exchange format
● Solutions:
  ● find term patterns
  ● use heuristics
  ● trim stop words
36
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
  ● in parallel corpora
  ● in comparable corpora
37
Advantages of comparable corpora
● More available:
  ● new languages
  ● new language pairs
  ● new topics / domains
● Less expensive to build
● More natural:
  ● the data was produced spontaneously
  ● no influence from a source text
38
Contextual approach
● Based on distributional linguistics (Z. Harris)
● Words with similar meaning appear in similar contexts
● If source and target words have similar contexts, they might be translations
● Compute contexts for each source and target word
● Compare contexts
● Find the most similar contexts
39
Contextual approach
● Representation of the context of a given word as a vector:
  ● head word + collocates
[Figure: context vector of the head word drink, with collocates water, beer, mouth, glass, ...]
● The vector associates the "head" word with its most frequent collocates
● + some indication of the strength of association between the head word and its collocates
40
Building the context vector for "drink"
● Collocates: words occurring at a distance of at most n words from the head

  is variety of reasons to drink plenty of water each day
  simple as a glass of drinking water be the key to the
  popular in Japan today to drink water from glass after waking

● (drink, water) = 3
● (drink, glass) = 2
● (drink, Japan) = 1
● (drink, reason) = 1
● (drink, plenty) = 1
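A minimal sketch of this counting step, with a symmetric window of n tokens around each occurrence of the head word:

```python
from collections import Counter

def context_vector(head, tokens, window=3):
    """Count collocates within +/- window tokens of each occurrence of head."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == head:
            lo = max(0, i - window)
            vec.update(tokens[lo:i] + tokens[i + 1:i + 1 + window])
    return vec

# One of the slide's example contexts.
corpus = "variety of reasons to drink plenty of water each day".split()
print(context_vector("drink", corpus))
# Counter({'of': 2, 'reasons': 1, 'to': 1, 'plenty': 1, 'water': 1})
```

In practice the counts are summed over the whole corpus and then normalized, as shown on the next slides.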
41
Normalized cooccurrence frequency
● Normalization: use a measure like MI or the log-likelihood ratio to counteract the influence of high-frequency words
● Example (log-likelihood ratio):
  ● 1000 cooccurrences in the corpus
  ● (drink, x) = 75 cooc.
  ● (water, y) = 75 cooc.
  ● (drink, water) = 50 cooc.

            water   ¬water
  drink     50      25      75
  ¬drink    25      900     925
            75      925     1000

Dunning, 1993
42
Log-likelihood ratio
● Contingency table:

            water   ¬water
  drink     a       b       e
  ¬drink    c       d       h
            f       g       N

● Formula:

  \mathrm{llr}(water, drink) = a \log a + b \log b + c \log c + d \log d + N \log N - e \log e - f \log f - g \log g - h \log h

● log-likelihood ratio(drink, water) = 45.05
Dunning, 1993
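A sketch of this computation. With base-10 logarithms it reproduces the slide's 45.05 on the contingency table of the previous slide; other log bases rescale the value but preserve the ranking.

```python
import math

def xlx(x):
    # x * log10(x), with the usual convention 0 * log 0 = 0
    return x * math.log10(x) if x > 0 else 0.0

def llr(a, b, c, d):
    """Log-likelihood ratio from a 2x2 contingency table, in the slide's form."""
    e, h = a + b, c + d   # row sums
    f, g = a + c, b + d   # column sums
    n = a + b + c + d
    return (xlx(a) + xlx(b) + xlx(c) + xlx(d) + xlx(n)
            - xlx(e) - xlx(f) - xlx(g) - xlx(h))

print(round(llr(50, 25, 25, 900), 2))  # -> 45.05, matching the slide
```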
43
Context vector comparison
● Compute context vectors for words in the source and the target corpus
● How to compare word contexts across different languages?
[Figure: context vector of English drink (water, beer, mouth, glass, ...) next to the context vector of Thai ดื่ม (น้ำ, เบียร์, ปาก, แก้ว, ...)]
Rapp, 1995; Fung, 1997
44
Context vector comparison
● Use a seed lexicon to map collocates
[Figure: a Thai-English seed lexicon maps the collocates of drink (water, beer, mouth, glass, ...) onto the collocates of ดื่ม (น้ำ, เบียร์, ปาก, แก้ว, ...)]
Rapp, 1995; Fung, 1997
45
Context vector comparison
● Measuring the context similarity of words a and b
  ● = measuring the cosine of the angle between the vector of a and the vector of b

  \cos(a,b) = \frac{\sum_{c \in a \cup b} w(c,a) \cdot w(c,b)}{\sqrt{\sum_{c \in a} w(c,a)^2} \cdot \sqrt{\sum_{c \in b} w(c,b)^2}}

  c ∈ x: collocate in the vector of x
  w(c,x): weight of association of collocate c with head x

● Select the top 1, 10 or 20 closest words as candidate translations
Rapp, 1995; Fung, 1997
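A sketch combining the two previous steps: translate the source vector's collocates through the seed lexicon, then compare with the cosine. Weights and lexicon entries below are assumed toy values.

```python
import math

def translate_vector(vec, lexicon):
    """Map source-language collocates into the target language;
    collocates missing from the seed lexicon are simply dropped."""
    return {lexicon[c]: w for c, w in vec.items() if c in lexicon}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy vectors for English "drink" and Thai ดื่ม, plus a tiny seed lexicon.
drink_en = {"water": 0.9, "beer": 0.7, "glass": 0.5}
lexicon = {"water": "น้ำ", "beer": "เบียร์", "glass": "แก้ว"}
drink_th = {"น้ำ": 0.8, "เบียร์": 0.6, "แก้ว": 0.4}
print(cosine(translate_vector(drink_en, lexicon), drink_th))  # close to 1.0
```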
46
Contextual approach: improvements
● Use syntactic collocates
● Improve the dictionary with cognates, transliterations, other dictionaries
● Give more weight to "anchor words":
  ● cognates, transliterations
  ● frequent, monosemous words
● Filter with part-of-speech
● Favor reciprocal translations
[Figure: reciprocal translation between source words a, b, c, d and target words a', b', c', d']
Chiao and Zweigenbaum, 2002; Sadat et al., 2003; Gamallo and Campos, 2005; Koehn and Knight, 2002; Prochasson, 2010
47
Variant to direct translation of the vector
● "Interlingual" translation
● Translate the n closest words instead of the context vector
● Seed lexicon: some mappings between source and target words
[Figure: source and target vector spaces linked by the seed lexicon]
Déjean and Gaussier, 2002
48
Variant to direct translation of the vector
● To translate term T:
  ● find its n closest words
  ● these closest words are in the seed lexicon
[Figure: the n closest source words are mapped into the target space through the seed lexicon]
Déjean and Gaussier, 2002
49
Variant to direct translation of the vector
● Find the target term which is closest to the n closest words
[Figure: the candidate target term is the one closest to the mapped words]
Déjean and Gaussier, 2002
50
Variant to direct translation of the vector
● "Interlingual" approach
● Translate the closest words instead of the context directly
[Figure: summary of the interlingual mapping between source and target spaces]
Déjean and Gaussier, 2002
51
Adaptation to multi-word terms
● Context vector of a multi-word term:
  ● union of the vectors of each word of the term
[Figure: the context vectors of energy (strong, beer, ..., glass) and drink (beer, mouth, glass, water, ...) are united into the vector of energy drink (strong, beer, mouth, glass, ...)]
Morin et al., 2004
Morin and Daille, 2009
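A minimal sketch of the vector union, here realized as summed collocate counts (toy values; the cited papers' exact combination may differ):

```python
from collections import Counter

# Toy context vectors of the component words (assumed counts).
energy = Counter({"strong": 5, "glass": 1})
drink = Counter({"water": 3, "beer": 2, "glass": 2, "mouth": 1})

# The multi-word term inherits the union of its components' collocates.
energy_drink = energy + drink
print(energy_drink)
# Counter({'strong': 5, 'glass': 3, 'water': 3, 'beer': 2, 'mouth': 1})
```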
52
Evaluation
unit type           corpus                          precision
single-word units   big, general-language corpus    80%
multi-word units    small, specialized corpus       60%
multi-word terms    small, specialized corpus       42%

● big = hundreds of millions of words
● small = 100 thousand to one million words
Morin and Daille, 2010
● Precision on the Top N candidates
  ● e.g. 50% on Top 20: the correct translation is in the Top 20 best candidates for 50% of the source terms
53
Why is it so difficult?
● the translation might not be present in the target corpus
● the target term has not been extracted
● polysemous words: undiscriminating, fuzzy vectors
● low-frequency words: unreliable vectors
● the translation has a different usage in the target language
● big search space: all the words of the target corpus
→ cannot be fully automatic
→ semi-supervised term alignment
54
Thank you
ed(a)lingua-et-machina.com
Franco-Thai Workshop 2010
4th intensive summer school on Natural Language Processing