bilingual terminology mining

1 Bilingual Terminology Mining Estelle Delpech 30 th November, 2010 4 th intensive summer school on Natural Language Processing

Upload: estelle-delpech

Post on 11-Jun-2015


DESCRIPTION

Material of the 4th Intensive Summer school and collaborative workshop on Natural Language Processing (NAIST Franco-Thai Workshop 2010). Bangkok, Thailand.

TRANSCRIPT

Page 1: Bilingual terminology mining

1

Bilingual Terminology Mining

Estelle Delpech
30th November, 2010

4th intensive summer school on Natural Language Processing

Page 2: Bilingual terminology mining

2

About me

● Estelle Delpech

● Research engineer at Lingua et Machina, France

  ● CAT tools provider
  ● ed(at)lingua-et-machina(dot)com
  ● www.lingua-et-machina.com

● Ph.D. candidate at LINA, France
  ● TALN team: specialises in NLP
  ● estelle.delpech(at)univ-nantes(dot)fr

Page 3: Bilingual terminology mining

3

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment

Page 4: Bilingual terminology mining

4

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment

Page 5: Bilingual terminology mining

5

What is a term ?

● Classical definition:
  ● “unequivocal expression of a concept within a technical domain”
  ● Traces back to 1930: Eugen Wüster, « General Theory of Terminology »
  ● Specialized language is / should be unambiguous

[Figure: Ogden semiotic triangle linking concept, term and referent]

Page 6: Bilingual terminology mining

6

What is a term ?

● Classical terminology was challenged in the 1990s by:
  ● sociolinguistics
  ● corpus-based linguistics
  ● computational terminology
● Observing terms in texts shows that:
  ● there is variation and polysemy
  ● concepts evolve over time
  ● there is no clear-cut border between specialized and general language

Page 7: Bilingual terminology mining

7

What is a term ?

● The definition of « term » depends on the application / audience of the terminology
● Domain expert:
  ● unit of knowledge
● Information retrieval:
  ● descriptors for indexing
● Translation:
  ● word or phrase that:
    ● is not part of general language
    ● translates differently in a particular domain
  ● can be:
    ● noun, adjective, verb
    ● noun phrase, verb phrase, etc.

Page 8: Bilingual terminology mining

8

What is a terminology ?

● Set of terms + terminological records
● Terminological record:
  ● Part of speech
  ● Frequency
  ● Variants
  ● Contexts
● Relations between terms / concepts
  ● Hypernymy: cat is a sort of animal
  ● Meronymy: head is part of body
● Bilingual terminology:
  ● Translation relations

Page 9: Bilingual terminology mining

9

http://www.termiumplus.gc.ca/

Page 10: Bilingual terminology mining

10

Where do you find terms ?

● In specialized texts:
  ● Research papers on breast cancer
  ● Plane crash reports
● Corpus building:
  ● important to gather texts from a well-defined domain / topic

Page 11: Bilingual terminology mining

11

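Bilingual terminology mining (1)

[Figure: pipeline. Specialized texts in each language go through term extraction (data mining) to produce terms; the two sets of terms go through term alignment to produce a bilingual terminology, which feeds terminology management software.]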

Page 12: Bilingual terminology mining

12

Bilingual terminology mining (2)

[Figure: pipeline. Specialized texts go through synchronized term extraction and alignment, which yields the terms of both languages as a bilingual terminology for terminology management software.]

Page 13: Bilingual terminology mining

13

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment

Page 14: Bilingual terminology mining

Term extraction : semi-supervised process

● The notion of term is « slippery »
● The same lexical unit may or may not be considered a term depending on:
  ● Audience
  ● Domain
  ● Application
● Term extractors extract candidate terms, which:
  ● are frequent in texts of a given domain
    ● HER2 gene
  ● look like terms: well-formed phrases
    ● human cell lines
  ● are groups of words that frequently occur together
    ● to compile a program

L'Homme, 2004

Page 15: Bilingual terminology mining

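Term extraction : semi-supervised, lexico-semantic process

[Figure: specialized texts are fed to a term extractor, which outputs candidate terms; manual selection turns candidate terms into terms. The terms are used for automatic indexing of texts and for building a terminology that links terms to concepts.]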

Page 16: Bilingual terminology mining

 Termhood  clues (1) : Frequency

● A term occurs frequently in specialized texts
  ● the higher the frequency, the better ?
● Comparison with general language:
  ● Does the term occur more frequently than expected in general language ?
● Compute significance tests:
  ● e.g. χ² (chi-square)

L'Homme, 2004
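The frequency comparison above can be sketched as a chi-square test on a 2x2 table (term vs. other words, specialized vs. general corpus). This is a minimal illustration only: the function name and all counts are hypothetical, and 3.841 is the usual critical value for one degree of freedom at p = 0.05.

```python
def chi_square_2x2(term_spec, total_spec, term_gen, total_gen):
    """Pearson chi-square for term vs. other words in two corpora."""
    observed = [
        [term_spec, total_spec - term_spec],  # specialized corpus
        [term_gen, total_gen - term_gen],     # general-language corpus
    ]
    n = total_spec + total_gen
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts: 120 occurrences in a 100,000-word specialized corpus,
# 15 occurrences in a 1,000,000-word general corpus.
score = chi_square_2x2(120, 100_000, 15, 1_000_000)
CRITICAL_95 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05
more_frequent_than_expected = score > CRITICAL_95 and 120 / 100_000 > 15 / 1_000_000
print(score, more_frequent_than_expected)
```

The statistic alone does not say in which corpus the word is over-represented, hence the extra comparison of relative frequencies.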

Page 17: Bilingual terminology mining

17

Termhood clues (2) : form

● A term is a well-formed phrase
  ● ...HER2/neu oncogenes are members of...
● Match morpho-syntactic patterns
  ● Ex: NOUN + NOUN
● There are many patterns:
  ● NOUN PREP DET NOUN
    ● alternation of the gene
  ● NOUN PREP NOUN COORD ADJ NOUN
    ● susceptibility to breast and ovarian cancer
  ● NOUN NOUN NOUN NOUN NOUN
    ● human breast cancer cell lines

Page 18: Bilingual terminology mining

Termhood clues (2) : form

● Preprocessing:
  ● Tokenization
  ● Lemmatisation
  ● POS tagging

… HER-2/neu oncogenes are members of …

Token   HER-2/neu   oncogenes   are    members   of
POS     NOUN        NOUN        VERB   NOUN      PREP
Lemma   HER-2/neu   oncogene    be     member    of

Page 19: Bilingual terminology mining

Identification of Syntactic Patterns

● Patterns are expressed as regular expressions / finite-state automata

[Figure: finite-state automaton with states START, NOUN, PREP, NOUN]

NOUN (PREP? NOUN)?

● NOUN : gene
● NOUN NOUN : HER2 gene
● NOUN PREP NOUN : member of family
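One way to apply the pattern NOUN (PREP? NOUN)? is to encode each POS tag as a single character and run an ordinary regular expression over the resulting string. The sketch below assumes the text is already lemmatised and tagged; the tag codes and the function name are illustrative choices, not part of any particular term extractor.

```python
import re

# One character per POS tag; any other tag becomes "x".
TAG_CODES = {"NOUN": "N", "PREP": "P"}
# NOUN (PREP? NOUN)?  written over the one-character codes.
PATTERN = re.compile(r"N(?:P?N)?")


def extract_candidates(tagged):
    """Return lemma sequences whose POS tags match the pattern.

    `tagged` is a list of (lemma, pos) pairs.
    """
    codes = "".join(TAG_CODES.get(pos, "x") for _, pos in tagged)
    return [
        " ".join(lemma for lemma, _ in tagged[m.start():m.end()])
        for m in PATTERN.finditer(codes)
    ]


tagged_sentence = [
    ("HER-2/neu", "NOUN"), ("oncogene", "NOUN"), ("be", "VERB"),
    ("member", "NOUN"), ("of", "PREP"),
]
candidates = extract_candidates(tagged_sentence)
print(candidates)
```

`finditer` returns non-overlapping, leftmost matches, so nested candidates (for example "oncogene" inside "HER-2/neu oncogene") are not listed separately in this sketch.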

Page 20: Bilingual terminology mining

Termhood clues (3) : word association

● Significant cooccurrences are good clues for termhood:
  ● … breast cancer …
  ● … breast remains …
  ● … alternative cancer …
● Must take into account:
  ● number of times the two words cooccur
  ● number of times word A occurs
  ● number of times word B occurs

Page 21: Bilingual terminology mining

Measure for cooccurrence significance

● Mutual Information:

$$MI(a,b) = \log_2 \frac{P(a,b)}{P(a) \cdot P(b)}$$

$$P(a,b) = \frac{nbocc(a,b)}{N} \qquad P(a) = \frac{nbocc(a)}{N}$$

N = total number of words in the corpus

● Remarkable attraction between invasive and carcinoma despite a relatively low number of cooccurrences

Word(s)              Occurrences   MI
invasive carcinoma   20            9.7
invasive             30
carcinoma            20
cancer means         50            1.69
cancer               800
means                800

Church and Hanks, 1990
L'Homme, 2004
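The mutual information formula translates directly into code. In the sketch below the corpus size N is an assumption, since the slide does not state it, so the absolute values are only indicative. The gap between the two pairs does not depend on N.

```python
import math


def mutual_information(cooc_ab, occ_a, occ_b, n_words):
    """MI(a,b) = log2( P(a,b) / (P(a) * P(b)) ), probabilities from raw counts."""
    p_ab = cooc_ab / n_words
    p_a = occ_a / n_words
    p_b = occ_b / n_words
    return math.log2(p_ab / (p_a * p_b))


N = 25_000  # hypothetical corpus size, not given in the source

mi_invasive_carcinoma = mutual_information(20, 30, 20, N)
mi_cancer_means = mutual_information(50, 800, 800, N)
print(mi_invasive_carcinoma, mi_cancer_means)
```

Although "cancer means" cooccurs more often in absolute terms, its score is far lower because both words are frequent on their own.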

Page 22: Bilingual terminology mining

22

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment

Page 23: Bilingual terminology mining

23

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment
  ● in parallel corpora
  ● in comparable corpora

Page 24: Bilingual terminology mining

24

Parallel and comparable corpora

● Parallel corpora
  ● Source and target texts are translations of each other
  ● Reduce the search space little by little
    ● First sentences
    ● Then terms
● Comparable corpora
  ● Not translations, but very similar in topic
  ● Good proportion of term translations
  ● Search space:
    ● All terms of the target corpus

Page 25: Bilingual terminology mining

25

Sentence alignment (1)

● Gale and Church (1993)'s hypothesis:
  ● Translated sentences have roughly the same length
  ● The probability P(S,T) that sentence S translates into T is based on the length difference
● Improvement: use a seed lexicon
  ● The probability P(S,T) is based on the number of words in common

Gale and Church, 1993

Page 26: Bilingual terminology mining

26

Sentence alignment (2)

● Compute probabilities for all pairs (S,T)
● Build a matrix where M(i,j) contains the probability that sentence i translates to sentence j

Gale and Church, 1993

      0      1      2      ...    n
0     0.89   0.56   0.2    ...    ...
1     0.45   0.9    0.1    ...    ...
2     ...    0.23   0.9    0.3    ...
...   ...    ...    0.44   0.76   ...
m     ...    ...    ...    ...    0.88

Page 27: Bilingual terminology mining

27

Sentence alignment (2)

● Use dynamic programming to find the best “path”, i.e. the best alignments

Gale and Church, 1993

      0      1      2      ...    n
0     0.89   0.56   0.2    ...    ...
1     0.45   0.9    0.1    ...    ...
2     ...    0.23   0.9    0.3    ...
...   ...    ...    0.44   0.76   ...
m     ...    ...    ...    ...    0.88
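The search for the best path can be sketched with a small dynamic program. This is a simplified stand-in, not the Gale and Church procedure itself: it only allows 1-1 alignments and unaligned sentences, and it sums the matrix scores along the path, whereas the original method works with a length-based distance and also handles 1-2, 2-1 and 2-2 alignments. The 3x3 matrix reuses the values shown on the slide; the cell M(2,0), elided in the source, is filled with a hypothetical 0.05.

```python
def best_path(prob):
    """Return the monotone list of (i, j) pairs maximizing the summed scores."""
    m, n = len(prob), len(prob[0])
    # score[i][j] = best total using the first i source and first j target sentences
    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + prob[i - 1][j - 1],  # align i-1 with j-1
                score[i - 1][j],                           # source sentence left unaligned
                score[i][j - 1],                           # target sentence left unaligned
            )
    # Walk back from the bottom-right corner to recover the alignments.
    path, i, j = [], m, n
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + prob[i - 1][j - 1]:
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]


matrix = [
    [0.89, 0.56, 0.20],
    [0.45, 0.90, 0.10],
    [0.05, 0.23, 0.90],  # 0.05 is a hypothetical value for the elided cell
]
alignment = best_path(matrix)
print(alignment)
```

The monotonicity constraint (the path never goes back up or left) encodes the assumption that translated sentences appear in the same order in both texts.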

Page 28: Bilingual terminology mining

28

Sub sentence alignment : AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

● AnyMalign is a sub-sentential aligner
● Aligns words and groups of words for MT translation tables
● Aligned groups of words:
  ● are more or less like statistical collocations
  ● it is possible to find term patterns in these groups of words

Page 29: Bilingual terminology mining

29

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

Aligned sentences:
a d ↔ A D
b ↔ B
b ↔ C
a e ↔ A DD

a ↔ A is a perfect alignment

● The algorithm is based on « perfect alignments »:
  ● words or groups of words that occur in exactly the same aligned sentences

Page 30: Bilingual terminology mining

30

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

Aligned sentences:
a d ↔ A D
b ↔ B
b ↔ C
a e ↔ A DD

● How to get more « perfect alignments » ?
  ● with smaller corpora
● How to get smaller corpora ?
  ● randomly select subcorpora from your corpus

Subcorpus 1 : b ↔ B
Subcorpus 2 : a ↔ A

Page 31: Bilingual terminology mining

31

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

Aligned sentences:
a d ↔ A D
b ↔ B
b ↔ C
a e ↔ A DD

● Complements of perfect alignments are likely to be good alignments too:
  ● Perfect alignment: a ↔ A
  ● Complements: d ↔ D, e ↔ DD

Page 32: Bilingual terminology mining

32

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

● Process: iteratively extract random samples of random size from your corpus
● Extract « perfect alignments » and their complements
● The same alignment can occur several times
● Count, for each alignment, the number of times it occurs

Page 33: Bilingual terminology mining

33

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

● Output:
  ● alignments sorted by descending number of occurrences
  ● Alignment probability:

$$P(S \mid T) = \frac{C(S,T)}{C(T)}$$

S = source group of words
T = target group of words
C(S,T) = number of times S was aligned with T
C(T) = number of times T appears in an alignment
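The sampling idea can be illustrated on the four-sentence toy corpus from the slides. The code below is a loose sketch written for this transcript, not the AnyMalign implementation: the function names, the number of iterations and the way complements are collected are all assumptions. Words are grouped when they occur in exactly the same sentences of a sample, and each group that contains words of both languages is counted as an alignment, together with its complement in each sentence.

```python
import random
from collections import Counter, defaultdict

CORPUS = [("a d", "A D"), ("b", "B"), ("b", "C"), ("a e", "A DD")]


def perfect_alignments(sample):
    """Alignments (and complements) found in one sample of sentence pairs."""
    occurrences = defaultdict(set)  # (side, word) -> indices of sentences containing it
    for idx, (src, tgt) in enumerate(sample):
        for word in src.split():
            occurrences[("src", word)].add(idx)
        for word in tgt.split():
            occurrences[("tgt", word)].add(idx)
    # Words with an identical set of sentences end up in the same group.
    groups = defaultdict(lambda: {"src": set(), "tgt": set()})
    for (side, word), sentences in occurrences.items():
        groups[frozenset(sentences)][side].add(word)
    found = []
    for sentences, group in groups.items():
        if not group["src"] or not group["tgt"]:
            continue  # no counterpart in the other language
        for idx in sentences:
            src_words, tgt_words = sample[idx][0].split(), sample[idx][1].split()
            inside = (tuple(w for w in src_words if w in group["src"]),
                      tuple(w for w in tgt_words if w in group["tgt"]))
            outside = (tuple(w for w in src_words if w not in group["src"]),
                       tuple(w for w in tgt_words if w not in group["tgt"]))
            found.append(inside)
            if outside[0] and outside[1]:
                found.append(outside)  # complement of the perfect alignment
    return found


def anymalign_sketch(corpus, iterations=200, seed=0):
    """Count alignments over random samples of random size."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(iterations):
        sample = rng.sample(corpus, rng.randint(1, len(corpus)))
        counts.update(perfect_alignments(sample))
    return counts


def probability(counts, source, target):
    """P(S|T) = C(S,T) / C(T)."""
    c_t = sum(c for (_, t), c in counts.items() if t == target)
    return counts[(source, target)] / c_t if c_t else 0.0


counts = anymalign_sketch(CORPUS)
print(counts.most_common(5))
print(probability(counts, ("a",), ("A",)))
```

On the full corpus b has no perfect alignment, because it occurs in two sentences while B and C occur in one each; a sample that contains only one of the two b sentences is what makes b ↔ B or b ↔ C appear.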

Page 34: Bilingual terminology mining

34

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

Advantages:
● Can perform alignment with more than 2 languages at the same time
  ● 1 language → statistical collocations
● Extracts and aligns non-contiguous sequences of words
  ● to give something up
  ● to let someone down
● No a priori expectations on terms
  ● Sometimes a term in the source language is not translated by a term
  ● Terms = what you can align

Page 35: Bilingual terminology mining

35

AnyMalign (Lardilleux, 2010)

Lardilleux et al., 2010

● Word groups are not grammatical phrases:
  ● the aligner outputs « that sample sentences and » and « exchange format fitted for the »
  ● but not « sample sentences » or « exchange format »
● Solutions:
  ● find term patterns
  ● use heuristics
    ● trim stop words

Page 36: Bilingual terminology mining

36

Presentation outline

● About terms, terminology, terminology mining

● Term Extraction
● Term Alignment
  ● in parallel corpora
  ● in comparable corpora

Page 37: Bilingual terminology mining

37

Advantages of comparable corpora

● More available
  ● new languages
  ● new language pairs
  ● new topics / domains
● Less expensive to build
● More natural
  ● data was produced spontaneously
  ● no influence from a source text

Page 38: Bilingual terminology mining

38

Contextual approach

● Based on distributional linguistics (Z. Harris)

● Words with similar meaning appear in similar contexts

● If source and target words have similar contexts, they might be translations

● Compute contexts for each source and target word

● Compare contexts● Find the most similar contexts

Page 39: Bilingual terminology mining

39

Contextual approach

● Representation of the context of a given word with a vector:
  ● Head word + collocates

[Figure: context vector of the head word « drink », with one weighted cell per collocate: water, beer, mouth, glass, ...]

● The vector associates the « head » word with its most frequent collocates
● + some indication of the strength of association between head word and collocates

Page 40: Bilingual terminology mining

40

Building context vector for « drink »

● Collocates: words occurring within a distance of n words from the head

is variety of reasons to drink plenty of water each day
simple as a glass of drinking water be the key to the
popular in Japan today to drink water from glass after waking

● (drink, water) = 3
● (drink, glass) = 2
● (drink, Japan) = 1
● (drink, reason) = 1
● (drink, plenty) = 1
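The counting step can be sketched on the three example contexts. The window size (3), the stop-word list and the tiny lemma table below are assumptions made for the illustration; a real pipeline would rely on a lemmatiser and a proper stop list.

```python
from collections import Counter

# Hypothetical stop-word list and hand-written stand-in for a lemmatiser.
STOP_WORDS = {"a", "as", "be", "from", "in", "is", "of", "the", "to"}
LEMMAS = {"reasons": "reason", "drinking": "drink"}

SENTENCES = [
    "is variety of reasons to drink plenty of water each day",
    "simple as a glass of drinking water be the key to the",
    "popular in Japan today to drink water from glass after waking",
]


def context_vector(head, sentences, window=3):
    """Count the content words found within `window` tokens of `head`."""
    counts = Counter()
    for sentence in sentences:
        tokens = [LEMMAS.get(t, t) for t in sentence.lower().split()]
        for i, token in enumerate(tokens):
            if token != head:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] not in STOP_WORDS:
                    counts[tokens[j]] += 1
    return counts


drink_vector = context_vector("drink", SENTENCES)
print(drink_vector.most_common())
```

With these settings the counts for water, glass, japan, reason and plenty agree with the list above; the sketch also picks up "today", which a fuller stop list or a different window would filter out.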

Page 41: Bilingual terminology mining

41

Normalized cooccurrence frequency

● Ex: log-likelihood ratio
  ● 1000 cooc. in corpus
  ● (drink, x) = 75 cooc.
  ● (water, y) = 75 cooc.
  ● (drink, water) = 25 cooc.

           water   ¬ water   total
drink      50      25        75
¬ drink    25      900       925
total      75      925       1000

● Normalization: use a measure like MI or the log-likelihood ratio to counteract the influence of high-frequency words

Dunning, 1993

Page 42: Bilingual terminology mining

42

Log-likelihood ratio

● Contingency table:

           water   ¬ water   total
drink      a       b         e
¬ drink    c       d         h
total      f       g         N

$$\text{log-likelihood ratio}(water, drink) = a \log a + b \log b + c \log c + d \log d + N \log N - e \log e - f \log f - g \log g - h \log h$$

● log-likelihood ratio (drink, water) = 45.05

Dunning, 1993
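The formula can be checked numerically. The sketch below takes the four inner cells of the contingency table (a = 50, b = 25, c = 25, d = 900) and derives the marginals from them. Using base-10 logarithms is an assumption: with that choice the result is about 45.04, close to the value reported on the slide. Dunning's statistic is more commonly written with natural logarithms and a factor of 2, which changes the scale but not the ranking of word pairs.

```python
import math


def log_likelihood_ratio(a, b, c, d, log=math.log10):
    """Sum of x*log(x) terms over the 2x2 table cells and marginals."""
    e, h = a + b, c + d  # row totals
    f, g = a + c, b + d  # column totals
    n = a + b + c + d

    def xlogx(x):
        # By convention 0 * log(0) counts as 0.
        return x * log(x) if x > 0 else 0.0

    return (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d) + xlogx(n)
            - xlogx(e) - xlogx(f) - xlogx(g) - xlogx(h))


llr_drink_water = log_likelihood_ratio(50, 25, 25, 900)
print(round(llr_drink_water, 2))
```

When the two words are independent (each cell equals the product of its marginals divided by N) the terms cancel and the score is 0.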

Page 43: Bilingual terminology mining

43

Context vector comparison

● Compute context vectors for words in the source and target corpora

[Figure: English context vector for « drink » (collocates: water, beer, mouth, glass, ...) next to the Thai context vector for « ดื่ม » (collocates: น้ำ, เบียร์, ปาก, แก้ว, ...)]

● How to compare word contexts in different languages ?

Rapp 1995 ; Fung 1997

Page 44: Bilingual terminology mining

44

Context vector comparison

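● Use a seed lexicon to map collocates

[Figure: the collocates of the Thai vector for « ดื่ม » (น้ำ, เบียร์, ปาก, แก้ว, ...) are mapped onto the collocates of the English vector for « drink » (water, beer, mouth, glass, ...) through a Thai-English seed lexicon]

Rapp 1995 ; Fung 1997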

Page 45: Bilingual terminology mining

45

Context vector comparison

● Measuring the context similarity of words a and b
● = measuring the cosine of the angle between the vector of a and the vector of b

$$\cos(a,b) = \frac{\sum_{c \in a \cup b} w(c,a) \cdot w(c,b)}{\sqrt{\sum_{c \in a} w(c,a)^2 \cdot \sum_{c \in b} w(c,b)^2}}$$

c ∈ x = collocate in the vector of x
w(c,x) = weight of association of collocate c with head x

● Select the top 1, 10 or 20 closest words as candidate translations

Rapp 1995 ; Fung 1997
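The comparison step can be sketched as follows: the source vector is first translated collocate by collocate through the seed lexicon, then ranked against every target vector by cosine. All vectors, weights and lexicon entries below are invented for the illustration, and the function names are hypothetical.

```python
import math


def translate_vector(vector, seed_lexicon):
    """Map source collocates to target collocates; unknown collocates are dropped."""
    return {seed_lexicon[c]: w for c, w in vector.items() if c in seed_lexicon}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def rank_candidates(source_vector, target_vectors, seed_lexicon, top_n=10):
    """Return the top_n target words closest to the translated source vector."""
    translated = translate_vector(source_vector, seed_lexicon)
    scored = sorted(
        ((cosine(translated, vec), word) for word, vec in target_vectors.items()),
        reverse=True,
    )
    return [word for _, word in scored[:top_n]]


# Toy Thai-English data with made-up association weights.
seed_lexicon = {"น้ำ": "water", "เบียร์": "beer", "แก้ว": "glass"}
source_vector = {"น้ำ": 3.0, "เบียร์": 2.0, "แก้ว": 1.0}  # context of ดื่ม
target_vectors = {
    "drink": {"water": 3.0, "beer": 2.0, "glass": 2.0, "mouth": 1.0},
    "eat": {"rice": 3.0, "mouth": 2.0, "water": 1.0},
}
ranking = rank_candidates(source_vector, target_vectors, seed_lexicon)
print(ranking)
```

Collocates missing from the seed lexicon are simply lost in translation, which is one reason why lexicon coverage matters so much for this approach.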

Page 46: Bilingual terminology mining

46

Contextual approach : improvements

● Using syntactic collocates
● Improving the dictionary with cognates, transliterations, other dictionaries
● Give more weight to « anchor words »
  ● cognates, transliterations
  ● frequent, monosemous words
● Filter with part of speech
● Favor reciprocal translations

[Figure: SOURCE words a, b, c, d linked to TARGET words a', b', c', d']

Chiao and Zweigenbaum, 2002
Sadat et al., 2003
Gamallo and Campos, 2005
Koehn and Knight, 2002
Prochasson, 2010

Page 47: Bilingual terminology mining

47

Variant to direct translation of vector

● « Interlingual » translation
● Translate the n closest words instead of the context vector
● Seed lexicon: some mappings between source and target words

[Figure: SOURCE and TARGET word spaces connected by the seed lexicon]

Déjean and Gaussier, 2002

Page 48: Bilingual terminology mining

48

Variant to direct translation of vector

● To translate term T:
  ● Find its n closest words
  ● these closest words are in the lexicon

[Figure: SOURCE and TARGET word spaces connected by the seed lexicon]

Déjean and Gaussier, 2002

Page 49: Bilingual terminology mining

49

Variant to direct translation of vector

● Find the target term which is the closest to the n closest words

[Figure: SOURCE and TARGET word spaces connected by the seed lexicon]

Déjean and Gaussier, 2002

Page 50: Bilingual terminology mining

50

Variant to direct translation of vector

● « Interlingual » approach
● Translate the closest words instead of the direct context

[Figure: SOURCE and TARGET word spaces]

Déjean and Gaussier, 2002

Page 51: Bilingual terminology mining

51

Adaptation to multi-word terms

● Context vector:
  ● Union of the vectors of each word of the term

Morin et al., 2004
Morin and Daille, 2009

[Figure: the context vector of « energy » (strong, beer, ..., glass) and the context vector of « drink » (..., beer, mouth, glass) are merged into the context vector of « energy drink » (strong, beer, mouth, glass)]

Page 52: Bilingual terminology mining

52

Evaluation

Unit type           Corpus                         Score
Single word units   big, general language corpus   80%
Multi-word units    small, specialized corpus      60%
Multi-word terms    small, specialized corpus      42%

● big = hundreds of millions of words
● small = 100 thousand to one million words

Morin and Daille, 2010

● Precision on Top N candidates
  ● 50% on Top 20
  ● i.e. the correct translation is among the 20 best candidates for 50% of source terms
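Precision on Top N can be computed as the share of source terms whose reference translation appears among the first N ranked candidates. The function name and the toy French data below are purely illustrative.

```python
def top_n_precision(ranked, gold, n=20):
    """Share of source terms whose gold translation is in the top n candidates.

    `ranked` maps each source term to its candidate list (best first),
    `gold` maps each source term to its reference translation.
    """
    hits = sum(1 for term, cands in ranked.items() if gold.get(term) in cands[:n])
    return hits / len(ranked)


# Hypothetical English-to-French candidates.
ranked = {
    "cancer": ["tumeur", "cancer"],
    "gene": ["protéine", "cellule"],
}
gold = {"cancer": "cancer", "gene": "gène"}
precision_top20 = top_n_precision(ranked, gold, n=20)
precision_top1 = top_n_precision(ranked, gold, n=1)
print(precision_top20, precision_top1)
```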

Page 53: Bilingual terminology mining

53

Why is it so difficult ?

● the translation might not be present
● the target term has not been extracted
● polysemous words: non-discriminating, fuzzy vector
● low-frequency words: non-significant vector
● the translation has a different usage in the target language
● big search space: all words of the target corpus

→ cannot be fully automatic
→ semi-supervised term alignment

Page 54: Bilingual terminology mining

54

Thank you

ed(a)lingua-et-machina.com

Franco-Thai Workshop 2010
4th intensive summer school on Natural Language Processing