Bilingual terminology mining
DESCRIPTION
Material of the 4th Intensive Summer School and collaborative workshop on Natural Language Processing (NAIST Franco-Thai Workshop 2010), Bangkok, Thailand.

TRANSCRIPT
1
Bilingual Terminology Mining
Estelle Delpech
30th November, 2010
4th intensive summer school on Natural Language Processing
2
About me
● Estelle Delpech
● Research engineer at Lingua et Machina, France
  ● CAT tools provider
  ● ed(at)lingua-et-machina(dot)com
  ● www.lingua-et-machina.com
● Ph.D. candidate at LINA, France
  ● TALN team: specializes in NLP
  ● estelle.delpech(at)univ-nantes(dot)fr
3
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
4
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
5
What is a term?
● Classical definition:
  ● "unequivocal expression of a concept within a technical domain"
  ● Traces back to Eugen Wüster's 1930 "General Theory of Terminology"
  ● Specialized language is / should be unambiguous
[Figure: Ogden's semiotic triangle linking term, concept, and referent]
6
What is a term?
● Classical terminology was challenged in the 1990s by:
  ● sociolinguistics
  ● corpus-based linguistics
  ● computational terminology
● Observing terms in texts shows:
  ● there is variation and polysemy
  ● concepts evolve over time
  ● no clear-cut border between specialized and general languages
7
What is a term?
● The definition of "term" depends on the application / audience of the terminology
● Domain expert:
  ● unit of knowledge
● Information retrieval:
  ● descriptors for indexing
● Translation:
  ● word or phrase that:
    ● is not part of general language
    ● translates differently in a particular domain
  ● can be:
    ● noun, adjective, verb
    ● noun phrase, verb phrase, etc.
8
What is a terminology?
● Set of terms + terminological records
● Terminological record:
  ● part of speech
  ● frequency
  ● variants
  ● contexts
● Relations between terms / concepts:
  ● hypernymy: cat is a sort of animal
  ● meronymy: head is part of body
● Bilingual terminology:
  ● translation relations
9
[Screenshot: TERMIUM Plus, the Government of Canada's terminology bank]
http://www.termiumplus.gc.ca/
10
Where do you find terms?
● In specialized texts:
  ● research papers on breast cancer
  ● plane crash reports
● Corpus building:
  ● important to gather texts from a well-defined domain / theme
11
Bilingual terminology mining (1)
[Diagram: specialized texts → term extraction (data mining) → terms → term alignment → bilingual terminology → terminology management software]
12
Bilingual terminology mining (2)
[Diagram: specialized texts → synchronized term extraction and alignment → bilingual terminology → terminology management software]
13
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
Term extraction: a semi-supervised process
● The notion of term is "slippery"
● The same lexical unit may or may not be considered a term depending on:
  ● audience
  ● domain
  ● application
● Term extractors extract candidate terms that are:
  ● frequent in texts of a given domain
    ● HER2 gene
  ● term-like, i.e. well-formed phrases
    ● human cell lines
  ● groups of words that frequently occur together
    ● to compile a program
L'Homme, 2004
Term extraction: a semi-supervised, lexico-semantic process
[Diagram: specialized texts → term extractor → candidate terms → manual selection → terms → terminology; the terminology in turn feeds automatic indexing, linking texts, terms and concepts]
Termhood clues (1): frequency
● A term occurs frequently in specialized texts
  ● the higher, the better?
● Comparison with general language:
  ● Does the term occur more frequently than expected from general language?
  ● Compute significance tests:
    ● e.g., the χ² (chi-square) test
L'Homme, 2004
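As an illustration (not from the slides), here is a minimal sketch of such a test: a 2x2 chi-square comparing a candidate's frequency in the specialized corpus against a general-language corpus. All counts are assumed toy values.

```python
# Minimal sketch, assuming toy counts: chi-square test of whether a
# candidate term is over-represented in a specialized corpus relative
# to a general-language corpus.
def chi_square(term_spec, size_spec, term_gen, size_gen):
    """2x2 chi-square: term vs. other words, specialized vs. general corpus."""
    a, b = term_spec, size_spec - term_spec   # specialized corpus row
    c, d = term_gen, size_gen - term_gen      # general corpus row
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Assumed counts: "HER2" appears 120 times in 100k specialized words
# but only 3 times in 1M general-language words.
print(chi_square(120, 100_000, 3, 1_000_000))  # large value -> good candidate
```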
17
Termhood clues (2): form
● A term is a well-formed phrase
  ● ...HER2/neu oncogenes are members of...
● Match morpho-syntactic patterns
  ● e.g., NOUN + NOUN
● Many patterns:
  ● NOUN PREP DET NOUN
    ● alternation of the gene
  ● NOUN PREP NOUN COORD ADJ NOUN
    ● susceptibility to breast and ovarian cancer
  ● NOUN NOUN NOUN NOUN NOUN
    ● human breast cancer cell lines
Termhood clues (2): form
● Preprocessing:
  ● tokenization
  ● lemmatization
  ● POS tagging

Example: … HER-2/neu oncogenes are members of …
  tokens: HER-2/neu | oncogenes | are | members | of
  POS:    NOUN | NOUN | VERB | NOUN | PREP
  lemmas: HER-2/neu | oncogene | be | member | of
Identification of Syntactic Patterns
● Patterns expressed as regular expressions / finite-state automata
[Figure: finite-state automaton with states START, NOUN, PREP implementing the pattern NOUN (PREP? NOUN)?]
● NOUN: gene
● NOUN NOUN: HER2 gene
● NOUN PREP NOUN: member of family
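One simple way to implement this matching (a sketch, not the presenter's code) is to serialize the POS tags into a string and apply an ordinary regular expression. The input below is the lemmatized example from the preprocessing slide, extended with the extra noun family (an assumption) so that the NOUN PREP NOUN branch also fires.

```python
import re

# Lemmatized, POS-tagged toy sentence (the slide's example plus "family").
tagged = [("HER-2/neu", "NOUN"), ("oncogene", "NOUN"), ("be", "VERB"),
          ("member", "NOUN"), ("of", "PREP"), ("family", "NOUN")]

# Write the automaton NOUN (PREP? NOUN)? as a plain regular expression
# over the space-separated tag sequence.
tags = " ".join(tag for _, tag in tagged)
pattern = re.compile(r"NOUN( (PREP )?NOUN)?")

for m in pattern.finditer(tags):
    start = tags[:m.start()].count(" ")   # token index where the match begins
    length = m.group(0).count(" ") + 1    # number of tokens matched
    print([w for w, _ in tagged[start:start + length]])
# -> ['HER-2/neu', 'oncogene']
# -> ['member', 'of', 'family']
```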
Termhood clues (3): word association
● Significant cooccurrences are good clues for termhood:
  ● … breast cancer …
  ● … breast remains …
  ● … alternative cancer …
● Must take into account:
  ● the number of times the two words cooccur
  ● the number of times word A occurs
  ● the number of times word B occurs
Measuring cooccurrence significance
● Mutual information:

  MI(a,b) = \log_2 \frac{P(a,b)}{P(a) \cdot P(b)}

  P(a,b) = \mathrm{nbocc}(a,b) / N
  P(a) = \mathrm{nbocc}(a) / N
  N = total number of words in the corpus

● Remarkable attraction between invasive and carcinoma despite the relatively low number of cooccurrences
Church and Hanks, 1990
L'Homme, 2004
invasive carcinoma: 20 cooccurrences
invasive: 30 occurrences
carcinoma: 20 occurrences
MI = 9.7

cancer means: 50 cooccurrences
cancer: 800 occurrences
means: 800 occurrences
MI = 1.69
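A minimal sketch of this computation follows. The slide does not give the corpus size N, so the value below is an assumption; the exact MI values therefore differ from the slide's 9.7 and 1.69, but the high/low contrast is preserved.

```python
import math

def mutual_information(n_ab, n_a, n_b, n_total):
    """Pointwise mutual information: log2( P(a,b) / (P(a) * P(b)) )."""
    p_ab = n_ab / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return math.log2(p_ab / (p_a * p_b))

N = 1_000_000  # assumed corpus size (not given on the slide)
print(mutual_information(20, 30, 20, N))    # invasive / carcinoma: high MI
print(mutual_information(50, 800, 800, N))  # cancer / means: much lower MI
```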
22
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
23
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
  ● in parallel corpora
  ● in comparable corpora
24
Parallel and comparable corpora
● Parallel corpora
  ● source text and target texts are translations
  ● reduce the search space little by little:
    ● first sentences
    ● then terms
● Comparable corpora
  ● not translations, but very similar in topic
  ● good proportion of term translations
  ● search space: all terms of the target corpus
25
Sentence alignment (1)
● Gale and Church (1993)'s hypothesis:
  ● translated sentences have roughly the same length
  ● the probability P(S,T) that sentence S translates into T is based on the length difference
● Improvement: use a seed lexicon
  ● the probability P(S,T) is then based on the number of words in common
Gale and Church, 1993
26
Sentence alignment (2)
● Compute probabilities for all pairs (S,T)
● Build a matrix where M(i,j) contains the probability that sentence i translates to sentence j
Gale and Church, 1993

      0     1     2     ...   n
0     0.89  0.56  0.2   ...   ...
1     0.45  0.9   0.1   ...   ...
2     ...   0.23  0.9   0.3   ...
...   ...   ...   0.44  0.76  ...
m     ...   ...   ...   ...   0.88
27
Sentence alignment (3)
● Use dynamic programming to find the best "path", i.e. the best sequence of alignments
Gale and Church, 1993

      0     1     2     ...   n
0     0.89  0.56  0.2   ...   ...
1     0.45  0.9   0.1   ...   ...
2     ...   0.23  0.9   0.3   ...
...   ...   ...   0.44  0.76  ...
m     ...   ...   ...   ...   0.88
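A sketch of the dynamic-programming search, under simplifying assumptions: the score is a crude length-ratio penalty rather than Gale and Church's probabilistic cost, and only 1-1 matches and one-sided skips are allowed (the original also handles 2-1, 1-2 and 2-2 alignments).

```python
def length_score(ls, lt):
    # Crude stand-in for P(S,T): penalize the length difference.
    return -abs(ls - lt) / max(ls, lt, 1)

def align(src_lens, tgt_lens, skip_penalty=-1.0):
    """Best path through the score matrix via dynamic programming."""
    n, m = len(src_lens), len(tgt_lens)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # align sentence i-1 with sentence j-1
                s = best[i - 1][j - 1] + length_score(src_lens[i - 1], tgt_lens[j - 1])
                if s > best[i][j]:
                    best[i][j], back[i][j] = s, (i - 1, j - 1)
            if i > 0:             # leave a source sentence unaligned
                s = best[i - 1][j] + skip_penalty
                if s > best[i][j]:
                    best[i][j], back[i][j] = s, (i - 1, j)
            if j > 0:             # leave a target sentence unaligned
                s = best[i][j - 1] + skip_penalty
                if s > best[i][j]:
                    best[i][j], back[i][j] = s, (i, j - 1)
    pairs, ij = [], (n, m)        # trace the best path back to (0, 0)
    while ij != (0, 0):
        pi, pj = back[ij[0]][ij[1]]
        if (pi, pj) == (ij[0] - 1, ij[1] - 1):
            pairs.append((pi, pj))
        ij = (pi, pj)
    return pairs[::-1]

# Sentence lengths in words (toy values).
print(align([20, 35, 12], [22, 33, 14]))  # -> [(0, 0), (1, 1), (2, 2)]
```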
28
Sub-sentential alignment: AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● AnyMalign is a sub-sentential aligner
  ● aligns words and groups of words to build translation tables for MT
● Aligned groups of words:
  ● behave more or less like statistical collocations
  ● it is possible to find term patterns in these groups of words
29
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● The algorithm is based on "perfect alignments":
  ● words or groups of words that occur in exactly the same aligned sentences
Example corpus:
  a d ↔ A D
  a e ↔ A DD
  b ↔ B
  b ↔ C
Here a ↔ A is a perfect alignment.
30
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● How to get more "perfect alignments"?
  ● with smaller corpora
● How to get smaller corpora?
  ● randomly select subcorpora from your corpora
Example: two random subcorpora drawn from the toy corpus above yield new perfect alignments:
  subcorpus 1: b ↔ B
  subcorpus 2: a ↔ A
31
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● The complements of perfect alignments are likely to be good alignments too:
  ● perfect alignment: a ↔ A
  ● complements: d ↔ D, e ↔ DD
    (from the sentence pairs a d ↔ A D and a e ↔ A DD)
32
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● Process: iteratively extract random samples of random size from your corpora
● Extract "perfect alignments" and their complements
● The same alignment can occur several times
● Count, for each alignment, the number of times it occurs
33
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● Output:
  ● alignments sorted by descending number of occurrences
  ● alignment probability:

  P(S \mid T) = \frac{C(S,T)}{C(T)}

  S = source group of words
  T = target group of words
  C(S,T) = number of times S was aligned with T
  C(T) = number of times T appears in an alignment
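A toy sketch of the sampling loop, simplified to single words (real AnyMalign also extracts word groups and the complements of perfect alignments):

```python
import random
from collections import Counter, defaultdict

def perfect_alignments(bitext):
    """Word pairs (one per language) occurring in exactly the same sentence pairs."""
    where_src, where_tgt = defaultdict(set), defaultdict(set)
    for i, (src, tgt) in enumerate(bitext):
        for w in src.split():
            where_src[w].add(i)
        for w in tgt.split():
            where_tgt[w].add(i)
    return {(s, t) for s in where_src for t in where_tgt
            if where_src[s] == where_tgt[t]}

def anymalign_like(bitext, n_samples=1000):
    """Count perfect alignments over many random subcorpora of random size."""
    counts = Counter()
    for _ in range(n_samples):
        size = random.randint(1, len(bitext))
        counts.update(perfect_alignments(random.sample(bitext, size)))
    return counts.most_common()

# The toy corpus from the earlier slides.
bitext = [("a d", "A D"), ("a e", "A DD"), ("b", "B"), ("b", "C")]
print(anymalign_like(bitext)[:5])  # frequent pairs such as ('a', 'A') rank first
```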
34
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
Advantages:
● can perform alignment with more than 2 languages at the same time
  ● 1 language → statistical collocations
● extracts and aligns non-contiguous sequences of words
  ● to give something up
  ● to let someone down
● no a priori expectations on terms
  ● sometimes a term in the source language is not translated by a term
  ● terms = what you can align
35
AnyMalign (Lardilleux, 2010)
Lardilleux et al., 2010
● Word groups are not grammatical phrases; the aligner outputs:
  ● that sample sentences and
  ● exchange format fitted for the
  but not:
  ● sample sentences
  ● exchange format
● Solutions:
  ● find term patterns
  ● use heuristics
  ● trim stop words
36
Presentation outline
● About terms, terminology, terminology mining
● Term Extraction
● Term Alignment
  ● in parallel corpora
  ● in comparable corpora
37
Advantages of comparable corpora
● More available:
  ● new languages
  ● new language pairs
  ● new topics / domains
● Less expensive to build
● More natural:
  ● the data was produced spontaneously
  ● no influence from a source text
38
Contextual approach
● Based on distributional linguistics (Z. Harris)
● Words with similar meaning appear in similar contexts
● If source and target words have similar contexts, they might be translations
● Compute contexts for each source and target word
● Compare contexts
● Find the most similar contexts
39
Contextual approach
● Representation of the context of a given word as a vector:
  ● head word + collocates
[Figure: context vector of the head word drink, with collocates water, beer, mouth, glass, ...]
● The vector associates the "head" word with its most frequent collocates
● + some indication of the strength of association between the head word and its collocates
40
Building the context vector for "drink"
● Collocates: words occurring at a distance of at most n words from the head

  is variety of reasons to drink plenty of water each day
  simple as a glass of drinking water be the key to the
  popular in Japan today to drink water from glass after waking

● (drink, water) = 3
● (drink, glass) = 2
● (drink, Japan) = 1
● (drink, reason) = 1
● (drink, plenty) = 1
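A minimal sketch of this counting step, with a symmetric window of n tokens around each occurrence of the head word:

```python
from collections import Counter

def context_vector(head, tokens, window=3):
    """Count collocates within +/- window tokens of each occurrence of head."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == head:
            lo = max(0, i - window)
            vec.update(tokens[lo:i] + tokens[i + 1:i + 1 + window])
    return vec

# One of the slide's example contexts.
corpus = "variety of reasons to drink plenty of water each day".split()
print(context_vector("drink", corpus))
# Counter({'of': 2, 'reasons': 1, 'to': 1, 'plenty': 1, 'water': 1})
```

In practice the counts are summed over the whole corpus and then normalized, as shown on the next slides.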
41
Normalized cooccurrence frequency
● Normalization: use a measure like MI or the log-likelihood ratio to counteract the influence of high-frequency words
● Example (log-likelihood ratio):
  ● 1000 cooccurrences in the corpus
  ● (drink, x) = 75 cooc.
  ● (water, y) = 75 cooc.
  ● (drink, water) = 50 cooc.

            water   ¬water
  drink     50      25      75
  ¬drink    25      900     925
            75      925     1000

Dunning, 1993
42
Log-likelihood ratio
● Contingency table:

            water   ¬water
  drink     a       b       e
  ¬drink    c       d       h
            f       g       N

● Formula:

  \mathrm{llr}(water, drink) = a \log a + b \log b + c \log c + d \log d + N \log N - e \log e - f \log f - g \log g - h \log h

● log-likelihood ratio(drink, water) = 45.05
Dunning, 1993
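A sketch of this computation. With base-10 logarithms it reproduces the slide's 45.05 on the contingency table of the previous slide; other log bases rescale the value but preserve the ranking.

```python
import math

def xlx(x):
    # x * log10(x), with the usual convention 0 * log 0 = 0
    return x * math.log10(x) if x > 0 else 0.0

def llr(a, b, c, d):
    """Log-likelihood ratio from a 2x2 contingency table, in the slide's form."""
    e, h = a + b, c + d   # row sums
    f, g = a + c, b + d   # column sums
    n = a + b + c + d
    return (xlx(a) + xlx(b) + xlx(c) + xlx(d) + xlx(n)
            - xlx(e) - xlx(f) - xlx(g) - xlx(h))

print(round(llr(50, 25, 25, 900), 2))  # -> 45.05, matching the slide
```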
43
Context vector comparison
● Compute context vectors for words in the source and the target corpus
● How to compare word contexts across different languages?
[Figure: context vector of English drink (water, beer, mouth, glass, ...) next to the context vector of Thai ดื่ม (น้ำ, เบียร์, ปาก, แก้ว, ...)]
Rapp, 1995; Fung, 1997
44
Context vector comparison
● Use a seed lexicon to map collocates
[Figure: a Thai-English seed lexicon maps the collocates of drink (water, beer, mouth, glass, ...) onto the collocates of ดื่ม (น้ำ, เบียร์, ปาก, แก้ว, ...)]
Rapp, 1995; Fung, 1997
45
Context vector comparison
● Measuring the context similarity of words a and b
  ● = measuring the cosine of the angle between the vector of a and the vector of b

  \cos(a,b) = \frac{\sum_{c \in a \cup b} w(c,a) \cdot w(c,b)}{\sqrt{\sum_{c \in a} w(c,a)^2} \cdot \sqrt{\sum_{c \in b} w(c,b)^2}}

  c ∈ x: collocate in the vector of x
  w(c,x): weight of association of collocate c with head x

● Select the top 1, 10 or 20 closest words as candidate translations
Rapp, 1995; Fung, 1997
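A sketch combining the two previous steps: translate the source vector's collocates through the seed lexicon, then compare with the cosine. Weights and lexicon entries below are assumed toy values.

```python
import math

def translate_vector(vec, lexicon):
    """Map source-language collocates into the target language;
    collocates missing from the seed lexicon are simply dropped."""
    return {lexicon[c]: w for c, w in vec.items() if c in lexicon}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy vectors for English "drink" and Thai ดื่ม, plus a tiny seed lexicon.
drink_en = {"water": 0.9, "beer": 0.7, "glass": 0.5}
lexicon = {"water": "น้ำ", "beer": "เบียร์", "glass": "แก้ว"}
drink_th = {"น้ำ": 0.8, "เบียร์": 0.6, "แก้ว": 0.4}
print(cosine(translate_vector(drink_en, lexicon), drink_th))  # close to 1.0
```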
46
Contextual approach: improvements
● Use syntactic collocates
● Improve the dictionary with cognates, transliterations, other dictionaries
● Give more weight to "anchor words":
  ● cognates, transliterations
  ● frequent, monosemous words
● Filter with part-of-speech
● Favor reciprocal translations
[Figure: reciprocal translation between source words a, b, c, d and target words a', b', c', d']
Chiao and Zweigenbaum, 2002; Sadat et al., 2003; Gamallo and Campos, 2005; Koehn and Knight, 2002; Prochasson, 2010
47
Variant to direct translation of the vector
● "Interlingual" translation
● Translate the n closest words instead of the context vector
● Seed lexicon: some mappings between source and target words
[Figure: source and target vector spaces linked by the seed lexicon]
Déjean and Gaussier, 2002
48
Variant to direct translation of the vector
● To translate term T:
  ● find its n closest words
  ● these closest words are in the seed lexicon
[Figure: the n closest source words are mapped into the target space through the seed lexicon]
Déjean and Gaussier, 2002
49
Variant to direct translation of the vector
● Find the target term which is closest to the n closest words
[Figure: the candidate target term is the one closest to the mapped words]
Déjean and Gaussier, 2002
50
Variant to direct translation of the vector
● "Interlingual" approach
● Translate the closest words instead of the context directly
[Figure: summary of the interlingual mapping between source and target spaces]
Déjean and Gaussier, 2002
51
Adaptation to multi-word terms
● Context vector of a multi-word term:
  ● union of the vectors of each word of the term
[Figure: the context vectors of energy (strong, beer, ..., glass) and drink (beer, mouth, glass, water, ...) are united into the vector of energy drink (strong, beer, mouth, glass, ...)]
Morin et al., 2004
Morin and Daille, 2009
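A minimal sketch of the vector union, here realized as summed collocate counts (toy values; the cited papers' exact combination may differ):

```python
from collections import Counter

# Toy context vectors of the component words (assumed counts).
energy = Counter({"strong": 5, "glass": 1})
drink = Counter({"water": 3, "beer": 2, "glass": 2, "mouth": 1})

# The multi-word term inherits the union of its components' collocates.
energy_drink = energy + drink
print(energy_drink)
# Counter({'strong': 5, 'glass': 3, 'water': 3, 'beer': 2, 'mouth': 1})
```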
52
Evaluation
unit type           corpus                          precision
single-word units   big, general-language corpus    80%
multi-word units    small, specialized corpus       60%
multi-word terms    small, specialized corpus       42%

● big = hundreds of millions of words
● small = 100 thousand to one million words
Morin and Daille, 2010
● Precision on the Top N candidates
  ● e.g. 50% on Top 20: the correct translation is in the Top 20 best candidates for 50% of the source terms
53
Why is it so difficult?
● the translation might not be present in the target corpus
● the target term has not been extracted
● polysemous words: undiscriminating, fuzzy vectors
● low-frequency words: unreliable vectors
● the translation has a different usage in the target language
● big search space: all the words of the target corpus
→ cannot be fully automatic
→ semi-supervised term alignment
54
Thank you
ed(a)lingua-et-machina.com
Franco-Thai Workshop 2010
4th intensive summer school on Natural Language Processing