A hybrid approach to compiling bilingual dictionaries of medical terms from parallel corpora
Georgios Kontonatsios, 14th October 2014
1
A hybrid approach to compiling bilingual dictionaries of medical terms from parallel corpora
Georgios Kontonatsios
14th October 2014
2
Overview
• Background: Parallel Corpus, Problem, Motivation
• Methods: Random Forest Classifier, Statistical Phrase Alignment, Hybrid Approach
• Experiments: English-Greek & English-Romanian, Error Analysis
• Conclusions: Discussion, Future Work
3
Background: Parallel Corpus
“A parallel corpus is a collection of documents in a source language paired with their direct translation in a target language”
English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού
4
Background: Parallel Corpus
English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου του μαστού
1) Useful for SMT
2) Relatively scarce resources
• Koehn (2005) trained 110 SMT systems (11 languages) in three weeks
• Available in finance, law, medicine, etc.
3) Excellent resources for mining bilingual terminologies
• Exact translations => no missing translations of terms
• Sentence aligned => limited search space of candidate translations
• Same size => term frequencies are comparable
5
Background: Problem
Parallel Corpus → Term Alignment → Dictionary of multi-word terms (MWT)
English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού
Aligned term pair: metastatic breast cancer → μεταστατικού καρκίνου μαστού
7
Background: Biomedical Domain
Existing resources in the biomedical domain remain incomplete
UMLS
• A multilingual terminological resource (more than 20 languages)
• Indexes ~7.6M English terms
[Chart: % coverage of English UMLS terms, by language (values in chart order)]
Czech 1.72%
Dutch 2.88%
French 2.59%
German 2.43%
Hungarian 1.21%
Italian 1.79%
Japanese 3.26%
Polish 0.55%
Portuguese 2.06%
Russian 1.40%
Spanish 16.44%
~6.3M missing translations
Goal: expand UMLS for English-Greek and English-Romanian
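The "~6.3M missing translations" figure appears consistent with the best-covered language on the chart. The following is a rough check, assuming the coverage values pair with the languages in the order they appear on the slide and that the missing count is simply the uncovered share of the ~7.6M English terms. Both are assumptions, not statements from the talk.

```python
# Figures read off the slide: ~7.6M English terms and per-language coverage.
ENGLISH_TERMS = 7_600_000

# Assumption: values are paired with languages in chart order.
coverage_percent = {
    "Czech": 1.72, "Dutch": 2.88, "French": 2.59, "German": 2.43,
    "Hungarian": 1.21, "Italian": 1.79, "Japanese": 3.26, "Polish": 0.55,
    "Portuguese": 2.06, "Russian": 1.40, "Spanish": 16.44,
}


def missing_translations(total_terms: int, percent_covered: float) -> int:
    """Estimate how many English terms have no translation in a language."""
    return round(total_terms * (1 - percent_covered / 100))


# Even the best-covered language leaves most English terms untranslated.
best_language = max(coverage_percent, key=coverage_percent.get)
best_missing = missing_translations(ENGLISH_TERMS, coverage_percent[best_language])
print(best_language, best_missing)
```

Under these assumptions the best-covered language still lacks roughly 6.3M to 6.4M translations, and every other language on the chart lacks more.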
8
Methodology: Term Alignment Pipeline
Pipeline: Parallel Corpus → MetaMap → Term Alignment → Link to UMLS
English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού
MetaMap: C0278488, Neoplastic Process
Link to UMLS: C0278488, Neoplastic Process
9
Methodology: Term Alignment Algorithms
Random Forest Classifier (EACL 2014, EMNLP 2014)
• Supervised machine learning method
• Exploits internal structure of terms (character n-gram feature representation)
• Requires positive and negative instances for training
• Out-of-domain seed dictionary (i.e. BabelNet)
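A character n-gram feature representation for a candidate term pair might look like the sketch below. The n-gram range (2 to 5), the `#` boundary marker, the `src:`/`tgt:` prefixes and the function names are illustrative assumptions, not details given in the talk. The resulting dictionaries are what a supervised classifier such as a random forest would consume after vectorization.

```python
from collections import Counter


def char_ngrams(term: str, n_min: int = 2, n_max: int = 5) -> Counter:
    """Count character n-grams of a term; '#' marks the term boundaries."""
    padded = f"#{term.lower()}#"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def pair_features(source_term: str, target_term: str) -> dict:
    """Build one sparse feature dict for a (source, target) candidate pair."""
    features = {}
    for gram, count in char_ngrams(source_term).items():
        features[f"src:{gram}"] = count  # n-grams of the source term
    for gram, count in char_ngrams(target_term).items():
        features[f"tgt:{gram}"] = count  # n-grams of the target term
    return features


# Positive training instance built from the term pair shown earlier in the talk.
example = pair_features("metastatic breast cancer", "μεταστατικού καρκίνου μαστού")
```

Negative instances would be built the same way from pairs that are not translations of each other.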
Statistical Phrase Alignment (Koehn et al., 2003)
• Unsupervised approach
• Part of Moses SMT (Koehn et al., 2007), an out-of-the-box solution
• Exploits co-occurrences of source and target terms
• Works well for frequently occurring terms
• Performance decreases for rare terms
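To illustrate why co-occurrence evidence is weaker for rare terms, here is a deliberately simplified sketch that scores candidates by relative co-occurrence frequency over aligned sentence pairs. It is not the Moses phrase extraction procedure. The function name and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence_probabilities(aligned_pairs):
    """Estimate p(target | source) by relative co-occurrence frequency.

    aligned_pairs: iterable of (source_terms, target_candidates), one entry
    per aligned sentence pair, each a collection of strings.
    """
    source_counts = Counter()
    pair_counts = defaultdict(Counter)
    for source_terms, target_candidates in aligned_pairs:
        for s in set(source_terms):
            source_counts[s] += 1
            for t in set(target_candidates):
                pair_counts[s][t] += 1
    return {
        s: {t: c / source_counts[s] for t, c in targets.items()}
        for s, targets in pair_counts.items()
    }


# Hypothetical toy corpus: three aligned sentence pairs.
toy_corpus = [
    ({"breast cancer"}, {"καρκίνου μαστού", "θεραπεία"}),
    ({"breast cancer"}, {"καρκίνου μαστού", "μονοθεραπεία"}),
    ({"monotherapy"}, {"μονοθεραπεία"}),
]
probs = cooccurrence_probabilities(toy_corpus)
```

In the toy corpus, "breast cancer" occurs twice, so the correct candidate separates from the noise. The estimate for "monotherapy" rests on a single sentence pair, which is the situation in which purely statistical alignment becomes unreliable.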
13
Methodology: Hybrid Approach
For a source term s to be translated, RF and SPA each suggest N ranked candidate translations
• RF ranks candidates by classification margin
• SPA ranks candidates by translation probability
Example: type 2 diabetes mellitus
SPA
1) διαβήτη τύπου 2
2) διαβήτη τύπου 2 και καρδιακή
3) σακχαρώδη διαβήτη τύπου 2
RF
1) του σακχαρώδη διαβήτη τύπου 2
2) σακχαρώδη διαβήτη τύπου 2
3) σακχαρώδους διαβήτη τύπου 2
14
Methodology: Hybrid Approach
Example: type 2 diabetes mellitus
SPA
1) διαβήτη τύπου 2
2) διαβήτη τύπου 2 και καρδιακή
3) σακχαρώδη διαβήτη τύπου 2
RF
1) του σακχαρώδη διαβήτη τύπου 2
2) σακχαρώδη διαβήτη τύπου 2
3) σακχαρώδους διαβήτη τύπου 2
Voting
1) σακχαρώδη διαβήτη τύπου 2
Dictionaries containing N candidate translations have a limited number of applications (e.g., SMT)
To enrich existing terminologies, human curators need to post-edit the output of term alignment methods
The objective is to improve the precision of the top-ranked candidate (precision@N=1)
Voting: take the intersection of the RF and SPA candidates and rank it by SPA translation probability
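The voting step can be sketched as a list intersection that preserves SPA order. This assumes each method returns its candidates already sorted best first, so that SPA order reflects translation probability. The function name `vote` and the default cut-off are illustrative.

```python
def vote(spa_ranked, rf_ranked, n=20):
    """Keep candidates proposed by both methods, in SPA rank order."""
    rf_top = set(rf_ranked[:n])
    return [candidate for candidate in spa_ranked[:n] if candidate in rf_top]


# Candidate lists for "type 2 diabetes mellitus" as shown on the slide.
spa = [
    "διαβήτη τύπου 2",
    "διαβήτη τύπου 2 και καρδιακή",
    "σακχαρώδη διαβήτη τύπου 2",
]
rf = [
    "του σακχαρώδη διαβήτη τύπου 2",
    "σακχαρώδη διαβήτη τύπου 2",
    "σακχαρώδους διαβήτη τύπου 2",
]
voted = vote(spa, rf)
```

Only one candidate is proposed by both methods here, so it becomes the top-1 output. When the two lists share nothing, the result is empty, which is one way such a scheme can trade recall for precision.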
15
Experiments: Corpora
EMEA (Tiedemann, 2009), a biomedical parallel corpus from the European Medicines Agency
- 1.5K sentence-aligned documents in 22 languages
- Drug usage guidelines
en-el
- 372K sentences
- 17,907 unique English MWTs
en-ro
- 321K sentences
- 16,625 unique English MWTs
16
Experiments: Evaluation
We randomly sampled 1,000 English MWTs. For each English MWT, we selected the top 20 translation candidates.
Methods evaluated on each language pair (en-el, en-ro): RF, SPA, Voting
18
Experiments: Results
English-Greek dataset
[Figure: precision (y-axis, 0.3 to 1) against # candidate translations per source term (x-axis, 0 to 20) for RF, SPA and RF+SPA]
$$\text{Precision} = \frac{\#\, s \text{ for which the correct translation is among the } N \text{ candidates}}{\#\, s \text{ for which the model suggested at least one candidate}}$$
19
Experiments: Results
English-Romanian dataset
[Figure: precision (y-axis, 0.3 to 1) against # candidate translations per source term (x-axis, 0 to 20) for RF, SPA and RF+SPA]
20
Experiments: Results
English-Greek dataset
[Figure: recall (y-axis, 0.25 to 0.7) against # candidate translations per source term (x-axis, 0 to 20) for RF, SPA and RF+SPA]
$$\text{Recall} = \frac{\#\, s \text{ for which the correct translation is among the } N \text{ candidates}}{\#\, s \; (= 1000)}$$
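The two measures on the results slides share a numerator and differ only in the denominator: precision divides by the source terms for which the model suggested at least one candidate, recall by all sampled source terms (1,000 in the talk). A minimal sketch follows. The function name and the toy data are hypothetical.

```python
def precision_recall_at_n(candidates, gold, n):
    """Compute precision and recall at cut-off n.

    candidates: dict mapping a source term to its ranked candidate list
                (missing or empty when the model suggests nothing)
    gold:       dict mapping every evaluated source term to the set of
                correct translations
    """
    hits = 0      # source terms with a correct translation in the top n
    answered = 0  # source terms with at least one suggested candidate
    for source, correct in gold.items():
        ranked = candidates.get(source, [])
        if ranked:
            answered += 1
        if any(c in correct for c in ranked[:n]):
            hits += 1
    precision = hits / answered if answered else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: four source terms, no candidates suggested for s4.
gold = {"s1": {"t1"}, "s2": {"t2"}, "s3": {"t3"}, "s4": {"t4"}}
candidates = {"s1": ["t1", "x"], "s2": ["x", "t2"], "s3": ["x", "y"]}
p1, r1 = precision_recall_at_n(candidates, gold, 1)
p2, r2 = precision_recall_at_n(candidates, gold, 2)
```

A method that stays silent on many source terms can therefore score high precision and low recall at the same time.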
21
Experiments: Results
English-Romanian dataset
[Figure: recall (y-axis, 0.2 to 0.7) against # candidate translations per source term (x-axis, 0 to 20) for RF, SPA and RF+SPA]
22
Error Analysis
RF
• Partial matches: urea cycle disorder → διαταραχών (disorder) του κύκλου (cycle) της ουρίας (urea)
• Discontinuous translations: metabolic diseases → boli (diseases) ereditare (hereditary) de metabolism (metabolic)
SPA
• Statistically based tool
• Performance largely affected by term frequency
• Top-20 precision measured on terms of varying frequency
23
Error Analysis
English-Greek dataset
[Figure: top-20 precision (y-axis, 0 to 1) of SPA, RF and RF + SPA across frequency ranges [100 200], [50 100], [25 50], [15 25], [10 15], [5 10], [1 5]]
Performance decreases for lower frequency terms
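A per-range breakdown of this kind can be computed by bucketing source terms by corpus frequency. The sketch below reuses the frequency ranges from the chart. The treatment of shared boundaries (lower bound inclusive, upper bound exclusive), the function name and the toy data are assumptions.

```python
FREQUENCY_RANGES = [(100, 200), (50, 100), (25, 50), (15, 25), (10, 15), (5, 10), (1, 5)]


def precision_by_frequency(term_frequency, top20_correct, ranges=FREQUENCY_RANGES):
    """Return top-20 precision per frequency range (None for empty ranges).

    term_frequency: dict mapping a source term to its corpus frequency
    top20_correct:  dict mapping a source term to True when a correct
                    translation is among its top 20 candidates
    """
    result = {}
    for low, high in ranges:
        # Assumption: low <= frequency < high, so each term falls in one range.
        terms = [t for t in top20_correct if low <= term_frequency.get(t, 0) < high]
        if terms:
            result[(low, high)] = sum(top20_correct[t] for t in terms) / len(terms)
        else:
            result[(low, high)] = None
    return result


# Hypothetical toy data.
toy_frequency = {"a": 150, "b": 60, "c": 60, "d": 3}
toy_correct = {"a": True, "b": True, "c": False, "d": False}
by_range = precision_by_frequency(toy_frequency, toy_correct)
```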
24
Error Analysis
English-Romanian dataset
[Figure: top-20 precision (y-axis, 0 to 1) of SPA, RF and RF + SPA across frequency ranges [100 200], [50 100], [25 50], [15 25], [10 15], [5 10], [1 5]]
25
Discussion
Observations:
• Substantially improves top-1 precision of RF and SPA
• Outperforms SPA when translating low-frequency terms
• Low recall
Hybrid approach
• Compilation of bilingual terminologies from parallel corpora
• Enrich UMLS with two under-resourced languages
26
Future Work
Investigate integration of bilingual terminologies with SMT
[Diagram components: Parallel corpus, SPA, Phrase table, LM, SMT, RF]
• Lower top-1 precision
• Poor performance for low-frequency terms
27
Questions?