
Page 1

A hybrid approach to compiling bilingual dictionaries of medical terms from parallel corpora

Georgios Kontonatsios
georgios.kontonatsios@cs.man.ac.uk
14th October 2014

Page 2

Overview

Background
• Parallel Corpus
• Problem
• Motivation

Methods
• Random Forest Classifier
• Statistical Phrase Alignment
• Hybrid Approach

Experiments
• English-Greek & English-Romanian
• Error Analysis

Conclusions
• Discussion
• Future Work

Page 3

Background: Parallel Corpus

“A parallel corpus is a collection of documents in a source language paired with their direct translation in a target language”

English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer

Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού

Page 4

Background: Parallel Corpus

English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer

Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου του μαστού

1) Useful for statistical machine translation (SMT)

2) Relatively scarce resources
• Koehn (2005) trained 110 SMT systems (11 languages) in three weeks.
• Available for finance, law, medicine, etc.

3) Excellent resources for mining bilingual terminologies
• Exact translations => no missing translations of terms
• Sentence aligned => limited search space of candidate translations
• Same size => term frequencies are comparable

Page 5

Background: Problem

Parallel Corpus => Term Alignment => Dictionary of multi-word terms (MWTs)

English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer

Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού

Aligned term pair: metastatic breast cancer = μεταστατικού καρκίνου μαστού

Page 6

Background: Biomedical Domain

Existing resources in the biomedical domain remain incomplete

UMLS
• A multilingual terminological resource (more than 20 languages)
• Indexes ~7.6M English terms

[Figure: bar chart of % coverage of English UMLS by language (y-axis from 0% to 18%). Languages on the x-axis: Czech, Dutch, French, German, Hungarian, Italian, Japanese, Polish, Portuguese, Russian, Spanish. Bar labels in the order extracted: 1.72%, 2.88%, 2.59%, 2.43%, 1.21%, 1.79%, 3.26%, 0.55%, 2.06%, 1.40%, 16.44%.]

~6.3M missing translations

expand UMLS for English-Greek and English-Romanian

Page 7

Methodology: Term Alignment Pipeline

Parallel Corpus => MetaMap => Term Alignment => Link to UMLS

English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer

Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού

MetaMap: C0278488, Neoplastic Process

Link to UMLS: C0278488, Neoplastic Process

Page 8

Methodology: Term Alignment Algorithms

Random Forest Classifier (EACL 2014, EMNLP 2014)
• Supervised machine learning method
• Exploits internal structure of terms (character n-gram feature representation)
• Requires positive and negative instances for training
• Out-of-domain seed dictionary (i.e., BabelNet)
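The character n-gram representation mentioned for the RF classifier can be sketched roughly as follows. This is an illustration only, not the authors' feature extractor: the function names `char_ngrams` and `pair_features`, the `#` word-boundary marker, the n-gram sizes (2 to 3), and the `src:`/`tgt:` prefixes are all assumptions introduced here.

```python
from collections import Counter


def char_ngrams(term, n_min=2, n_max=3):
    """Count character n-grams of each word in a term ('#' marks word boundaries)."""
    counts = Counter()
    for word in term.lower().split():
        padded = f"#{word}#"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts


def pair_features(source_term, target_term):
    """Merge source and target n-gram counts into one sparse feature dict."""
    features = {}
    for gram, count in char_ngrams(source_term).items():
        features["src:" + gram] = count
    for gram, count in char_ngrams(target_term).items():
        features["tgt:" + gram] = count
    return features


# Term pair taken from the slides; it would be one positive training instance.
features = pair_features("metastatic breast cancer", "μεταστατικού καρκίνου μαστού")
```

A dictionary like this could then be vectorised and given to a random forest implementation. Positive instances would come from the seed dictionary; how negative instances are built is not stated in the slides.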

Statistical Phrase Alignment (Koehn et al., 2003)
• Unsupervised approach
• Part of Moses SMT (Koehn et al., 2007) (out-of-the-box solution)
• Exploits co-occurrences of source and target terms
• Works well for frequently occurring terms
• Performance decreases for rare terms
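To illustrate why co-occurrence evidence is weak for rare terms, here is a small sketch that scores a source/target pair with the Dice coefficient over aligned sentence pairs. This is deliberately not the phrase alignment procedure used in Moses; it is a simpler stand-in technique, and the function name `dice_score` and the substring matching are assumptions made for the example.

```python
def dice_score(source_term, target_term, sentence_pairs):
    """Dice coefficient: 2 * joint count / (source count + target count)."""
    src_count = tgt_count = joint = 0
    for src_sentence, tgt_sentence in sentence_pairs:
        in_src = source_term in src_sentence  # naive substring match
        in_tgt = target_term in tgt_sentence
        src_count += in_src
        tgt_count += in_tgt
        joint += in_src and in_tgt
    if src_count + tgt_count == 0:
        return 0.0
    return 2 * joint / (src_count + tgt_count)


# A one-sentence "corpus" built from the example sentence pair in the slides.
english = "Abraxane monotherapy is indicated for the treatment of metastatic breast cancer"
greek = "η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού"
pairs = [(english, greek)]

right = dice_score("breast", "μαστού", pairs)    # correct translation
wrong = dice_score("breast", "θεραπεία", pairs)  # unrelated word ("treatment")
```

With a single occurrence, the correct and the unrelated candidate both score 1.0, so co-occurrence alone cannot separate them. More occurrences in different contexts are what make the statistics discriminative.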

Page 9

Methodology: Hybrid Approach

For a source term s to be translated, RF and SPA each suggest N ranked candidate translations. RF ranks candidates by classification margin; SPA ranks them by translation probability.

Source term: type 2 diabetes mellitus

SPA
1) διαβήτη τύπου 2
2) διαβήτη τύπου 2 και καρδιακή
3) σακχαρώδη διαβήτη τύπου 2

RF
1) του σακχαρώδη διαβήτη τύπου 2
2) σακχαρώδη διαβήτη τύπου 2
3) σακχαρώδους διαβήτη τύπου 2

Page 10

Methodology: Hybrid Approach

Source term: type 2 diabetes mellitus

SPA
1) διαβήτη τύπου 2
2) διαβήτη τύπου 2 και καρδιακή
3) σακχαρώδη διαβήτη τύπου 2

RF
1) του σακχαρώδη διαβήτη τύπου 2
2) σακχαρώδη διαβήτη τύπου 2
3) σακχαρώδους διαβήτη τύπου 2

Voting
1) σακχαρώδη διαβήτη τύπου 2

Dictionaries containing N candidate translations have a limited number of applications (e.g., SMT)

To enrich existing terminologies, human curators need to post-edit the output of term alignment methods

Objective is to improve the precision of higher ranking candidates (precision@N=1)

Voting takes the intersection of the RF and SPA candidates and ranks them according to the translation probability given by SPA
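The voting step (intersection of the two candidate lists, ordered by SPA translation probability) can be sketched as follows. The function name `vote` and the data layout are assumptions, and the probabilities in the example are invented placeholders; only the candidate strings come from the slides.

```python
def vote(spa_candidates, rf_candidates):
    """Keep candidates proposed by both methods, ordered by SPA probability.

    spa_candidates: list of (translation, probability) pairs from SPA.
    rf_candidates: list of translations from RF (margins are not used here).
    """
    rf_set = set(rf_candidates)
    shared = [(t, p) for t, p in spa_candidates if t in rf_set]
    shared.sort(key=lambda pair: pair[1], reverse=True)
    return [t for t, _ in shared]


# Candidates for "type 2 diabetes mellitus"; probabilities are made up.
spa = [
    ("διαβήτη τύπου 2", 0.5),
    ("διαβήτη τύπου 2 και καρδιακή", 0.3),
    ("σακχαρώδη διαβήτη τύπου 2", 0.2),
]
rf = [
    "του σακχαρώδη διαβήτη τύπου 2",
    "σακχαρώδη διαβήτη τύπου 2",
    "σακχαρώδους διαβήτη τύπου 2",
]
result = vote(spa, rf)
```

Only one candidate appears in both lists, so `result` contains that single translation, matching the voting output on the slide. An empty intersection yields no suggestion at all.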

Page 11

Experiments: Corpora

EMEA (Tiedemann, 2009), a biomedical parallel corpus from the European Medicines Agency
- 1.5K sentence-aligned documents in 22 languages
- Drug usage guidelines

en-el
- 372K sentences
- 17,907 unique English MWTs

en-ro
- 321K sentences
- 16,625 unique English MWTs

Page 12

Experiments: Evaluation

We randomly sampled 1,000 English MWTs. For each English MWT, we selected the top 20 translation candidates.

Methods evaluated on en-el and en-ro: RF, SPA, Voting

Page 13

Experiments: Results

English-Greek dataset

[Figure: precision (y-axis, 0.3 to 1) against # candidate translations per source term (x-axis, 0 to 20) for RF, SPA, and RF+SPA.]

$$\text{Precision}(N) = \frac{\#\{s : \text{the correct translation of } s \text{ is among the } N \text{ candidates}\}}{\#\{s : \text{the model suggested at least one candidate}\}}$$

Page 14

Experiments: Results

English-Romanian dataset

[Figure: precision (y-axis, 0.3 to 1) against # candidate translations per source term (x-axis, 0 to 20) for RF, SPA, and RF+SPA.]

Page 15

Experiments: Results

English-Greek dataset

[Figure: recall (y-axis, 0.25 to 0.7) against # candidate translations per source term (x-axis, 0 to 20) for RF, SPA, and RF+SPA.]

$$\text{Recall}(N) = \frac{\#\{s : \text{the correct translation of } s \text{ is among the } N \text{ candidates}\}}{\#\{s\}}, \qquad \#\{s\} = 1000$$
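The two evaluation measures differ only in their denominator: precision divides by the number of source terms for which the model suggested at least one candidate, while recall divides by all sampled source terms (1,000 in these experiments). A minimal sketch of both, with the function name `precision_recall_at_n` and the dictionary layout assumed for illustration:

```python
def precision_recall_at_n(candidates, gold, n):
    """Compute precision and recall over the top-n candidates.

    candidates: dict mapping a source term to its ranked translations
                (the list may be empty or the term missing).
    gold: dict mapping every evaluated source term to its set of
          correct translations.
    """
    correct = sum(
        1
        for term, refs in gold.items()
        if any(c in refs for c in candidates.get(term, [])[:n])
    )
    # Terms for which the model suggested at least one candidate.
    answered = sum(1 for term in gold if candidates.get(term))
    precision = correct / answered if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

A method that answers few terms but answers them well, as an intersection-based vote tends to, can therefore score high precision and low recall at the same N.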

Page 16

Experiments: Results

English-Romanian dataset

[Figure: recall (y-axis, 0.2 to 0.7) against # candidate translations per source term (x-axis, 0 to 20) for RF, SPA, and RF+SPA.]

Page 17

Error Analysis

RF

Partial matches
urea cycle disorder = διαταραχών (disorder) του κύκλου (cycle) της ουρίας (urea)

Discontinuous translations
metabolic diseases = boli (diseases) ereditare (hereditary) de metabolism (metabolic)

SPA

Statistically-based tool
- Performance largely affected by term frequency
- Top-20 precision on terms having varying frequency

Page 18

Error Analysis

English-Greek dataset

[Figure: top-20 precision (y-axis, 0 to 1) of SPA, RF, and RF + SPA across frequency ranges [100 200], [50 100], [25 50], [15 25], [10 15], [5 10], [1 5].]

Performance decreases for lower frequency terms
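A breakdown of top-20 precision by term frequency can be computed along these lines. The range boundaries are those on the chart axis; the function name `precision_by_frequency`, the input layout, and the choice to treat each range as including its lower bound and excluding its upper bound are assumptions, since the slides do not say how the shared endpoints are assigned.

```python
FREQUENCY_RANGES = [(100, 200), (50, 100), (25, 50), (15, 25), (10, 15), (5, 10), (1, 5)]


def precision_by_frequency(results, ranges=FREQUENCY_RANGES):
    """Group top-20 hits by corpus frequency of the source term.

    results: list of (frequency, hit) pairs, one per source term for which
             the method suggested at least one candidate; hit is True when
             the correct translation is among the top 20.
    Returns a dict mapping each range to its precision, or None if empty.
    """
    summary = {}
    for low, high in ranges:
        # Assumed convention: low <= frequency < high.
        hits = [hit for freq, hit in results if low <= freq < high]
        summary[(low, high)] = sum(hits) / len(hits) if hits else None
    return summary
```

Empty ranges are reported as `None` rather than 0 so that a missing bar is not mistaken for a method that failed on every term.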

Page 19

Error Analysis

English-Romanian dataset

[Figure: top-20 precision (y-axis, 0 to 1) of SPA, RF, and RF + SPA across frequency ranges [100 200], [50 100], [25 50], [15 25], [10 15], [5 10], [1 5].]

Page 20

Discussion

Observations:

• Substantially improves top-1 precision of RF and SPA

• Outperforms SPA when translating low-frequency terms

• Low recall

Hybrid approach

• Compilation of bilingual terminologies from parallel corpora

• Enrich UMLS with two under-resourced languages

Page 21

Future Work

Investigate integration of bilingual terminologies with SMT

[Diagram: components labelled Parallel corpus, SPA, Phrase table, LM, and SMT, with SPA and RF shown as term alignment methods. Annotations: "Lower top-1 precision"; "Poor performance for low-frequency terms".]

Page 22

Questions?