Dealing with Lexicon Acquired from Comparable Corpora
Post-edition and Exchange
Estelle Delpech, Lingua et Machina
Béatrice Daille, U. de Nantes - LINA
Working w/ lexicon acquired from comparable corpora
I. Terminology acquisition from comparable corpora : quick overview
II. A tool for terminology post-edition
III. Data exchange : a TBX variant for automatically acquired lexicons
IV. Future work
Part I
Terminology Acquisition from Comparable Corpora
Terminology acquisition from comparable corpora
Comparable corpora:
“Two corpora, respectively in two languages l1 and l2, are said to be ‘comparable’ if there exists a substantial part of the vocabulary of the corpus in language l1 whose translation can be found in the corpus in language l2.”
(my translation of [Déjean and Gaussier, 2002])
Advantages:
- Availability
- Real usages
Terminology acquisition from comparable corpora
Terminology extraction: a contextual analysis
- Compare contexts of source and target terms
- If contexts are similar, there's a good chance source and target terms are translations of each other, e.g.:
mastectomy : reconstruction, prophylactic, treat, undergo, removal
mastectomie : reconstruction, prophylactique, traiter, subir, ablation
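The contextual comparison above can be sketched in a few lines. This is only an illustration under stated assumptions, not the system presented here: the seed dictionary `SEED_DICT`, the binary context sets, the cosine measure and the context given for "opération" are all made up for the example; only the mastectomy / mastectomie context words come from the slide.

```python
from math import sqrt

# Hypothetical seed dictionary used to carry source context words
# over to the target language (an assumption for this sketch).
SEED_DICT = {
    "reconstruction": "reconstruction",
    "prophylactic": "prophylactique",
    "treat": "traiter",
    "undergo": "subir",
    "removal": "ablation",
}


def translate_context(context, seed_dict):
    """Map source context words to the target language, dropping unknown words."""
    return {seed_dict[word] for word in context if word in seed_dict}


def cosine(a, b):
    """Cosine similarity between two word sets seen as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# Context of the source term "mastectomy" (from the slide).
source_context = {"reconstruction", "prophylactic", "treat", "undergo", "removal"}

# Contexts of target candidates; the one for "opération" is invented.
candidates = {
    "mastectomie": {"reconstruction", "prophylactique", "traiter", "subir", "ablation"},
    "opération": {"subir", "chirurgien", "ablation"},
}

translated = translate_context(source_context, SEED_DICT)
ranking = sorted(
    ((cosine(translated, ctx), term) for term, ctx in candidates.items()),
    reverse=True,
)
for score, term in ranking:
    print(f"{score:.2f} {term}")
```

With these toy contexts, "mastectomie" shares every translated context word with "mastectomy" and is ranked first.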
Terminology acquisition from comparable corpora
Results
- Not as good as acquisition from parallel corpora!
- Fung (1997): 30% accuracy on the Top20 candidates
- Morin et al. (2004): the translation usually ranks 34th among candidates for complex terms
- Outputs one-to-many alignments, e.g. for mastectomy:
  0.92 ablation
  0.89 mastectomie
  0.48 opération
- Evaluation: precision on the TopNBest alignments
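One common reading of "precision on the Top N best alignments" is the share of source terms whose reference translation appears among the N best-scored candidates. The sketch below follows that reading, which is an assumption; the function name `top_n_precision` and the one-entry reference lexicon are hypothetical, while the scores are those of the mastectomy example.

```python
def top_n_precision(alignments, reference, n):
    """Share of source terms with a reference translation among the n best candidates."""
    if not alignments:
        return 0.0
    hits = 0
    for source, candidates in alignments.items():
        # Keep the n candidates with the highest scores.
        best = sorted(candidates, key=lambda c: c[1], reverse=True)[:n]
        expected = reference.get(source, set())
        if any(target in expected for target, _score in best):
            hits += 1
    return hits / len(alignments)


# One-to-many alignment taken from the example.
alignments = {
    "mastectomy": [("ablation", 0.92), ("mastectomie", 0.89), ("opération", 0.48)],
}
# Hypothetical reference lexicon used for the evaluation.
reference = {"mastectomy": {"mastectomie"}}

print(top_n_precision(alignments, reference, 1))  # 0.0
print(top_n_precision(alignments, reference, 2))  # 1.0
```

The example shows why a single best candidate is not enough here: the expected translation is only the second-best alignment.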
Part II
A Tool for Post-edition
A tool for post-edition
Existing tools:
- iView (Merkel and Foo, 2007)
- ArayaTermExtractor (Waldhör 2006)
- Xerox Terminology Suite ®
Our needs:
- Deal with one-to-many alignments
- Non-aligned contexts
- Allow non-binary annotation
- Display useful information to help find the right candidate in the corpus
“Useful” information
→ Knowledge that helps capture the in vivo behavior of terms
→ Text-driven, term-oriented approach
Useful information:
- Variants
- Collocations
- Distributional neighbors
- Contexts
→ To be harvested during the term extraction / alignment process
Useful information : example
Mastectomy
risk-reducing ~, simple ~
Tumorectomy, Lumpectomy, Oophorectomy
...patient may choose to have risk-reducing bilateral mastectomy if they have a strong family history of breast cancer...

Mastectomie
~ préventive, ~ simple
Tumorectomie, Ablation, Opération
...la mastectomie préventive pourrait supprimer la grande majorité du risque de développer un cancer...
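A record holding this information for one term could look like the sketch below. The class `TermRecord` and its field names are hypothetical, and reading the "~" patterns as collocations and the related terms as distributional neighbors is our interpretation of the example, not something the slide states.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TermRecord:
    """Information displayed to the post-editor for one term (illustrative fields)."""
    term: str
    lang: str
    collocations: List[str] = field(default_factory=list)
    neighbors: List[str] = field(default_factory=list)
    contexts: List[str] = field(default_factory=list)

    def expanded_collocations(self) -> List[str]:
        """Replace the "~" placeholder with the term itself."""
        return [c.replace("~", self.term) for c in self.collocations]


source = TermRecord(
    term="mastectomy",
    lang="en",
    collocations=["risk-reducing ~", "simple ~"],
    neighbors=["tumorectomy", "lumpectomy", "oophorectomy"],
    contexts=["...patient may choose to have risk-reducing bilateral mastectomy..."],
)
target = TermRecord(
    term="mastectomie",
    lang="fr",
    collocations=["~ préventive", "~ simple"],
    neighbors=["tumorectomie", "ablation", "opération"],
    contexts=["...la mastectomie préventive pourrait supprimer..."],
)

print(source.expanded_collocations())  # ['risk-reducing mastectomy', 'simple mastectomy']
print(target.expanded_collocations())  # ['mastectomie préventive', 'mastectomie simple']
```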
Post-edition interface
http://80.82.238.151/Metricc/InterfaceValidation, user “test”, no password
Part III
Data Exchange: a TBX variant for automatically acquired lexicon
Quick introduction to TBX (1)
TBX: TermBase eXchange
- Open, XML-based standard for exchanging structured terminological data
- Approved as an international standard by LISA and ISO (norm 30042)
- Maps to TMF data model
- Subset of MARTIF
- Designed for various use cases
- Customizable
Quick introduction to TBX (2)
2 components:
- Structure: core structure based on TMF metamodel
- Content: formalism to express data-categories and their constraints
[Figure adapted from ISO norm 30042:2008, Fig. 4, p. 30: the core DTD/Schema (form) combined with an XCS file (content) gives a TBX variant: default XCS for default TBX, XCS1 ... XCSn for TBX variants 1 ... n]
Quick introduction to TBX (3)
[Figure taken from ISO norm 30042:2008, Fig. 1, p. 9. Form is defined in the DTD, content in the XCS. Data categories shown: responsibility, respPerson, termType, usageNote, corpusTrace, reliabilityCode, partOfSpeech]
TBX variant for lexicon acquired from comparable corpora
Default TBX data-categories:
- termType: entryTerm, variant
- externalCrossReference, usageNote
- partOfSpeech, frequency, reliabilityCode...
- transactionType, responsibility
+ Customized data-categories:
- occurrences, occurrenceCount
- relatedTerm
- termDefinition, definitionRelevance
- ntigReference
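To give an idea of how such data-categories sit in a TBX entry, here is a minimal sketch built with Python's standard `xml.etree.ElementTree`. The nesting `termEntry` / `langSet` / `ntig` / `termGrp` / `term` follows the TBX core structure; attaching the customized category `occurrenceCount` as a `termNote`, the helper name `build_term_entry`, the entry id and the count value are assumptions made for the example, not the encoding defined by the variant presented here.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_term_entry(entry_id, lang, term, notes):
    """Build a TBX-like term entry; `notes` maps data-category names to values."""
    entry = ET.Element("termEntry", {"id": entry_id})
    lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
    ntig = ET.SubElement(lang_set, "ntig")
    term_grp = ET.SubElement(ntig, "termGrp")
    ET.SubElement(term_grp, "term").text = term
    for category, value in notes.items():
        # Each data-category is stored as a typed termNote (an assumption
        # for the customized categories).
        note = ET.SubElement(term_grp, "termNote", {"type": category})
        note.text = str(value)
    return entry


entry = build_term_entry(
    "entry-1",  # hypothetical identifier
    "en",
    "mastectomy",
    {"termType": "entryTerm", "partOfSpeech": "noun", "occurrenceCount": 12},  # made-up values
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```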
TBX variant : A term entry
TBX variant : 1-to-n alignments
TBX variant : approved alignment
Feedback on TBX
- TBX is made for stable terminologies with little uncertainty on the status of translations, not machine-generated lexicons of “candidate translations”:
  - difficult to separate a term and its properties from its alignments
  - no data category specific to automatically estimated reliability
- Difficult to make text-driven, term-oriented knowledge fit in a concept-oriented format:
  - no definition category that would apply to a single term and not the whole concept
Conclusion
Future work
Future work
- Integration of prototype in Libellex
  - TBX import / export
  - edition of linguistic properties
- User testing (ergonomics)
- Evaluation of added-value for translation
- Explore new ways of:
  - aligning terms
  - selecting contexts
References
Post-edition prototype online : http://80.82.238.151/Metricc/InterfaceValidation/ (user “test”, no password)
Metricc project : http://www.metricc.com/
Lingua et Machina : http://www.lingua-et-machina.com/
Comparable corpora : Déjean, H., Gaussier, É. (2002) : “Une nouvelle approche à l'extraction de lexiques bilingues à partir de corpus comparables”, In Lexicometrica, Alignement Lexical dans les corpus multilingues, pp.1-22.
ArayaTermExtractor : http://www.heartsome.de
Xerox Terminology Suite : http://www.temis.com/
iView : Nyström, M., Merkel, M., Ahrenberg, L., Zweigenbaum, P., Petersson, H. and Åhlfeldt, H. (2006) : “Creating a medical English-Swedish dictionary using interactive word alignment”, In BMC Medical Informatics and Decision Making, 2006, pp. 6-35
TMF : ISO 16642 - Terminological markup framework
TBX : ISO 30042 - Systems to manage terminology, knowledge and content -- TermBase eXchange (TBX)
Data categories : ISO 12620 - Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources