interverbum falcon-10oct14-az
TRANSCRIPT
FALCON: Building the Localization Web
Andrzej ZydrońXTM International
Dave LewisCNGL at Trinity College Dublin
• Open Data on the Web: W3C Semantic Web standards allow data to be published on Web – Fine-grained URI-based inter-linking – Extensible meta-data– Standard Query APIs
• Enables a Localization Web – Terms and translations become linkable resources– Meta-data from L10n workflows adds value– Leverage in training Machine Translation and Text
Analytics
The Localization Web
The Localization Web = Decentralised Annotated Global Translation Memory and Term Base
Linguistic Linked Data: Lexicons
Red
Phonetic formForm
numbersingular
[RED]
Form
plural[REDES]
Phonetic form
number
RedSense
written form“red”
Sensewritten form
“malla”
equivalent
Red
image
Red
Sense Sensetranslation
es - enwritten form
“red” “network”
written form
Red
written form
Formgender
femenine
“red”
Use Case: BabelNet.org
Words as Resources on the Web
Barak Obama if the 44th president of the United State of America. He was first elected in 2009.
Barak Obama si el 44 º presidente de los Estados Unidos de América. Ha fue electo primera vez en 2009.
http:// www.ex.org/obama_en.html
http:// www.ex.org/obama_es.html
The Web of Content The Localization Web
http://data.ex.org/String_0001
http:// data.ex.org/String_0002
DerivedFrom
DerivedFrom
Text: “Barak Obama if the 44th president of the United State of America.”Lang:en
Text:“Barak Obama si el 44 º presidente de los Estados Unidos de América.”Lang:esTranslatedBy:Google Translate
Translated From
Translation Data
Term: “United State of America.”Lang:en
Term:“Estados Unidos de América.”Lang:es
Translation Of
http:// babelnet.org/345621
http:// babelnet.org/57835
Terminology Data
Topic: Barak ObamaLang: enBirthDate: 1961-08-04 Spouse: Michelle ObamaResidence: White House
http:// Dbpedia.org/Page/Barak_Obama
Encyclopaedic Data
Language
Resources
Language
Workers
Language
Technology
Language Lifecycle Dependencies
Parallel Text & Term base
Posteditors
Machine
Translation
Active Curation: Dynamic MT Retraining
• Tighten curation cycle: from projects to segments
– Prioritise postedits for retraining
• Prioritise Term Identification by posteditors
• Assemble MT-ready, lexically-rich term bases
• TermWeb/XTM/DCU• Introducing Next Gen Machine Translation
• Massive scale bilingual dictionaries• BabelNet
• NER: forced decoding• Dynamic retraining• Optimal segment translation route• L3Data curation, sharing
Next Generation Machine Translation
Data Management Lifecycles
Publish
Correct & refine
Lex-concept lifecycleCorrect &
refineDiscover &
use
Discover & use
Correct & refine
Bitext lifecycle
Discover data
(Re)train-MT
Revise and annotate
Publish
Content lifecycle
Publish
I18n & source QA
Trans QA
Post-edit
Automated translation
Consume Create
• Better in-context postediting: – XTM-Easyling
• Feeding term suggestions from posteditor to Terminology Management– XTM-Interverbum
• Dynamic Retraining– XTM-DCU
• Bilingual Dictionary SMT improvements– XTM-DCU
• NER, terminology enforcements, forced decoding– XTM-Interverbum-DCU
• Postediting prioritisation and term flagging– TCD-DCU-XTM
• Publishing interlinks of parallel text, lexically rich term bases– TCD: DG-T TM, EurVoc, Snomed-CT, LEMON, BabelNet
• Closing the loop – operational instrumentation of postediting– Translog, iOmegaT
Systems Integration
More Information
• Contact: [email protected]• https://www.w3.org/International/its/wiki/Open_Data_Management_for_Public_Aut
omated_Translation_Services• http://www.falcon-project.eu
• See also: – Linked Data for Language Technology (LD4LT) W3C
Community Group– http://www.w3.org/community/ld4lt/