FALCON: Building the Localization Web Andrzej Zydroń XTM International Dave Lewis CNGL at Trinity College Dublin

Upload: andrzej-zydron-mbcs

Post on 20-Mar-2017


TRANSCRIPT

Page 1: Interverbum falcon-10oct14-az

FALCON: Building the Localization Web

Andrzej Zydroń, XTM International

Dave Lewis, CNGL at Trinity College Dublin

Page 2: Interverbum falcon-10oct14-az

The Localization Web

• Open Data on the Web: W3C Semantic Web standards allow data to be published on the Web
  – Fine-grained URI-based inter-linking
  – Extensible meta-data
  – Standard query APIs
• Enables a Localization Web
  – Terms and translations become linkable resources
  – Meta-data from L10n workflows adds value
  – Leverage in training Machine Translation and Text Analytics

The Localization Web = Decentralised Annotated Global Translation Memory and Term Base

Page 3: Interverbum falcon-10oct14-az

Linguistic Linked Data: Lexicons

[Diagram: the Spanish lexical entry "red" modelled as linked data]

• Forms
  – singular: written form "red", phonetic form [RED]
  – plural: phonetic form [REDES]
  – gender: feminine
• Senses
  – equivalent: sense of "malla"
  – translation es - en: "red" to "network"
  – image
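The lexicon diagram can be read as a small data structure. The sketch below is one possible reading, not the lemon model itself: the class names, field names, and the `http://data.ex.org/lexicon/...` URIs are hypothetical, and the split of "malla" and "network" into two separate senses is an assumption about the diagram.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Form:
    # Any attribute the diagram does not show is left as None.
    written: Optional[str] = None
    phonetic: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None


@dataclass
class Sense:
    uri: str                                                     # hypothetical URI
    equivalents: List[str] = field(default_factory=list)         # same-language equivalents
    translations: Dict[str, str] = field(default_factory=dict)   # language code -> written form


@dataclass
class LexicalEntry:
    uri: str                                                     # hypothetical URI
    language: str
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)

    def translate(self, target_lang: str) -> List[str]:
        """Collect the translations recorded on any sense for target_lang."""
        return [s.translations[target_lang]
                for s in self.senses if target_lang in s.translations]


red = LexicalEntry(
    uri="http://data.ex.org/lexicon/red",
    language="es",
    forms=[
        Form(written="red", phonetic="[RED]", number="singular", gender="feminine"),
        Form(phonetic="[REDES]", number="plural"),
    ],
    senses=[
        Sense(uri="http://data.ex.org/lexicon/red#sense1", equivalents=["malla"]),
        Sense(uri="http://data.ex.org/lexicon/red#sense2", translations={"en": "network"}),
    ],
)

print(red.translate("en"))
```

Because every form and sense is a separate object with its own identifier, other data sets can link to one sense of "red" without touching the rest of the entry.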

Page 4: Interverbum falcon-10oct14-az

Use Case: BabelNet.org

Page 5: Interverbum falcon-10oct14-az

Words as Resources on the Web

The Web of Content

• http://www.ex.org/obama_en.html
  – "Barak Obama if the 44th president of the United State of America. He was first elected in 2009."
• http://www.ex.org/obama_es.html
  – "Barak Obama si el 44 º presidente de los Estados Unidos de América. Ha fue electo primera vez en 2009."

The Localization Web

Translation Data

• http://data.ex.org/String_0001
  – DerivedFrom: http://www.ex.org/obama_en.html
  – Text: "Barak Obama if the 44th president of the United State of America."
  – Lang: en
• http://data.ex.org/String_0002
  – DerivedFrom: http://www.ex.org/obama_es.html
  – Text: "Barak Obama si el 44 º presidente de los Estados Unidos de América."
  – Lang: es
  – TranslatedBy: Google Translate
  – Translated From: http://data.ex.org/String_0001

Terminology Data

• Term: "United State of America." Lang: en
• Term: "Estados Unidos de América." Lang: es
• The two terms are linked by Translation Of
• http://babelnet.org/345621
• http://babelnet.org/57835

Encyclopaedic Data

• http://Dbpedia.org/Page/Barak_Obama
  – Topic: Barak Obama
  – Lang: en
  – BirthDate: 1961-08-04
  – Spouse: Michelle Obama
  – Residence: White House
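The translation data on this slide can be sketched as subject-predicate-object triples with a pattern-matching lookup. This is a standard-library stand-in for an RDF store and a SPARQL query, not the project's implementation: the predicate names are taken from the slide labels, the `match` helper is hypothetical, and the direction of the Translated From link (String_0002 pointing at String_0001) is an assumption.

```python
EX = "http://data.ex.org/"
WEB = "http://www.ex.org/"

# (subject, predicate, object) triples; example texts are copied from the slide.
triples = [
    (EX + "String_0001", "DerivedFrom", WEB + "obama_en.html"),
    (EX + "String_0001", "Text",
     "Barak Obama if the 44th president of the United State of America."),
    (EX + "String_0001", "Lang", "en"),
    (EX + "String_0002", "DerivedFrom", WEB + "obama_es.html"),
    (EX + "String_0002", "Text",
     "Barak Obama si el 44 º presidente de los Estados Unidos de América."),
    (EX + "String_0002", "Lang", "es"),
    (EX + "String_0002", "TranslatedBy", "Google Translate"),
    (EX + "String_0002", "TranslatedFrom", EX + "String_0001"),
]


def match(store, s=None, p=None, o=None):
    """Return every triple that agrees with the given fields; None is a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# Which strings are translations of String_0001, and which tool produced them?
targets = [s for s, _, _ in match(triples, p="TranslatedFrom", o=EX + "String_0001")]
for target in targets:
    tool = match(triples, s=target, p="TranslatedBy")[0][2]
    print(target, "was translated by", tool)
```

Keeping provenance (DerivedFrom, TranslatedBy) next to the text lets a consumer filter segments by origin before reusing them, for example when selecting training data.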

Page 6: Interverbum falcon-10oct14-az

Language Lifecycle Dependencies

• Language Resources
• Language Workers
• Language Technology

Page 7: Interverbum falcon-10oct14-az

Active Curation: Dynamic MT Retraining

[Diagram: Parallel Text & Term base, Posteditors, Machine Translation]

• Tighten curation cycle: from projects to segments
  – Prioritise postedits for retraining
• Prioritise Term Identification by posteditors
• Assemble MT-ready, lexically-rich term bases
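One way to picture segment-level prioritisation is to rank postedited segments by how much the posteditor changed the MT output. The slides do not say which criterion is used, so the token-level similarity below is an assumption, and the function names and sample segments are hypothetical.

```python
from difflib import SequenceMatcher


def edit_effort(mt_output, postedit):
    """Return 1 minus the token-level similarity ratio.

    0.0 means the posteditor left the MT output untouched; values closer
    to 1.0 mean more of the segment was rewritten.
    """
    matcher = SequenceMatcher(None, mt_output.split(), postedit.split())
    return 1.0 - matcher.ratio()


def prioritise(segments):
    """Sort segments so the most heavily edited ones come first."""
    return sorted(segments,
                  key=lambda seg: edit_effort(seg["mt"], seg["postedit"]),
                  reverse=True)


# Hypothetical postedited segments.
segments = [
    {"id": 1, "mt": "the network is down", "postedit": "the network is down"},
    {"id": 2, "mt": "the red is fallen", "postedit": "the network is down"},
    {"id": 3, "mt": "the network are down", "postedit": "the network is down"},
]

for seg in prioritise(segments):
    print(seg["id"], round(edit_effort(seg["mt"], seg["postedit"]), 2))
```

Working per segment rather than per project means a heavily corrected segment can be fed back as soon as it is postedited, without waiting for the project to close.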

Page 8: Interverbum falcon-10oct14-az

Next Generation Machine Translation

• TermWeb/XTM/DCU
• Introducing Next Gen Machine Translation
• Massive scale bilingual dictionaries
• BabelNet
• NER: forced decoding
• Dynamic retraining
• Optimal segment translation route
• L3Data curation, sharing

Page 9: Interverbum falcon-10oct14-az

Data Management Lifecycles

[Diagram: three interlinked lifecycles]

• Lex-concept lifecycle: Publish, Discover & use, Correct & refine
• Bitext lifecycle: Discover data, (Re)train-MT, Revise and annotate, Publish
• Content lifecycle: Create, I18n & source QA, Automated translation, Post-edit, Trans QA, Publish, Consume
• Discover & use and Correct & refine recur across the diagram

Page 10: Interverbum falcon-10oct14-az

Systems Integration

• Better in-context postediting
  – XTM-Easyling
• Feeding term suggestions from posteditor to Terminology Management
  – XTM-Interverbum
• Dynamic Retraining
  – XTM-DCU
• Bilingual Dictionary SMT improvements
  – XTM-DCU
• NER, terminology enforcement, forced decoding
  – XTM-Interverbum-DCU
• Postediting prioritisation and term flagging
  – TCD-DCU-XTM
• Publishing interlinks of parallel text, lexically rich term bases
  – TCD: DG-T TM, EuroVoc, Snomed-CT, LEMON, BabelNet
• Closing the loop: operational instrumentation of postediting
  – Translog, iOmegaT
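Term flagging can be sketched as a check that every term-base entry found in the source segment has its approved target term in the MT output. This is a minimal illustration under stated assumptions, not the integration described on the slide: the term base, the sample segment, and the function names are hypothetical, and matching is plain case-insensitive whole-word search with no morphology.

```python
import re


def contains(term, text):
    """Case-insensitive whole-word (or whole-phrase) match."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def flag_terms(source, mt_output, term_base):
    """Return (source_term, expected_target) pairs whose target is missing from mt_output."""
    return [(src, tgt) for src, tgt in term_base.items()
            if contains(src, source) and not contains(tgt, mt_output)]


# Hypothetical es -> en term base and segment.
term_base = {
    "red": "network",
    "Estados Unidos de América": "United States of America",
}
source = "La red de los Estados Unidos de América"
mt_output = "The red of the United States of America"

print(flag_terms(source, mt_output, term_base))
```

Flagged pairs could be shown to the posteditor as term suggestions or used as constraints for the MT engine; which of the two applies depends on the integration.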

Page 11: Interverbum falcon-10oct14-az

More Information

• Contact: [email protected]
• https://www.w3.org/International/its/wiki/Open_Data_Management_for_Public_Automated_Translation_Services
• http://www.falcon-project.eu
• See also:
  – Linked Data for Language Technology (LD4LT) W3C Community Group
  – http://www.w3.org/community/ld4lt/