Multilingual Data Value Chain for CEF Automated Translation:
Interoperability Plan
CEF.AT workshop, Luxembourg, 22 Sept 2015
Dave Lewis, ADAPT Centre
[Diagram: Sustainable ML Data Value Chain. Labels: Translators; CEF.AT; Improved Productivity; Low Cost MT; Domain Adapted; Approved Terminology; Language Pairs; Language Resources; TM; Term base; Domain knowledge; Quality Assurance]
• Language Resources: Discover, Check Rights, Select & Use
• Translation Productivity: Postedit, Consume, Quality Assure
• LR Enrichment: Enrichment Services, Validate Annotation, Self-build Micro-domains
Value Chain Interoperability
• Move from Archival Curation to Active Curation
• Open meta-data published at source:
  – W3C Data Catalog Vocabulary (DCAT)
  – Legacy meta-data conversion & validation
  – Concretely: Meta-Share Linked Data Mapping
    • http://www.w3.org/community/ld4lt/wiki/Meta-Share_OWL_metamodel
• Searchable Cataloguing Service:
  – Concretely: LingHub
    • http://linghub.lider-project.eu/
• Machine-readable rights/licence
  – W3C Open Digital Rights Language (ODRL)
    • https://www.w3.org/community/odrl/
  – Use for Translation IP
LR: Discovery & Usage Rights
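As a rough illustration of meta-data published at source with a machine-readable licence, the sketch below builds a JSON-LD description of a parallel-text dataset using DCAT and Dublin Core terms, and attaches a minimal ODRL policy. The function name, the example IRIs and the choice of a single `odrl:use` permission are assumptions made for illustration; they are not a profile proposed in the slides.

```python
import json

# Namespace IRIs of the W3C / DCMI vocabularies.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "odrl": "http://www.w3.org/ns/odrl/2/",
}


def describe_parallel_corpus(dataset_iri, title, languages, download_url, policy_iri):
    """Return a JSON-LD document for a dataset and its usage policy.

    Hypothetical helper: the property selection is illustrative only.
    """
    dataset = {
        "@id": dataset_iri,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:language": list(languages),
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": download_url},
            # The licence points at the machine-readable policy below.
            "dct:license": {"@id": policy_iri},
        },
    }
    policy = {
        "@id": policy_iri,
        "@type": "odrl:Policy",
        "odrl:permission": {
            "odrl:target": {"@id": dataset_iri},
            "odrl:action": {"@id": "odrl:use"},
        },
    }
    return {"@context": CONTEXT, "@graph": [dataset, policy]}


# Example IRIs are placeholders, not real resources.
doc = describe_parallel_corpus(
    "http://example.org/corpus/annual-reports",
    "Annual reports EN-FR",
    ["en", "fr"],
    "http://example.org/corpus/annual-reports.tmx",
    "http://example.org/policy/open-use",
)
serialised = json.dumps(doc, indent=2)
```

A cataloguing service could harvest such a record and filter on the policy before any data is downloaded.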
• Linked data from existing formats:
  – TMX, XLIFF to W3C CSV-on-the-Web to RDF
• Selection meta-data
  – Provenance (MT or PE) & translation language codes
  – Dereferenceable segments for open annotation of terms
• MT Web Service APIs
  – Forced decoding with term translations
  – Iterative Re-training API
  – MT log data: out-of-vocabulary & forced terms to inform PE productivity
LR Select & Use
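The first step of the TMX to CSV-on-the-Web route can be sketched as follows: translation units are flattened into one CSV row each, carrying language codes and a provenance value. The column names and the `x-provenance` property type are assumptions for this sketch; TMX allows custom `<prop>` types but does not define this one.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu tuid="1">
      <prop type="x-provenance">PE</prop>
      <tuv xml:lang="en"><seg>Annual report</seg></tuv>
      <tuv xml:lang="fr"><seg>Rapport annuel</seg></tuv>
    </tu>
    <tu tuid="2">
      <prop type="x-provenance">MT</prop>
      <tuv xml:lang="en"><seg>Public procurement</seg></tuv>
      <tuv xml:lang="fr"><seg>Marchés publics</seg></tuv>
    </tu>
  </body>
</tmx>"""

FIELDS = ["id", "source_lang", "target_lang", "source", "target", "provenance"]


def tmx_to_rows(tmx_text, src_lang, tgt_lang):
    """Flatten TMX translation units into dict rows for one language pair."""
    root = ET.fromstring(tmx_text)
    rows = []
    for tu in root.iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
        if src_lang not in segs or tgt_lang not in segs:
            continue  # skip units that lack either language
        rows.append({
            "id": tu.get("tuid"),
            "source_lang": src_lang,
            "target_lang": tgt_lang,
            "source": segs[src_lang],
            "target": segs[tgt_lang],
            # Assumed custom property; empty when absent.
            "provenance": tu.findtext("prop[@type='x-provenance']", default=""),
        })
    return rows


def rows_to_csv(rows):
    """Serialise rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = tmx_to_rows(SAMPLE_TMX, "en", "fr")
csv_text = rows_to_csv(rows)
```

A CSV-on-the-Web meta-data file describing these columns would then let a generic converter emit RDF; that step is not shown here.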
• Bottom line: did MT make translation more productive?
• Measure #1: Post-editing effort
  – A/B test on total segment post-editing time
  – Open Edit Vector format
  – iOmegaT: instrumented open source CAT tool
  – Edit vector analysis tool (licensable)
• Measure #2: ML Web Site analysis
  – A/B test on translated web pages (MT vs PE vs HT)
  – Easyling web translation proxy
Translation Productivity
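One way to analyse an A/B test on segment editing time is a two-sided permutation test on the difference in mean time per segment. The sketch below is a generic statistical illustration, not the method used by iOmegaT or the edit vector analysis tool; the timing values are invented for the example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(times_a, times_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean time.

    Returns (observed difference mean(a) - mean(b), p-value).
    """
    rng = random.Random(seed)
    observed = mean(times_a) - mean(times_b)
    pooled = list(times_a) + list(times_b)
    n_a = len(times_a)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign segments to the two conditions at random.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Invented per-segment times in seconds.
translate_from_scratch = [62.0, 75.5, 58.0, 91.0, 66.5, 80.0]
post_edit_mt = [41.0, 55.5, 39.0, 60.0, 47.5, 52.0]

observed, p_value = permutation_test(
    translate_from_scratch, post_edit_mt, n_resamples=2000
)
```

In practice the segments in the two conditions should be comparable in length and difficulty, or the times should be normalised per word before testing.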
• Enrich segments with links to open lexical-conceptual resources
  – Word Sense Disambiguation, Entity Linking, Automated Term Extraction
    • Babelfy API
    • DBpedia Spotlight, TaaS APIs
• Open validation
  – Publish positive/negative validation of enrichment from translation projects
  – In-context validation from project posteditors and terminologists using TBX status flags
LR Enrichment
• Goal: Reduce cost of collecting and selecting parallel data
• Agree & Promote DCAT Profile for publishing public sector parallel text
• Establish suite of common machine-readable licences (ODRL)
• DCAT and licence meta-data profile for standardised parallel text format
  – XLIFF 2.0 module
  – TMX update – new OASIS TC
  – CSV on the Web
• Linghub as basis for public index/search service
• Minimise distance between published parallel text and meta-data passed along translation value chain
Interoperability Plan: Parallel Text
• Goal: Make it easy for public bodies to measure impact of MT on their translation processes
• Agree/Promote Open Edit Vector format
  – Encourage integration in CAT tools
• Guidelines on A/B testing, analysis and interpretation
• Open feedback channels to CEF.AT
Interoperability Plan: Productivity
• Goal: annotate segments with links to terms and lexical-conceptual resources
• Agree/promote Open Annotation links
  – XLIFF 2.1: inline ITS Terminology and TextAnalysis attributes or standoff with XLIFF fragment
  – Need similar ITS profile and fragment for TMX
  – Profile W3C CSV-on-the-Web with Open Annotation
• Guidelines on dereferencing links to term bases or lexical-conceptual resources
  – W3C Ontolex group
• Validation workflow and feedback
  – Trials with FREME, Babelfy, others
Interoperability Plan: LR Enrichment
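The inline annotation option can be illustrated with ITS 2.0 local attributes (`its:term`, `its:termInfoRef`, `its:taIdentRef`, `its:taClassRef`). In the sketch below the `seg` and `mrk` elements are simplified stand-ins and the output is not claimed to be conformant XLIFF 2.1; the helper name and the example resource IRIs are assumptions.

```python
import xml.etree.ElementTree as ET

# ITS 2.0 namespace.
ITS_NS = "http://www.w3.org/2005/11/its"
ET.register_namespace("its", ITS_NS)


def its(name):
    """Qualified attribute name in the ITS namespace."""
    return f"{{{ITS_NS}}}{name}"


def annotate_span(text, start, end, ident_ref, class_ref=None, term_info_ref=None):
    """Wrap text[start:end] in a marker carrying ITS link attributes."""
    seg = ET.Element("seg")
    seg.text = text[:start]
    mrk = ET.SubElement(seg, "mrk")
    mrk.text = text[start:end]
    mrk.tail = text[end:]
    # Text Analysis: link to the identified entity or concept.
    mrk.set(its("taIdentRef"), ident_ref)
    if class_ref:
        mrk.set(its("taClassRef"), class_ref)
    # Terminology: flag the span as a term and point at its term entry.
    if term_info_ref:
        mrk.set(its("term"), "yes")
        mrk.set(its("termInfoRef"), term_info_ref)
    return seg


sentence = "The Commission adopted the regulation."
segment = annotate_span(
    sentence,
    4,
    14,
    ident_ref="http://dbpedia.org/resource/European_Commission",
    class_ref="http://dbpedia.org/ontology/Organisation",
    term_info_ref="http://example.org/termbase/entry/123",  # placeholder IRI
)
xml_text = ET.tostring(segment, encoding="unicode")
```

Because the links are plain IRIs, a post-editor's tool can dereference them to show the term entry or concept in context, and a validation decision can be recorded against the same IRI.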
THANK YOU!