What Machine Translation Developers Are Doing to Make Post-editors Happy
TRANSCRIPT
What MT developers are doing… to make post-editors happy
John Tinsley, CEO and Co-founder
WPTP4 @ MT Summit, Miami, 3rd November 2015
We provide Machine Translation solutions with Subject Matter Expertise
An MT solutions and services provider, specialising in customised solutions with subject matter expertise for specific technical sectors such as patents/IP, life sciences, and finance.
MT Application Areas

MT for Information Purposes
• Development focuses on improving key information translation
• Terminology is important
• Evaluation driven by “usability”

MT for Post-editing Productivity
• Development focuses on reducing the edits required
• Feedback loop is crucial
• Evaluation through practical translation tasks
Use cases in practice
• Product descriptions to open new markets
• MT for post-editing productivity across industries
• Developer, and user, of MT for web content
• Tens of thousands of people using online tools daily
“Four Pillars of Happiness” supporting TRANSLATION:

QUALITY: ensuring the output is the highest quality possible
EVALUATION: letting users know how good to expect the output to be
INTEGRATION: making sure the MT fits seamlessly into the workflow
FEEDBACK: bringing the translator into the loop to effect change
Quality
There’s no silver bullet when it comes to improving MT quality.
• What is being done to improve MT*
  a) on a broader, technology level?
  b) on a lower level for specific languages / domains?
*not with the express purpose of making post-editors happy ☺
a) On a broader, technology level:
– Neural networks and deep learning
  • something new, totally different, the future?
– Online adaptive MT
  • improving specific engines rapidly [feedback]
– Syntax-based MT (tree-to-string, etc.)
  • incorporating elements of linguistics
b) On a lower level, for specific languages:
– Chinese
  • segmentation, the 的 (de) particle
– German
  • long-distance verb movements, compound splitting / joining (see the sketch below)
– Irish
  • more fundamental: data collection, resource development
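As an aside, compound splitting is easy to picture in code. Below is a toy greedy splitter over a hypothetical mini-lexicon; production systems use corpus statistics and handle German linking elements ("Fugenelemente"), so treat this purely as an illustration of the idea.

```python
# Toy greedy German compound splitter (illustrative only).
# Real systems use corpus frequencies and linking elements.
VOCAB = {"donau", "dampf", "schiff", "fahrt", "haus", "tür"}  # hypothetical lexicon

def split_compound(word, vocab=VOCAB, min_len=3):
    """Recursively split `word` into known sub-words, left to right."""
    word = word.lower()
    if word in vocab:
        return [word]
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocab:
            rest = split_compound(tail, vocab, min_len)
            if rest:
                return [head] + rest
    return None  # no split found

print(split_compound("Donaudampfschifffahrt"))
# -> ['donau', 'dampf', 'schiff', 'fahrt']
```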
…and for specific domains / use cases:
– MT for User Generated Content @ …
  • how to handle misspellings, text speak, etc.
– Patent-focused MT @ Iconic
  • concentrating on the mix of technical language and style
– MT for online course materials @ TraMOOC
  • European H2020 project
Evaluation
• Objectively provide stakeholders with information such as:
  a) general quality expectations of an MT engine
  b) how it’s impacting individual translators’ performance
  c) what specific areas could be improved
Lots of different ways to do evaluation:
– automatic scores
  • BLEU, METEOR, GTM, TER (see the scoring sketch below)
– fluency, adequacy, comparative ranking
– task-based evaluation
  • error analysis, post-edit productivity
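To make the automatic scores concrete, here is a minimal sentence-level BLEU computation using NLTK's implementation. The segments are invented, and real evaluations score whole test sets rather than single sentences.

```python
# Minimal sentence-level BLEU with NLTK (one of many toolkits that
# implement the automatic metrics listed above).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the patent application was filed in 2014".split()]
hypothesis = "the patent application is filed in 2014".split()

# Smoothing avoids zero scores on short segments with missing n-grams.
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```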
Different metrics, different intelligence
– what does each type of metric tell us?
– which ones are usable at which stage of evaluation?
  e.g. can we really use automatic scores to assess productivity?
  e.g. does a productivity delta really tell us how good the output is?
MT Evaluation – where do we start!?
Evaluation Case Study – RWS
– UK-headquartered public company
– founded 1958
– 9th largest LSP (CSA 2013 report)
– leader in specialist IP translations

Problem: a large Chinese-to-English patent translation project, with challenging content and language.

Question: what efficiencies, if any, can machine translation add to the workflow of RWS translators?

We applied different types of MT evaluation at different stages in the process, at various go/no-go points, to help RWS assess whether MT is viable for this project.
Step 1: Are the engines any good? Can we improve our baseline engines through customisation?
[Chart: BLEU and TER scores (0–0.8) for the Iconic Baseline vs. Iconic Customised engines]
What next?
How good is the output relative to the task, i.e. post-editing?
– fluency/adequacy are not going to tell us
– let’s start with segment-level TER (Translation Edit Rate, which correlates well with practical evaluations)
– huge improvement
– intuitively, the scores reflect well, but they don’t really say anything on their own
– let’s dig deeper
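For intuition, here is a simplified segment-level TER sketch: word-level edit distance divided by reference length. Full TER also counts block shifts as single edits, which this toy version omits (making it closer to WER), and the example sentences are invented.

```python
# Simplified segment-level TER: word-level edit distance divided by
# reference length. Full TER also allows block shifts; omitted here.
def simple_ter(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(hyp)][len(ref)] / max(len(ref), 1)

print(simple_ter("the device comprises a sensor",
                 "the device includes a sensor unit"))  # ~0.33
```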
If we look deeper, what can we learn?
INTELLIGENCE
• Proportion of full matches (i.e. big savings)
• Proportion of close matches (i.e. faster than fuzzy matches)
• Proportion of poor matches
ACTIONABLE INFORMATION
• Type of sentence with high/low matches
• Weaknesses and gaps
• Segments to compare and analyse in translation memory
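This kind of intelligence can be pulled from segment-level scores with a few lines of analysis. The sketch below buckets hypothetical TER scores into match bands, analogous to TM fuzzy-match bands; the thresholds and scores are illustrative assumptions, not the ones used in the study.

```python
# Bucket segment-level TER scores into match bands.
# Thresholds here are hypothetical, not Iconic's.
from collections import Counter

def match_band(ter):
    if ter == 0.0:
        return "full match"   # no edits needed: big savings
    if ter <= 0.3:
        return "close match"  # faster than low fuzzy matches
    return "poor match"       # likely retranslation

segment_ters = [0.0, 0.12, 0.25, 0.48, 0.0, 0.75, 0.31]  # example scores
bands = Counter(match_band(t) for t in segment_ters)
total = len(segment_ters)
for band, n in bands.items():
    print(f"{band}: {n / total:.0%}")
```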
Step 2: Are they any good for post-editing?
[Chart: distribution of segment-level TER scores, plotted against segment length]
Step 3: Quantifying with ACTUAL translators

Productivity Test
With MT experience and previous MT integration, productivity testing can be run in the production environment. In this case, we used the TAUS Dynamic Quality Framework.
Step 3: Productivity testing
Beware the variables!
• Translators: different experience, speed, perceptions of MT
  – 24 translators: senior, staff, and interns
• Test sets: not representative; particularly difficult
  – 2 test sets, comprising 5 documents, and cross-fold validation
• Environment and task: inexperience and unfamiliarity
  – training materials, videos, and “dummy” segments
Findings and Learnings

Overall average: 25% productivity gain

By Translator Profile
– Experienced: 22%
– Staff: 23%
– Interns: 30%
What it tells us: roll out with junior staff for a more immediate impact on the bottom line?

By Test Set
– Test set 1.1: 25%
– Test set 1.2: 35%
– Test set 2.1: 6%
– Test set 2.2: 35%
What it tells us: correlates with TER. Don’t be overly concerned by outliers. Use the data to facilitate source content profiling?
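For reference, the productivity delta itself is a simple calculation over timed tasks: words per hour when post-editing MT versus translating from scratch. The figures below are made up for illustration; the RWS study measured these per translator and per test set.

```python
# Productivity gain from timed tasks; all figures invented.
def productivity_gain(words, scratch_hours, postedit_hours):
    scratch_rate = words / scratch_hours    # words/hour from scratch
    postedit_rate = words / postedit_hours  # words/hour post-editing MT
    return (postedit_rate - scratch_rate) / scratch_rate

# e.g. 5,000 words: 10h from scratch vs. 8h post-editing MT
print(f"{productivity_gain(5000, 10, 8):.0%}")  # 25%
```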
Evaluation
• Objectively provide stakeholders with information such as:
  a) general quality expectations of an MT engine ✔
  b) how it’s impacting individual translators’ performance ✔
  c) what specific areas could be improved ✔
Now we actually talk to the translators to get their feedback on the task and the MT output, and start that virtuous loop… we’ll come back to this.
Quality Estimation and other features

Metrics
• WMT metrics shared task
• new(er) metrics designed to correlate with post-editing effort
• optimising MT engines on new / different metrics

Quality Estimation: estimating the quality of MT output in real time, at runtime
• binary classification (good/bad)
• multi-label classification, scores
• word-level error categorisation
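As a rough illustration of sentence-level quality estimation, here is a binary good/bad classifier sketch using scikit-learn. The features and training data are invented stand-ins for the much richer feature sets real QE systems extract.

```python
# Sketch of sentence-level binary quality estimation (good/bad).
from sklearn.linear_model import LogisticRegression

# Per-segment features: [source length, target/source length ratio,
# out-of-vocabulary rate] -- all made up for illustration.
X_train = [
    [12, 1.1, 0.00],
    [35, 0.6, 0.20],
    [ 8, 1.0, 0.05],
    [40, 1.9, 0.30],
]
y_train = [1, 0, 1, 0]  # 1 = good enough to post-edit, 0 = retranslate

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[15, 1.05, 0.02]]))  # e.g. [1] -> route to post-editing
```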
Feedback
Engaging end users – post-editors and LSPs – both directly and indirectly, to take feedback on board for the betterment of MT.
Direct Feedback
• talking to the translators (imagine!)
• collecting structured feedback (see the sketch below)
  – error categorisation
  – correction
  – severity
• commenting on error types and actions
Establish a relationship and understanding to foster acceptance
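Structured feedback is easiest to act on when it has a fixed shape. Below is one possible record layout as a sketch; the field names and categories are hypothetical, not any particular vendor's schema.

```python
# One possible shape for a structured feedback record; field names
# and categories are hypothetical.
from dataclasses import dataclass

@dataclass
class PostEditFeedback:
    segment_id: str
    mt_output: str
    correction: str       # the post-edited version
    error_category: str   # e.g. "terminology", "word order", "omission"
    severity: str         # e.g. "minor", "major", "critical"
    comment: str = ""     # free-text note on the error and action taken

fb = PostEditFeedback(
    segment_id="doc42-s7",
    mt_output="The sensor data transfers to the module.",
    correction="The sensor data are transferred to the module.",
    error_category="grammar",
    severity="minor",
    comment="Passive voice required; recurring pattern in this engine.",
)
```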
Understanding the MT developer

The machine translation engine will never be 100% perfect. Certain types of sentences will always lend themselves better to MT than others. Our joint goal is to get the machine translation quality to a level where the majority of sentences are translated well, and the process of post-editing is faster and more efficient than piecing together translations from a combination of fuzzy matches, terminology, and reference translations.

There are certain types of MT output errors that can be fixed quickly and easily, while others are more fundamental issues that will be fixed through general improvement of the engines and the technology itself over time. Here are some examples of each:

Quick Fixes
– technical terminology
– frequent, consistent set phrases
– stylistic/formatting errors

Fixed Over Time
– general grammatical errors
– sentence-level disfluency
– noun phrase ordering

If we encounter an error that is just a “minor” mistake and, in general, the context around it is OK, sometimes the best approach is to simply leave it for post-editing.
Indirect Feedback
• terminology management
• automatic post-edit rules (see the sketch below)
• templates for generalisation

Empowering the translator to effect change themselves.
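Automatic post-edit rules can be as simple as ordered search-and-replace patterns applied to MT output before it reaches the translator. The rules below are invented examples of the kind of quick, consistent fixes this covers.

```python
# Sketch of indirect feedback as automatic post-edit rules: simple
# search/replace patterns applied to MT output before the translator
# sees it. All rules below are invented examples.
import re

APE_RULES = [
    (re.compile(r"\bcolor\b"), "colour"),       # locale convention
    (re.compile(r"\bthe the\b"), "the"),        # frequent MT doubling
    (re.compile(r"(\d)\s*°\s*C\b"), r"\1 °C"),  # formatting template
]

def apply_ape_rules(text):
    for pattern, replacement in APE_RULES:
        text = pattern.sub(replacement, text)
    return text

print(apply_ape_rules("Set the the color threshold to 20 °C."))
# -> "Set the colour threshold to 20 °C."
```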
Integration
• Make MT fit as seamlessly as possible into the translator workflow
  a) directly into existing CAT tools
  b) new CAT tools
  c) what else would you like? ☺
• Most CAT tools have MT plugins for most MT vendors
  – Studio, MemoQ, Wordfast, MultiTrans
• Matecat making MT more central
  – facilitating online learning technology too
• Highlighting, instrumentation, TM / MT cooperation
“The biggest room in the world is the room for improvement”
Thank You! [email protected]
@IconicTrans