Page 1
Measuring Translation Quality in Today’s Automated Lifecycle
Arle Lommel & Aljoscha Burchardt (DFKI) with help from Lucia Specia (University of Sheffield) and Hans Uszkoreit (DFKI)
Funded by the 7th Framework Programme of the European Commission under contract 296347.
Page 2
PROBLEMS IN ASSESSING QUALITY
Page 3
95% of professionally translated content at one major LSP is never evaluated for
quality.
Page 4
“Translators are the garbage collectors of the documentation world” – Alison Toon, HP
Page 5
“I know it when I see it”
Page 6
Machine translation quality scores (BLEU, NIST, etc.) often
diverge sharply from human evaluation.
Page 7
Automatic MT metrics require reference translations:
they cannot be used for production purposes
Page 8
Change the reference translation(s) and
the score changes
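As an illustration of this reference sensitivity, here is a minimal sketch using NLTK's sentence_bleu; the hypothesis and reference sentences are invented, but they show how the same MT output scores very differently against two equally acceptable references.

# Sketch of BLEU's reference dependence (invented sentences).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hypothesis = "the cat sits on the mat".split()
smooth = SmoothingFunction().method1  # avoid zero scores on short sentences

# Same MT output, two equally acceptable reference translations:
ref_a = ["the cat sits on the mat".split()]
ref_b = ["a cat is sitting on the mat".split()]

print(sentence_bleu(ref_a, hypothesis, smoothing_function=smooth))  # 1.0
print(sentence_bleu(ref_b, hypothesis, smoothing_function=smooth))  # much lower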
Page 9
The problem with BLEU
BLEU improvement, but no substantial improvement for human use
Page 10
The problem with BLEU
Substantial human improvement but no BLEU improvement
Page 11
Human quality assessment takes too much time.
Page 12
Sampling is random but errors are not.
Page 13
Wait a minute… What do you mean by
quality?
Page 14
Quality: A New Definition
A quality translation demonstrates required accuracy and fluency
for the audience and purpose and complies with all other negotiated specifications,
taking into account end-user needs.
Source: Alan Melby
Page 15
Sounds simple, right?
Page 16
It’s actually quite radical and it drags translation
kicking and screaming into the modern world of
quality management
Page 17
Multidimensional Quality Metrics
Page 18
Why not use a shared metric?
Page 20
LISA QA Model, SAE J2450, SDL TMS, Acrocheck, ApSIC XBench, CheckMate, QA Distiller, XLIFF:Doc, EN 15038…
Page 21
All of them disagree* about what is important to
quality
*The only thing they agree on is terminology
Page 22
(Probably because there is no single set of criteria that applies to all kinds of
translation)
Page 23
There is no one-size-fits-all metric
Page 24
MQM provides a catalog of issue types suitable for
various tasks
Page 27
Wait! Weren’t we trying to improve things?
(That looks like a bowl of noodles!)
Page 29
Accuracy and Fluency. What’s Verity?
Page 30
Verity provides a way to deal with the text in relation to
the real world.
Page 31
You don’t use all of MQM (or its core): you use the
parts you need.
Page 32
MQM for MT Diagnostics
Page 34
MQM lets you declare your quality metric in a shared
vocabulary.
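A sketch of what such a declaration might look like, for illustration only: a metric as a selection of MQM issue types plus severity weights. The issue subset, the weights, and the per-word scoring formula are assumptions made for this sketch, not values mandated by MQM.

# Hypothetical declaration of a task-specific metric: a subset of MQM
# issue types to check, plus assumed severity weights.
METRIC_ISSUES = {
    "accuracy/mistranslation",
    "accuracy/omission",
    "fluency/grammar",
    "fluency/spelling",
    "terminology",
}

SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}  # assumed values

def quality_score(errors, word_count):
    """errors: (issue_type, severity) pairs marked by a reviewer."""
    penalty = sum(SEVERITY_WEIGHTS[severity]
                  for issue, severity in errors if issue in METRIC_ISSUES)
    return max(0.0, 1.0 - penalty / word_count)  # 1.0 = no penalized errors

# One minor grammar error and one major omission in a 250-word text:
print(quality_score([("fluency/grammar", "minor"),
                     ("accuracy/omission", "major")], word_count=250))  # 0.976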
Page 35
Dimensions help you decide what to check
(and also help you communicate with your LSP)
Page 36
No more assuming what the parties want or how to check it
Page 37
12 Dimensions (from ISO/TS 11669)
1. Language/locale
2. Subject field/domain
3. Terminology (source/target)
4. Text type
5. Audience
6. Purpose
7. Register
8. Target text style
9. Content correspondence
10. Output modality
11. File format
12. Production technology
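To show how these parameters could be recorded so that nothing is left to assumption, here is a hypothetical structured project specification covering the twelve parameters above; every field name and value is invented for illustration.

# Hypothetical project specification; one entry per ISO/TS 11669 parameter.
project_spec = {
    "language_locale":        {"source": "en-US", "target": "de-DE"},
    "subject_field":          "automotive service documentation",
    "terminology":            "client termbase v3 (source and target)",
    "text_type":              "maintenance manual",
    "audience":               "trained service technicians",
    "purpose":                "instruct repair procedures safely",
    "register":               "formal, imperative",
    "target_text_style":      "client style guide, 2013 edition",
    "content_correspondence": "full translation, no adaptation",
    "output_modality":        "print and online help",
    "file_format":            "XLIFF 1.2",
    "production_technology":  "TM + MT with full post-editing",
}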
Page 38
Open-source tools* to demonstrate MQM
*translate5 source code is published. Other tools’ code will be published in 2014
Page 39
An Online Tool for Building Dimensions and Metrics
http://www.qt21.eu/MQM
Page 40
Tabular Scorecard
http://www.qt21.eu/MQM
Page 41
Ergonomic Scorecard
http://www.qt21.eu/MQM
Page 42
translate5
DEMO: http://www.translate5.net
Page 43
Currently discussing with TAUS how to harmonize MQM
and the DQF Error Typology
Page 44
Quality Estimation (QuEst)
Page 45
How can you evaluate MT quality when you don’t
have reference translations?
Page 46
QuEst (Quality Estimation): An open-source tool for
estimating translation quality
Page 47
Quality Estimation (QE) Metrics
• Automatic metrics that provide an estimate of the quality of (machine-)translated segments, without reference translations
• Quality defined according to the problem at hand:
  • Adequacy
  • Fluency
  • Post-editing effort, etc.
Page 48
Task-Based Quality
• Does it need human revision to achieve HT quality?
• Can a reader get the gist?
• How much effort is required to post-edit the text? (If we know this, we have a business case for MT.)
Page 49
QuEst Framework
• Open-source tool for QE: http://www.quest.dcs.shef.ac.uk/
• E.g., predict 1-5 scores for post-editing effort:
  • 1 = highest, 5 = lowest
• English-Spanish news data, but can be used for other language pairs
• System built from 1,000 examples of translated segments annotated by humans
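To make the idea concrete, here is a minimal sketch of reference-free quality estimation in this spirit: shallow features of a source/MT pair feed a regressor trained on human effort scores. This is not QuEst's actual feature set or API; the features and the tiny training set are invented stand-ins, and scikit-learn is assumed.

# Sketch of QE: shallow features + a regressor trained on human scores.
# NOT QuEst's feature set or API; data below is invented.
from sklearn.svm import SVR
import numpy as np

def features(source, mt_output):
    src, tgt = source.split(), mt_output.split()
    return [
        len(src),                          # source length
        len(tgt) / max(len(src), 1),       # length ratio
        len(set(tgt)) / max(len(tgt), 1),  # target type/token ratio
    ]

# Stand-in for the ~1,000 human-annotated (source, MT, score) examples,
# on the 1-5 scale above (1 = best):
train_pairs = [("the engine stalls", "el motor se cala", 1.2),
               ("press the red button", "presione botón el rojo", 3.8),
               ("check oil level daily", "comprobar nivel aceite", 2.5)]

X = np.array([features(s, t) for s, t, _ in train_pairs])
y = np.array([score for _, _, score in train_pairs])

model = SVR().fit(X, y)
print(model.predict([features("open the hood", "abra el capó")]))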
Page 50
Uses a set of bilingual training data* to establish linguistic baselines
*Uses source+MT for training, but can also use extra resources (e.g., language models trained on TM)
Page 51
Provides an estimate for how well the translation fits with
your existing translations
Page 52
QuEst can rank multiple translations to find the best
one
Page 53
QuEst Rating Example
Reliability: QuEst ratings differ from human-assigned scores by 0.61 on average
Page 54
No more random samples. QuEst can identify sentences that
are likely to require human attention
Page 55
QuEst can tell you where it makes sense to post-edit and
where it makes sense to start from scratch.
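One way this routing could look in code, as a sketch: segments with low predicted scores (1 = best on the earlier slide's scale) go to post-editing, the rest to retranslation. The 3.5 threshold and the route function are assumptions for illustration, not part of QuEst.

# Hypothetical routing by predicted score (1 = best, 5 = worst);
# the 3.5 threshold is an assumption, not a QuEst recommendation.
def route(segments_with_scores, threshold=3.5):
    post_edit, from_scratch = [], []
    for segment, predicted in segments_with_scores:
        (post_edit if predicted <= threshold else from_scratch).append(segment)
    return post_edit, from_scratch

scored = [("el motor se cala", 1.4),
          ("presione botón el rojo", 4.2)]
pe, scratch = route(scored)
print("post-edit:", pe)
print("retranslate:", scratch)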
Page 56
QuEst Improves Post-Editing Time
Language   PE time without QE     PE time with QE        Increase in PE productivity
FR→EN      0.75 words/second      1.09 words/second      45%
EN→ES      0.32 words/second      0.57 words/second      78%
Page 57
QuEst + MQM: Targeted quality evaluation combining the strengths of
humans and machines
Page 58
MQM will be turned over to industry for long-term
maintenance and eventual standardization
Page 59
Learn more at http://www.qt21.eu
Page 60
Detailed presentation covering MQM pilot study
(in German) today at 5:15
Page 61
Join us tomorrow morning for a detailed demonstration
of how to use MQM (9:45–10:30)