In an ideal world

Linguistic information is seamlessly combined with statistical information in translation systems to produce perfect translations

We are moving in that direction:
◦ Morphology
◦ Syntax
◦ Semantics (SRL): (Wu & Fung 2009), (Liu & Gildea 2010), (Aziz et al. 2011)

Meanwhile…

Linguistic information to evaluate MT quality
◦ Based on reference translations

Linguistic information to estimate MT quality
◦ Using machine learning

Linguistic information to detect errors in MT
◦ Automatic post-editing

Handle variations in MT (words and structure) with respect to the reference, or identify differences between MT and reference

◦ METEOR (Denkowski & Lavie 2011): words and phrases
◦ (Giménez & Màrquez 2010): matching of lexical, syntactic, semantic and discourse units
◦ (Lo & Wu 2011): SRL and manual matching of ‘who’ did ‘what’ to ‘whom’, etc.
◦ (Rios et al. 2011): automatic SRL with automatic (inexact) matching of predicates and arguments

Essentially: matching of linguistic units
◦ Similar to n-gram matching metrics, but units are not only words

Metrics based on lexical units perform better

Issues:
◦ Lack of (good) resources for certain languages
◦ Unreliable processing of incorrect translations
◦ Sparsity at sentence level, depending on the actual features, e.g. matching of named entities

Goal: given the output of an MT system for a given input, provide an estimate of its quality

Uses:
◦ Filter out bad-quality translations from post-editing
◦ Select “perfect” translations for publishing
◦ Flag unreliable translations for readers of the target language only
◦ Select the best translation for a given input when multiple MT/TM systems are available

NOT standard MT evaluation:
◦ Reference translations are NOT available
◦ Estimation for unseen translations

My approach:
◦ Translation unit: sentence
◦ Independent from MT system

1. Define aspect of quality to estimate and how to represent it

2. Identify and extract features that explain that aspect of quality

3. Collect examples of translations with different levels of quality and annotate them

4. Learn a model to predict quality scores for new translations and evaluate it

[Diagram: Source text → MT system → Translation → Quality? Four groups of indicators feed the estimate: complexity indicators, confidence indicators, fluency indicators and adequacy indicators]

Features can be shallow or linguistically motivated

◦ (S/T/S-T) Sentence length
◦ (S/T) Language model
◦ (S/T) Token-type ratio
◦ (S) Readability metrics: Flesch, etc.
◦ (S) Average number of possible translations per word
◦ (S) % of n-grams belonging to different frequency quartiles of a source language corpus
◦ (T) Untranslated/OOV words
◦ (T) Mismatching brackets, quotation marks
◦ (S-T) Preservation of punctuation
◦ (S-T) Word alignment score
◦ etc.

These do well for estimating general quality with respect to post-editing needs, but are not enough for other aspects of quality…
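A few of the shallow features listed above can be sketched in a handful of lines. This is only an illustration under stated assumptions: the function name `shallow_features`, whitespace tokenization, the choice of brackets and punctuation marks, and computing the token-type feature as distinct tokens over total tokens are all hypothetical and not taken from the slides.

```python
def shallow_features(source: str, target: str) -> dict:
    """Compute a few shallow features for one (source, translation) pair.

    Assumption: both sentences are already tokenized and space-separated.
    """
    src_tokens = source.split()
    tgt_tokens = target.split()

    def distinct_token_ratio(tokens):
        # Distinct (lowercased) tokens over total tokens; 0.0 if empty.
        if not tokens:
            return 0.0
        return len({t.lower() for t in tokens}) / len(tokens)

    def unbalanced_brackets(text):
        # Sum of |#opening - #closing| over three bracket pairs.
        pairs = ("()", "[]", "{}")
        return sum(abs(text.count(o) - text.count(c)) for o, c in pairs)

    def punctuation_counts(text):
        return {mark: text.count(mark) for mark in ".,;:!?"}

    src_punct = punctuation_counts(source)
    tgt_punct = punctuation_counts(target)

    return {
        "src_length": len(src_tokens),                         # (S)
        "tgt_length": len(tgt_tokens),                         # (T)
        "src_distinct_ratio": distinct_token_ratio(src_tokens),
        "tgt_distinct_ratio": distinct_token_ratio(tgt_tokens),
        "tgt_unbalanced_brackets": unbalanced_brackets(target),  # (T)
        # (S-T) punctuation preservation: 0 means identical counts.
        "punct_diff": sum(abs(src_punct[m] - tgt_punct[m]) for m in src_punct),
    }


features = shallow_features("A casa ( azul ) .", "The blue house ( .")
```

Features such as language model scores or word alignment scores need external resources and are left out of this sketch.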

Count-based
◦ (S/T/S-T) Content/non-content words
◦ (S/T/S-T) Nouns/verbs/…, NP/VP/…
◦ (S/T/S-T) Deictics (references)
◦ (S/T/S-T) Discourse markers (references)
◦ (S/T/S-T) Named entities
◦ (S/T/S-T) Zero-subjects
◦ (S/T/S-T) Pronominal subjects
◦ (S/T/S-T) Negation indicators
◦ (T) Subject-verb / adjective-noun agreement
◦ (T) Language model of POS
◦ (T) Grammar checking (dangling words)
◦ (T) Coherence

Alignment-based
◦ (S-T) Correct translation of pronouns
◦ (S-T) Matching of dependency relations
◦ (S-T) Matching of named entities
◦ (S-T) Alignment of parse trees
◦ (S-T) Alignment of predicates & arguments
◦ etc.

Some features are language-dependent; others need language-dependent resources but apply to most languages, e.g. LM of POS tags

Count-based feature representation:
◦ Source/target only: count or proportion
◦ Contrastive features (S-T): very important, but not a simple matching of linguistic units
  Alignment may not be possible (e.g. clauses/phrases)
  Force same linguistic phenomena in S and T? E.g. Vs translated as Ns

How to model different linguistic phenomena?

S = linguistic unit in source; T = linguistic unit in target

$F = S - T$, $F = |S - T|$, $F = \frac{S - T}{S}$, $F = \frac{T}{S}$, …

Count-based feature representation:
◦ Monotonicity of features
◦ Sparsity: is 0-0 as good as 10-10?

Our representation: precision and recall
◦ Does not rely on alignment
◦ Upper bound = 1 (also holds for S,T=0)
◦ Lower bound = 0

$P_F = \frac{\min(S, T)}{T}$

$R_F = \frac{\min(S, T)}{S}$
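The precision and recall representation can be sketched as follows, with S and T taken as the counts of a linguistic unit in the source and in the target. The function name `precision_recall` is hypothetical. The slides state that the upper bound of 1 also holds when S and T are both 0; returning 0.0 when only the denominator is zero is an assumption made here, not something the slides specify.

```python
from typing import Tuple


def precision_recall(source_count: int, target_count: int) -> Tuple[float, float]:
    """Contrastive feature as (precision, recall) over unit counts.

    precision = min(S, T) / T, recall = min(S, T) / S.
    """
    matched = min(source_count, target_count)

    def ratio(numerator: int, denominator: int, other: int) -> float:
        if denominator == 0:
            # 0-0 counts as perfect agreement (value 1.0, as in the slides).
            # Assumption: a unit present on one side only scores 0.0.
            return 1.0 if other == 0 else 0.0
        return numerator / denominator

    precision = ratio(matched, target_count, source_count)
    recall = ratio(matched, source_count, target_count)
    return precision, recall


# 0-0 and 10-10 both give (1.0, 1.0); 4 source units vs 2 target units
# give full precision but only half recall.
examples = [precision_recall(0, 0), precision_recall(10, 10), precision_recall(4, 2)]
```

Both values stay within [0, 1] regardless of sentence length, and no alignment between source and target units is needed.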

S-T: (Pighin and Màrquez 2011): learn expected projection of SRL from source to target

S-T: (Xiong et al 2010)
◦ Target LM of words and POS tags, dangling words (link grammar parser), word posterior probabilities

S-T: (Bach et al 2011)
◦ Sequences of words and POS tags, context, dependency structures, alignment info

Fine-grained: need a lot of training data: 72K sentences, 2.2M words and their manual correction (!)

Estimating post-editing effort
◦ Human scores (1-4): how much post-editing effort?
  1: requires complete retranslation
  2: a lot of post-editing needed, but quicker than retranslation
  3: a little post-editing needed
  4: fit for purpose

Estimating adequacy
◦ Human scores (1-4): to what degree does the translation convey the meaning of the original text?
  1: completely inadequate
  2: poorly adequate
  3: fairly adequate
  4: highly adequate

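Machine learning algorithm: SVM for regression

Evaluation: Root Mean Square Error (RMSE)

$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (y_j - \hat{y}_j)^2}$

where $y_j$ is the human score and $\hat{y}_j$ the predicted score for sentence $j$, over $N$ test sentences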
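The RMSE used for evaluation can be computed directly from paired lists of human and predicted scores. A minimal sketch; the function name `rmse` and the example scores are hypothetical, not data from the experiments.

```python
import math


def rmse(human_scores, predicted_scores):
    """Root mean square error between human and predicted quality scores."""
    if not human_scores or len(human_scores) != len(predicted_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    squared_errors = [
        (y - y_hat) ** 2 for y, y_hat in zip(human_scores, predicted_scores)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical 1-4 scores for four sentences.
error = rmse([1, 2, 3, 4], [2, 2, 3, 3])
```

Lower values are better: an RMSE of 0.6 on a 1-4 scale means predictions are off by roughly 0.6 points on average, with larger errors penalized more heavily.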

English-Spanish Europarl data
◦ 4 SMT systems: 4 sets of 4,000 {source, translation, score} triples

Quality score: 1-4 post-editing effort

Features: 96 shallow versus 169 shallow + linguistic

Distribution of post-editing effort scores:

Score         MT1    MT2    MT3    MT4
1              4%     9%    10%    73%
2             25%    36%    39%    21%
3             54%    40%    43%     6%
4             17%    10%     9%     0%
Avg. quality  2.83   2.56   2.51   1.34
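As a worked example, an average quality figure can be approximated from a score distribution as a weighted mean. The sketch below uses the MT1 percentages and the hypothetical function name `average_score`. It yields 2.84, close to the reported 2.83; the small gap presumably comes from the percentages being rounded, which is an assumption.

```python
def average_score(distribution):
    """Weighted mean of scores, given {score: share} (shares in any unit)."""
    total = sum(distribution.values())
    if total == 0:
        raise ValueError("distribution must not be empty")
    return sum(score * share for score, share in distribution.items()) / total


# MT1 column of the table above: score -> percentage of sentences.
mt1 = {1: 4, 2: 25, 3: 54, 4: 17}
mt1_average = average_score(mt1)
```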

RMSE:

Languages  MT system  All features  No ling. features
en-es      MT1        0.600         0.574
en-es      MT2        0.682         0.671
en-es      MT3        0.671         0.654
en-es      MT4        0.541         0.534

Deviation of 17-22%

MT: The student still has claimed to take the exam at the end of the year - although she has not chosen course.

SRC: A estudante ainda tem pretensão de prestar vestibular no fim do ano – embora não tenha escolhido o curso

REF: The student still has the intention to take the exam at the end of the year – although she has not chosen the course.

Arabic-English Newswire data (GALE)
◦ 2 SMT systems (Rosetta team): 2 sets of 2,585 {source, translation, score} triples

Quality score: 1-4 adequacy

Features: 82 shallow versus 122 shallow + linguistic

Distribution of adequacy scores:

Score         MT1    MT2
1              2%    2.3%
2             20%    23%
3             45%    46%
4             33%    28.7%
Avg. quality  3.11   3

RMSE:

Languages  MT system  All features  No ling. features
ar-en      MT1        0.762         0.771
ar-en      MT2        0.756         0.737

Deviation of 14-26%

Best performing:
◦ Length (words, content-words, etc.)
  Absolute numbers are better than proportions
◦ Language model / corpus frequency
◦ Ambiguity of source words

Shallow features are better than linguistic features
◦ Except for one adequacy estimation system

Source/target features are better than contrastive features (shallow and linguistic)
◦ Absolute numbers are better than proportions

Issues:
◦ Feature representation
◦ Sparsity
◦ Need deeper features for adequacy estimation
◦ Annotation:
  1-4 post-editing effort: could be more objective
  1-4 adequacy: can we isolate adequacy from fluency?
◦ Language-dependency
◦ Reliability of resources
  Low quality translations
◦ Availability of resources

General vs specific errors

Bottom-up approach: word-based CE
◦ (Xiong et al 2010): word posterior probability, dangling words (link grammar parser), target words & POS patterns
◦ (Bach et al 2011): dependency relations, words and POS patterns, e.g. relate target words to patterns of POS tags in source
◦ (Bach et al 2011): best features are source-based

Top-down approach (on-going work)
◦ Corpus-based analysis: generalize errors in categories
◦ Portuguese-English
◦ 150 sentences (2 domains, 2 MT systems)
◦ RBMT: more systematic errors

Linguistic indicators         Europarl MT1  News MT1  Europarl MT2  News MT2
Inflectional error            72            40        63            40
Incorrect voice                2             6        13             6
Mistranslated pronoun         61            40        63            35
Missing pronoun               34            13        23             7
Incorrect subject-verb order   6            10        12             9

• ~700 errors / 150 sentences
• 42 error categories: a few rules per category…

It is possible to estimate the quality of MT systems with respect to post-editing needs using shallow, language- and system-independent features

Adequacy estimation is a harder problem
◦ Need more complex linguistic features…

Linguistic features are relevant:
◦ Directly useful for error detection (word-level CE)
◦ Directly useful for automatic post-editing
◦ But… for sentence-level CE:
  Issues with sparsity
  Issues with representation: length bias

Lucia [email protected]

Aziz, W., Rios, M. and Specia, L. 2011. Shallow Semantic Trees for SMT. WMT.

Denkowski, M. and Lavie, A. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. WMT.

Giménez, J. and Màrquez, L. 2010. Linguistic Measures for Automatic Machine Translation Evaluation. Machine Translation, Volume 24, Numbers 3-4.

Hardmeier, C. 2011. Improving Machine Translation Quality Prediction with Syntactic Tree Kernels. EAMT-2011.

Liu, D. and Gildea, D. 2010. Semantic role features for machine translation. 23rd International Conference on Computational Linguistics (COLING).

Pado, S., Galley, M., Jurafsky, D. and Manning, C. 2009. Robust Machine Translation Evaluation with Entailment Features. ACL.

Pighin, D. and Màrquez, L. 2011. Automatic Projection of Semantic Structures: an Application to Pairwise Translation Ranking. SSST-5.

Tatsumi, M. and Roturier, J. 2010. Source Text Characteristics and Technical and Temporal Post-Editing Effort: What is Their Relationship? 2nd JEC Workshop, 43-51.

Wu, D. and Fung, P. 2009. Semantic roles for SMT: a hybrid two-pass model. HLT/NAACL.

Xiong, D., Zhang, M. and Li, H. 2010. Error Detection for SMT Using Linguistic Features. ACL-2010.

Best features (Pearson’s correlation) (S3 en-es):

[Bar chart; features shown: CE, Aborted nodes, SMT score, Ratio scores, LM target, LM source, Bi-phrase prob, TM, Sent length, BAD 117, BAD 76; correlation axis from -0.6 to 1]

Filtering out bad translations: 1-2 (S3 en-es)
◦ Average human scores in the top n translations:

[Chart "Average scores x TOP N": average human score (axis from 2.5 to 3.7) in the top 100, 200, 300 and 500 translations; series: Human, CE, Aborted nodes, SMT score, Ratio scores, LM target, LM source, Bi-phrase prob, TM, Sent length, BAD 117, BAD 76]
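The "average human score in the top n translations" measure can be sketched as follows: rank the translations by some predicted score or feature value, keep the n highest, and average their human scores. The function name and the example values are hypothetical. Ranking in descending order assumes that higher predicted values mean better translations.

```python
def average_human_score_in_top_n(predicted, human, n):
    """Average human score of the n translations ranked highest by `predicted`."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    if not 0 < n <= len(predicted):
        raise ValueError("n must be between 1 and the number of translations")
    # Sort (predicted, human) pairs by predicted value, best first.
    ranked = sorted(zip(predicted, human), key=lambda pair: pair[0], reverse=True)
    return sum(human_score for _, human_score in ranked[:n]) / n


# Hypothetical predictions and 1-4 human scores for four translations.
predicted = [0.9, 0.1, 0.5, 0.7]
human = [4, 1, 2, 3]
top2 = average_human_score_in_top_n(predicted, human, 2)
```

A ranking that puts good translations first gives a top-n average well above the average over all translations.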

QE x MT metrics: Pearson’s correlation (S3 en-es)

[Bar chart; series: BLEU-4, BLEU-2, NIST, TER, Meteor exact, Meteor porter, CE; correlation axis from -0.4 to 1]

◦ QE score x MT metrics: Pearson’s correlation across MT systems:

Test set   Training set   Pearson QE and human
S3 en-es   S1 en-es       0.478
           S2 en-es       0.517
           S3 en-es       0.542
           S4 en-es       0.423
S2 en-es   S1 en-es       0.531
           S2 en-es       0.562
           S3 en-es       0.547
           S4 en-es       0.442
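Pearson's correlation between predicted and human scores can be computed with a few lines of plain Python. This is a minimal sketch: the function name `pearson` and the example scores are hypothetical and unrelated to the figures in the table.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical predicted QE scores and human 1-4 scores.
qe_scores = [2.1, 3.4, 1.8, 3.9, 2.7]
human_scores = [2, 3, 1, 4, 3]
r = pearson(qe_scores, human_scores)
```

Values near 1 mean the predictions rank and scale the translations much like the human annotators; values near 0 mean no linear relationship.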

SMT model global score and internal features

Distortion count, phrase probability, ...

% search nodes aborted, pruned, recombined …

Language model using n-best list as corpus

Distance to centre hypothesis in the n-best list

Relative frequency of the words in the translation in the n-best list

Ratio of SMT model score of the top translation to the sum of the scores of all hypotheses in the n-best list, …
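Two of the n-best-list features above can be sketched as follows. Several things here are assumptions rather than details from the slides: the function names, that the n-best list is sorted best first, that model scores are positive (for example probabilities rather than log-probabilities), and that word relative frequency is averaged over the tokens of the translation.

```python
from collections import Counter


def top_score_ratio(nbest_scores):
    """Score of the top hypothesis divided by the sum of all n-best scores.

    Assumption: scores are positive and the list is sorted best first.
    """
    total = sum(nbest_scores)
    if not nbest_scores or total == 0:
        raise ValueError("need a non-empty n-best list with a non-zero score sum")
    return nbest_scores[0] / total


def nbest_word_relative_frequency(translation, nbest):
    """Mean relative frequency, within the n-best list, of the translation's words.

    Assumption: hypotheses are space-tokenized strings.
    """
    counts = Counter(token for hyp in nbest for token in hyp.split())
    total = sum(counts.values())
    tokens = translation.split()
    if not tokens or total == 0:
        return 0.0
    return sum(counts[token] / total for token in tokens) / len(tokens)


# Hypothetical 3-best list with its model scores.
ratio = top_score_ratio([2.0, 1.0, 1.0])
frequency = nbest_word_relative_frequency("a b", ["a b", "a c"])
```

A ratio close to 1 suggests the decoder strongly prefers its top hypothesis, while words that recur across the n-best list suggest the system is consistent about them.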

Absolute numbers are better than proportions

RMSE:

Languages  MT system  All features  No ling. features  All features abs.
en-es      MT1        0.600         0.574              0.595
en-es      MT2        0.682         0.671              0.664
en-es      MT3        0.671         0.654              0.662
en-es      MT4        0.541         0.534              0.523
