"Poetry is what gets lost in translation."
Robert Frost, Poet (1874–1963)
Wrote the famous poem 'Stopping by Woods on a Snowy Evening', better known as 'Miles to go before I sleep'.

Motivation
How do we judge a good translation?
Can a machine do this?
Why should a machine do this? Because humans take time!

Automatic evaluation of Machine Translation (MT): BLEU and its shortcomings
(As a part of CS 712)

Presented by: Aditya Joshi, Kashyap Popat, Shubham Gautam
Guide: Prof. Pushpak Bhattacharyya
IIT Bombay

Outline
Evaluation
Formulating the BLEU metric
Understanding the BLEU formula
Shortcomings of BLEU
Shortcomings in the context of English-Hindi MT

References
[1] Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27, 1993.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, 2001.
[3] R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, January 2007.

Evaluation [1]
Of NLP systems
Of MT systems

Evaluation in NLP: Precision/Recall
Precision: How many of the returned results were correct?
Recall: What portion of the correct results were returned?

Adapting precision/recall to NLP tasks:

Evaluation in NLP: Precision/Recall in Document Retrieval

Precision = |Documents relevant and retrieved| / |Documents retrieved|
Recall = |Documents relevant and retrieved| / |Documents relevant|

In general:

Precision = |True Positives| / |True Positives + False Positives|
Recall = |True Positives| / |True Positives + False Negatives|
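These two definitions can be sketched in code; the document IDs below are hypothetical, not from the slides:

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved documents that are relevant.
    # Recall: fraction of relevant documents that were retrieved.
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d5"}
print(precision_recall(retrieved, relevant))  # precision 0.5, recall 2/3
```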

Evaluation in MT [1]
Operational evaluation: Is MT system A operationally better than MT system B? Does MT system A cost less?
Typological evaluation: Does the MT system cover all the relevant linguistic phenomena?
Declarative evaluation: How does the quality of the output of system A fare with respect to that of B?

Operational evaluation
Cost-benefit is the focus: establish cost-per-unit figures and use these as a basis for comparison. The evaluation is essentially black-box, with a realism requirement. For example, in sentiment analysis (SA):
Word-based SA: cost of pre-processing (stemming, etc.) and cost of learning a classifier
Sense-based SA: cost of pre-processing (stemming, etc.), cost of learning a classifier, and cost of sense annotation

Typological evaluation
Use a test suite of examples; ensure that all relevant phenomena to be tested are covered, including language-specific phenomena. For example:
hari aankhon-wala ladka muskuraya
(green eyes-with boy smiled) "The boy with green eyes smiled."

Declarative evaluation
Assign scores to specific qualities of the output:
Intelligibility: how good the output is as a well-formed target-language entity
Accuracy: how good the output is at preserving the content of the source text
For example, for "I am attending a lecture":
Main ek vyaakhyan baitha hoon (I a lecture sit, present first-person) "I sit a lecture": accurate but not intelligible
Main vyakhyan hoon (I lecture am) "I am lecture": intelligible but not accurate

Evaluation bottleneck
Typological evaluation is time-consuming, and operational evaluation needs accurate modeling of cost-benefit. Hence the case for automatic declarative evaluation.

Automatic MT evaluation: Declarative
BLEU: Bilingual Evaluation Understudy

Deriving BLEU [2]
Incorporating precision
Incorporating recall

How is translation performance measured? "The closer a machine translation is to a professional human translation, the better it is." This requires:
1. A corpus of good-quality human reference translations
2. A numerical translation-closeness metric

Preliminaries
Candidate translation(s): translation(s) returned by an MT system
Reference translation(s): perfect translation(s) by humans

Goal of BLEU: to correlate with human judgment.

Formulating BLEU (Step 1): Precision
Source: I had lunch now.
Reference 1: maine abhi khana khaya (I now food ate) "I ate food now."
Reference 2: maine abhi bhojan kiyaa (I now meal did) "I did meal now."
Candidate 1: maine ab khana khaya (I now food ate) matching unigrams: 3, matching bigrams: 1
Candidate 2: maine abhi lunch ate (I now lunch ate; 'lunch' and 'ate' are out of vocabulary) matching unigrams: 2, matching bigrams: 1
Unigram precision: Candidate 1: 3/4 = 0.75; Candidate 2: 2/4 = 0.5
Similarly, bigram precision: Candidate 1: 1/3 = 0.33; Candidate 2: 1/3 = 0.33

Precision: not good enough
Reference: mujh-par tera suroor chhaaya (me-on your spell cast) "Your spell was cast on me."
Candidate 1: mere tera suroor chhaaya (my your spell cast) matching unigrams: 3
Candidate 2: tera tera tera suroor (your your your spell) matching unigrams: 4
Unigram precision: Candidate 1: 3/4 = 0.75; Candidate 2: 4/4 = 1
Candidate 2 is clearly the worse translation, yet plain precision scores it higher.

Formulating BLEU (Step 2): Modified precision
Clip the total count of each candidate word by its maximum count in any single reference:

Count_clip(n-gram) = min(Count, Max_Ref_Count)

Reference: mujh-par tera suroor chhaaya (me-on your spell cast) "Your spell was cast on me."
Candidate 2: tera tera tera suroor (your your your spell)
Matching unigrams: 'tera': min(3, 1) = 1; 'suroor': min(1, 1) = 1
Modified unigram precision: 2/4 = 0.5

Modified n-gram precision
For the entire test corpus, for a given n (formula from [2]):

p_n = [ Σ_{C ∈ Candidates} Σ_{n-gram ∈ C} Count_clip(n-gram) ] / [ Σ_{C ∈ Candidates} Σ_{n-gram ∈ C} Count(n-gram) ]

i.e., the clipped matching n-grams over all candidates of the test corpus, divided by all n-grams over all candidates.

Calculating modified n-gram precision (1/2)
127 source sentences were translated by two human translators and three MT systems. The translated sentences were evaluated against professional reference translations using modified n-gram precision.

Calculating modified n-gram precision (2/2)
Observations (graph from [2]): precision decays with increasing n, and the five systems receive a comparative ranking.
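The clipping step can be sketched as follows, using the 'tera tera tera suroor' example from the slides (a minimal single-sentence sketch, not the corpus-level sum):

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    # Count n-grams in the candidate, clipping each n-gram's count by
    # its maximum count in any single reference.
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate)
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

ref = "mujh-par tera suroor chhaaya".split()
cand = "tera tera tera suroor".split()
print(modified_precision(cand, [ref]))  # 0.5: min(3,1) for 'tera' + min(1,1) for 'suroor', over 4
```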

How do we combine the precision values for different values of n?
(Graph from [2])

Formulation of BLEU: recap
Precision cannot be used as is; modified precision considers clipped word counts.

Recall for MT (1/2): candidates shorter than references
Reference: kya BLEU lambe vaakya ki guNvatta ko samajh paaega?
(will BLEU long sentence-of quality (case-marker) understand able (third-person male singular))
"Will BLEU be able to understand the quality of long sentences?"
Candidate: lambe vaakya (long sentence) "long sentence"
Modified unigram precision: 2/2 = 1; modified bigram precision: 1/1 = 1
A very short candidate still receives a perfect precision score.

Recall for MT (2/2): candidates longer than references
Reference 1: maine khaana khaaya (I food ate) "I ate food."
Reference 2: maine bhojan kiyaa (I meal did) "I had a meal."
Candidate 1: maine khaana bhojan kiya (I food meal did) "I had food meal." Modified unigram precision: 1
Candidate 2: maine khaana khaaya (I food ate) "I ate food." Modified unigram precision: 1

Formulating BLEU (Step 3): Incorporating recall
Sentence length indicates the best match: candidate translations that match reference translations in length must be ranked higher. Of Candidate 1 and Candidate 2 above, Candidate 2 should score higher. The brevity penalty (BP) is a multiplicative factor that captures this.

Formulating BLEU (Step 3): Brevity penalty
With r the reference sentence length and c the candidate sentence length (formula from [2]):

BP = 1, if c > r
BP = e^(1 − r/c), if c ≤ r

(Graph of e^(1 − x) with x = r/c drawn using www.fooplot.com.)

Why is BP = 1 for c > r, leaving longer translations unpenalized? Because translations longer than the reference are already penalized by modified precision. Validating the claim: a longer candidate necessarily contains extra n-grams that find no match in the reference, which lowers p_n.
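A small sketch of the brevity penalty as defined above:

```python
import math

def brevity_penalty(r, c):
    # r: reference length, c: candidate length.
    # No penalty when the candidate is longer than the reference (c > r);
    # shorter candidates are penalized exponentially via exp(1 - r/c).
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

print(brevity_penalty(r=7, c=9))  # 1.0: longer candidate, no penalty
print(brevity_penalty(r=7, c=7))  # 1.0: equal length, exp(0)
print(round(brevity_penalty(r=7, c=2), 3))  # a much shorter candidate is heavily penalized
```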

BLEU score
Precision becomes modified n-gram precision; recall becomes the brevity penalty. Combining them (formula from [2]):

BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n )

Understanding BLEU
Dissecting the formula
Using it: a news-headline example

Decay in precision: why log p_n? To accommodate the decay in precision values with increasing n (graph from [2]).

Dissecting the formula
Claim: BLEU should lie between 0 and 1, so that a score of 1 intuitively implies a perfect translation. Understanding the constituents of the formula validates the claim:
Brevity penalty BP: between 0 and 1
Modified precision p_n: between 0 and 1
w_n: set to 1/N

Validation of the range of BLEU:
p_n: between 0 and 1
log p_n: between −∞ and 0
A = Σ w_n log p_n: between −∞ and 0
e^A: between 0 and 1
BP: between 0 and 1
Hence BLEU = BP · e^A lies between 0 and 1. (Graphs drawn using www.fooplot.com.)
Calculating BLEU: an actual example
Ref: bhaavanaatmak ekikaran laane ke liye ek tyohaar
(emotional unity bring-to a festival) "A festival to bring emotional unity"

C1: laane-ke-liye ek tyohaar bhaavanaatmak ekikaran
(bring-to a festival emotional unity) (invalid Hindi ordering)

C2: ko ek utsav ke baare mein laana bhaavanaatmak ekikaran
(for a festival-about to-bring emotional unity) (invalid Hindi ordering)

Modified n-gram precision, with reference length r = 7:

C1 (c = 7):
n | #Matching n-grams | #Total n-grams | Modified n-gram precision
1 | 7                 | 7              | 1
2 | 5                 | 6              | 0.83

C2 (c = 9):
n | #Matching n-grams | #Total n-grams | Modified n-gram precision
1 | 4                 | 9              | 0.44
2 | 1                 | 8              | 0.125

Calculating the BLEU score, with w_n = 1/N = 1/2:

C1: c = 7, so BP = exp(1 − 7/7) = 1
n | w_n | p_n  | log p_n | w_n · log p_n
1 | 1/2 | 1    |  0      |  0
2 | 1/2 | 0.83 | −0.186  | −0.093
Total: −0.093; BLEU = 1 · e^(−0.093) = 0.911

C2: c = 9 > r, so BP = 1
n | w_n | p_n   | log p_n | w_n · log p_n
1 | 1/2 | 0.44  | −0.821  | −0.4105
2 | 1/2 | 0.125 | −2.07   | −1.035
Total: −1.445; BLEU = 1 · e^(−1.445) = 0.235
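As a sanity check, the calculation above can be scripted. This minimal sketch uses the exact precisions 7/7, 5/6 and 4/9, 1/8, so the results differ from the slide's rounded values in the third decimal:

```python
import math

def bleu(p_ns, r, c):
    # BLEU = BP * exp(sum of w_n * log p_n), with uniform weights w_n = 1/N.
    # Brevity penalty: no penalty for candidates longer than the reference.
    bp = 1.0 if c > r else math.exp(1 - r / c)
    n = len(p_ns)
    return bp * math.exp(sum(math.log(p) for p in p_ns) / n)

# C1: p1 = 7/7, p2 = 5/6, r = c = 7
print(round(bleu([7/7, 5/6], r=7, c=7), 3))  # 0.913 (slide, with rounded p_n: 0.911)
# C2: p1 = 4/9, p2 = 1/8, r = 7, c = 9
print(round(bleu([4/9, 1/8], r=7, c=9), 3))  # 0.236 (slide: 0.235)
```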

Hence, the BLEU scores: C1: 0.911, C2: 0.235.

BLEU v/s human judgment [2]
Setup:
Source language: Chinese; target language: English
Five systems perform translation: 3 automatic MT systems and 2 human translators
BLEU scores obtained for each system
Human judgment (on a scale of 5) obtained for each system from two groups:
Group 1: ten monolingual speakers of the target language (English)
Group 2: ten bilingual speakers of Chinese and English

Results:
Monolingual speakers: correlation coefficient 0.99
Bilingual speakers: correlation coefficient 0.96
BLEU and human judgments are thus highly correlated.
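The 0.99/0.96 figures are Pearson correlation coefficients between per-system BLEU scores and averaged human judgments. A minimal sketch of the computation, with entirely hypothetical scores (the numbers below are illustrative, not from [2]):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-system BLEU scores and average human ratings (scale of 5)
bleu_scores = [0.10, 0.15, 0.20, 0.35, 0.40]
human_ratings = [2.0, 2.4, 2.9, 4.1, 4.5]
print(pearson(bleu_scores, human_ratings))
```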

