“Poetry is what gets lost in translation.”
“Poetry is what gets lost in translation.”
Robert Frost, Poet (1874–1963)
Wrote the famous poem ‘Stopping by Woods on a Snowy Evening’, better known as ‘Miles to go before I sleep’.
Motivation
How do we judge a good translation?
Can a machine do this?
Why should a machine do this? Because humans take time!
Automatic evaluation of Machine Translation (MT): BLEU and its shortcomings
Presented by (as a part of CS 712):
Aditya Joshi, Kashyap Popat, Shubham Gautam
IIT Bombay
Guide: Prof. Pushpak Bhattacharyya, IIT Bombay

Outline
Evaluation
Formulating the BLEU metric
Understanding the BLEU formula
Shortcomings of BLEU
Shortcomings in the context of English-Hindi MT

References:
Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: an assessment. Machine Translation, Volume 8, pages 1–27. 1993.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. IBM Research Report RC22176 (W0109-022). IBM Research Division, Thomas J. Watson Research Center. 2001.
R. Ananthakrishnan, Pushpak Bhattacharyya, M. Sasikumar, and Ritesh M. Shah. Some Issues in Automatic Evaluation of English-Hindi MT: More Blues for BLEU. ICON 2007, Hyderabad, India, Jan 2007.

Part I: BLEU - The BLEU Score rises

Evaluation
Of NLP systems
Of MT systems

Evaluation in NLP: Precision/Recall
Precision: How many of the returned results were correct?
Recall: What portion of the correct results were returned?
Adapting precision/recall to NLP tasks
Evaluation in NLP: Precision/Recall
Document retrieval:

Precision = |documents relevant and retrieved| / |documents retrieved|
Recall = |documents relevant and retrieved| / |documents relevant|

Equivalently:
Precision = |true positives| / |true positives + false positives|
Recall = |true positives| / |true positives + false negatives|
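As a toy illustration of the two definitions above, precision and recall can be computed directly from true-positive, false-positive, and false-negative counts. The function names and the counts in the example are hypothetical, not from the slides:

```python
def precision(tp, fp):
    # Fraction of the returned results that were correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of the correct results that were returned.
    return tp / (tp + fn)

# Hypothetical retrieval run: 8 relevant documents retrieved,
# 2 irrelevant documents retrieved, 4 relevant documents missed.
print(precision(8, 2))  # 0.8
print(recall(8, 4))     # ≈ 0.667
```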
Evaluation in MT
Operational evaluation: Is MT system A operationally better than MT system B? Does MT system A cost less?
Typological evaluation: Have you ensured which linguistic phenomena the MT system covers?
Declarative evaluation: How does the quality of the output of system A fare with respect to that of B?

Operational evaluation
Cost-benefit is the focus
To establish cost-per-unit figures and use this as a basis for comparison
Essentially black-box
Realism requirement
For example, comparing sentiment analysis (SA) systems:
Word-based SA: cost of pre-processing (stemming, etc.), cost of learning a classifier
Sense-based SA: cost of pre-processing (stemming, etc.), cost of learning a classifier, cost of sense annotation

Typological evaluation
Use a test suite of examples
Ensure all relevant phenomena to be tested are covered
For example:
hari aankhon-wala ladka muskuraya
green eyes-with boy smiled
“The boy with green eyes smiled.”

Declarative evaluation
Assign scores to specific qualities of the output:
Intelligibility: how good the output is as a well-formed target-language entity
Accuracy: how good the output is in terms of preserving the content of the source text

For example, for “I am attending a lecture”:
Main ek vyaakhyan baitha hoon
I a lecture sit (present, first person)
“I sit a lecture”: accurate but not intelligible

Main vyakhyan hoon
I lecture am
“I am lecture”: intelligible but not accurate
Evaluation bottleneck
Typological evaluation is time-consuming.
Operational evaluation needs accurate modeling of cost-benefit.
Automatic MT evaluation: declarative
BLEU: Bilingual Evaluation Understudy
Deriving BLEU:
Incorporating precision
Incorporating recall

How is translation performance measured?
“The closer a machine translation is to a professional human translation, the better it is.”
This needs:
A corpus of good-quality human reference translations
A numerical translation-closeness metric

Preliminaries
Candidate translation(s): translation returned by an MT system
Reference translation(s): perfect translation by humans
Goal of BLEU: to correlate with human judgment.

Formulating BLEU (Step 1): Precision
Source: I had lunch now.

Candidate 1: maine ab khana khaya
I now food ate (“I ate food now”)
matching unigrams: 3, matching bigrams: 1

Candidate 2: maine abhi lunch ate.
I now lunch(OOV) ate(OOV) (“I ate lunch now”)
matching unigrams: 2, matching bigrams: 1

Reference 1: maine abhi khana khaya
I now food ate (“I ate food now”)
Reference 2: maine abhi bhojan kiyaa
I now meal did (“I did meal now”)

Unigram precision: Candidate 1: 3/4 = 0.75; Candidate 2: 2/4 = 0.5
Similarly, bigram precision: Candidate 1: 0.33; Candidate 2: 0.33

Precision: not good enough
Reference: mujh-par tera suroor chhaaya
me-on your spell cast (“Your spell was cast on me”)

Candidate 1: mere tera suroor chhaaya
my your spell cast (“Your spell cast my”)
matching unigrams: 3

Candidate 2: tera tera tera suroor
your your your spell
matching unigrams: 4

Unigram precision: Candidate 1: 3/4 = 0.75; Candidate 2: 4/4 = 1

Formulating BLEU (Step 2): Modified Precision
Clip the total count of each candidate word at its maximum reference count:
Count_clip(n-gram) = min(count, max_ref_count)
Reference: mujh-par tera suroor chhaaya
me-on your spell cast (“Your spell was cast on me”)

Candidate 2: tera tera tera suroor
your your your spell
matching unigrams: tera: min(3, 1) = 1; suroor: min(1, 1) = 1
Modified unigram precision: 2/4 = 0.5
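A minimal sketch of this clipping step (the function name is mine; a single reference is assumed here, whereas BLEU takes the maximum count over all references):

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    # Count_clip(w) = min(count of w in candidate, count of w in reference);
    # with multiple references, the reference count would be the maximum
    # over all references (max_ref_count).
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return clipped / len(candidate)

reference = "mujh-par tera suroor chhaaya".split()
candidate = "tera tera tera suroor".split()
print(clipped_unigram_precision(candidate, reference))  # 0.5
```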
Modified n-gram precision
For the entire test corpus, for a given n:

p_n = [ Σ_{C ∈ candidates} Σ_{n-gram ∈ C} Count_clip(n-gram) ] / [ Σ_{C ∈ candidates} Σ_{n-gram ∈ C} Count(n-gram) ]

Numerator: matching (clipped) n-grams in each candidate C; denominator: all n-grams in C; both summed over all candidates of the test corpus. Formula from Papineni et al. (2001).

Calculating modified n-gram precision (1/2)
127 source sentences were translated by two human translators and three MT systems.
Translated sentences were evaluated against professional reference translations using modified n-gram precision.

Calculating modified n-gram precision (2/2)
Precision decays with increasing n.
Comparative ranking of the five systems.

How do we combine precision for different values of n?
Graph from Papineni et al. (2001).

Formulation of BLEU: Recap
Precision cannot be used as is.
Modified precision considers clipped word counts.
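The corpus-level modified n-gram precision can be sketched as below (a minimal illustration under the definitions above; the function names are mine):

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidates, references_list, n):
    # p_n = (sum of clipped matching n-grams over all candidates)
    #       / (sum of all n-grams over all candidates).
    matched, total = 0, 0
    for cand, refs in zip(candidates, references_list):
        cand_counts = Counter(ngrams(cand, n))
        max_ref = Counter()  # max_ref_count per n-gram over all references
        for ref in refs:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        matched += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total += sum(cand_counts.values())
    return matched / total

# The earlier lunch example: one candidate, two references.
cands = ["maine ab khana khaya".split()]
refs = [["maine abhi khana khaya".split(), "maine abhi bhojan kiyaa".split()]]
print(modified_precision(cands, refs, 1))  # 0.75
```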
Recall for MT (1/2): candidates shorter than references
Reference: kya blue lambe vaakya ki guNvatta ko samajh paaega?
will blue long sentence-of quality (case-marker) understand able (third person, male, singular)
“Will blue be able to understand the quality of long sentences?”

Candidate: lambe vaakya
long sentence
Modified unigram precision: 2/2 = 1
Modified bigram precision: 1/1 = 1
Recall for MT (2/2): candidates longer than references
Reference 1: maine khaana khaaya
I food ate (“I ate food”)
Reference 2: maine bhojan kiyaa
I meal did (“I had a meal”)

Candidate 1: maine khaana bhojan kiya
I food meal did (“I had food meal”)
Modified unigram precision: 1

Candidate 2: maine khaana khaaya
I food ate (“I ate food”)
Modified unigram precision: 1
Formulating BLEU (Step 3): Incorporating recall
Sentence length indicates the best match.
Brevity penalty (BP): a multiplicative factor.
Candidate translations that match reference translations in length must be ranked higher.
Which should rank higher: Candidate 1 or Candidate 2?
Formulating BLEU (Step 3): Brevity Penalty

BP = e^(1 - x), where x = r / c
r: reference sentence length
c: candidate sentence length
Graph drawn using www.fooplot.com

BP = 1 for c > r. Why?
BP leaves out longer translations. Why? Translations longer than the reference are already penalized by modified precision.
Validating the claim:
Formula from Papineni et al. (2001).

BLEU score
Precision → modified n-gram precision
Recall → brevity penalty

BLEU = BP · exp( Σ_{n=1}^{N} w_n log p_n )

Formula from Papineni et al. (2001).

Understanding BLEU
Dissecting the formula
Using it: a news headline example

Decay in precision
Why log p_n? To accommodate the decay in precision values.
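Putting the pieces together, a sketch of the sentence-level score under these definitions (the function name is mine; this unsmoothed form returns 0 when any p_n is 0, since log 0 is undefined):

```python
import math

def bleu_score(precisions, c, r):
    # BLEU = BP * exp(sum_n w_n * log p_n), with uniform weights w_n = 1/N.
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; the score collapses to 0
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    n = len(precisions)
    return bp * math.exp(sum(math.log(p) for p in precisions) / n)

# Perfect match: all precisions 1 and equal lengths give BLEU = 1.
print(bleu_score([1.0, 1.0], c=5, r=5))  # 1.0
```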
Graph from Papineni et al. (2001).

Dissecting the formula
Claim: BLEU should lie between 0 and 1.
Reason: to intuitively satisfy “1 implies perfect translation”.
Understanding the constituents of the formula to validate the claim:
Brevity penalty
Modified precision
w_n, set to 1/N
Formula from Papineni et al. (2001).

Validation of the range of BLEU
p_n: between 0 and 1
log p_n: between -∞ and 0
A = Σ w_n log p_n: between -∞ and 0
e^A: between 0 and 1
BP: between 0 and 1
Hence BLEU = BP · e^A lies between 0 and 1.
Graph drawn using www.fooplot.com
Calculating BLEU: an actual example
Reference: bhaavanaatmak ekikaran laane ke liye ek tyohaar
emotional unity bring-to a festival (“A festival to bring emotional unity”)

C1: laane-ke-liye ek tyoaar bhaavanaatmak ekikaran
bring-to a festival emotional unity (invalid Hindi ordering)

C2: ko ek utsav ke baare mein laana bhaavanaatmak ekikaran
for a festival-about to-bring emotional unity (invalid Hindi ordering)

Modified n-gram precision:

C1 (r = 7, c = 7):
n | matching n-grams | total n-grams | modified n-gram precision
1 | 7 | 7 | 1
2 | 5 | 6 | 0.83

C2 (r = 7, c = 9):
n | matching n-grams | total n-grams | modified n-gram precision
1 | 4 | 9 | 0.44
2 | 1 | 8 | 0.125

Calculating the BLEU score:
w_n = 1/N = 1/2; r = 7

C1: c = 7, BP = exp(1 - 7/7) = 1
n | w_n | p_n | log p_n | w_n · log p_n
1 | 1/2 | 1 | 0 | 0
2 | 1/2 | 0.83 | -0.186 | -0.093
Total: -0.093; BLEU = 1 · e^(-0.093) = 0.911

C2: c = 9, BP = 1 (since c > r)
n | w_n | p_n | log p_n | w_n · log p_n
1 | 1/2 | 0.44 | -0.821 | -0.4105
2 | 1/2 | 0.125 | -2.07 | -1.035
Total: -1.445; BLEU = 1 · e^(-1.445) = 0.235
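The two scores can be checked numerically. Using exact fractions, the third decimal differs slightly from the slides' values of 0.911 and 0.235, which come from rounded intermediate logarithms:

```python
import math

# w_n = 1/2 for n = 1, 2; BP = 1 in both cases (c >= r).
bleu_c1 = math.exp(0.5 * (math.log(7/7) + math.log(5/6)))
bleu_c2 = math.exp(0.5 * (math.log(4/9) + math.log(1/8)))
print(round(bleu_c1, 3))  # 0.913
print(round(bleu_c2, 3))  # 0.236
```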
Hence, the BLEU scores:
C1: 0.911
C2: 0.235

BLEU v/s human judgment
Source language: Chinese; target language: English.
Setup:
Five systems perform the translation:
3 automatic MT systems
2 human translators
BLEU scores obtained for each system.
Human judgment (on a scale of 5) obtained for each system:
Group 1: ten monolingual speakers of the target language (English)
Group 2: ten bilingual speakers of Chinese and English
BLEU v/s human judgment
Monolingual speakers: correlation coefficient 0.99
Bilingual speakers: correlation coefficient 0.96
BLEU and human