Machine Learning for
Machine Translation
Pushpak Bhattacharyya
CSE Dept
IIT Bombay
ISI Kolkata
Jan 6 2014
Jan 6 2014 ISI Kolkata ML for MT 1
NLP Trinity
Algorithm
Problem
Language Hindi
Marathi
English
French Morph Analysis
Part of Speech Tagging
Parsing
Semantics
CRF
HMM
MEMM
NLP Trinity
Jan 6 2014 ISI Kolkata ML for MT 2
NLP Layer
Morphology
POS tagging
Chunking
Parsing
Semantics Extraction
Discourse and Corefernce
Increased Complexity Of Processing
Jan 6 2014 ISI Kolkata ML for MT 3
Motivation for MT
MT NLP Complete
NLP AI complete
AI CS complete
How will the world be different when the language barrier disappears
Volume of text required to be translated currently exceeds translatorsrsquo capacity (demand gt supply)
Solution automation
Jan 6 2014 ISI Kolkata ML for MT 4
Taxonomy of MT systems
MT Approaches
Knowledge Based Rule Based MT
Data driven Machine Learning Based
Example Based MT (EBMT)
Statistical MT
Interlingua Based Transfer Based
Jan 6 2014 ISI Kolkata ML for MT 5
Why is MT difficult
Language divergence
Jan 6 2014 ISI Kolkata ML for MT 6
Why is MT difficult Language Divergence
One of the main complexities of MT Language Divergence
Languages have different ways of expressing meaning
Lexico-Semantic Divergence
Structural Divergence
Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)
Jan 6 2014 ISI Kolkata ML for MT 7
Languages differ in expressing thoughts Agglutination
Finnish ldquoistahtaisinkohanrdquo
English I wonder if I should sit down for a whileldquo
Analysis
ist + sit verb stem
ahta + verb derivation morpheme to do something for a while
isi + conditional affix
n + 1st person singular suffix
ko + question particle
han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)
Jan 6 2014 ISI Kolkata ML for MT 8
Language Divergence Theory Lexico-
Semantic Divergences (few examples)
Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan
Categorial divergence
Change is in POS category
The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)
Jan 6 2014 ISI Kolkata ML for MT 9
Language Divergence Theory Structural Divergences
SVOSOV E Peter plays basketball H piitar basketball kheltaa haai
Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime
Minister)
Jan 6 2014 ISI Kolkata ML for MT 10
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
NLP Trinity
Algorithm
Problem
Language Hindi
Marathi
English
French Morph Analysis
Part of Speech Tagging
Parsing
Semantics
CRF
HMM
MEMM
NLP Trinity
Jan 6 2014 ISI Kolkata ML for MT 2
NLP Layer
Morphology
POS tagging
Chunking
Parsing
Semantics Extraction
Discourse and Corefernce
Increased Complexity Of Processing
Jan 6 2014 ISI Kolkata ML for MT 3
Motivation for MT
MT NLP Complete
NLP AI complete
AI CS complete
How will the world be different when the language barrier disappears
Volume of text required to be translated currently exceeds translatorsrsquo capacity (demand gt supply)
Solution automation
Jan 6 2014 ISI Kolkata ML for MT 4
Taxonomy of MT systems
MT Approaches
Knowledge Based Rule Based MT
Data driven Machine Learning Based
Example Based MT (EBMT)
Statistical MT
Interlingua Based Transfer Based
Jan 6 2014 ISI Kolkata ML for MT 5
Why is MT difficult
Language divergence
Jan 6 2014 ISI Kolkata ML for MT 6
Why is MT difficult Language Divergence
One of the main complexities of MT Language Divergence
Languages have different ways of expressing meaning
Lexico-Semantic Divergence
Structural Divergence
Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)
Jan 6 2014 ISI Kolkata ML for MT 7
Languages differ in expressing thoughts Agglutination
Finnish ldquoistahtaisinkohanrdquo
English I wonder if I should sit down for a whileldquo
Analysis
ist + sit verb stem
ahta + verb derivation morpheme to do something for a while
isi + conditional affix
n + 1st person singular suffix
ko + question particle
han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)
Jan 6 2014 ISI Kolkata ML for MT 8
Language Divergence Theory Lexico-
Semantic Divergences (few examples)
Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan
Categorial divergence
Change is in POS category
The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)
Jan 6 2014 ISI Kolkata ML for MT 9
Language Divergence Theory Structural Divergences
SVOSOV E Peter plays basketball H piitar basketball kheltaa haai
Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime
Minister)
Jan 6 2014 ISI Kolkata ML for MT 10
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
NLP Layer
Morphology
POS tagging
Chunking
Parsing
Semantics Extraction
Discourse and Corefernce
Increased Complexity Of Processing
Jan 6 2014 ISI Kolkata ML for MT 3
Motivation for MT
MT NLP Complete
NLP AI complete
AI CS complete
How will the world be different when the language barrier disappears
Volume of text required to be translated currently exceeds translatorsrsquo capacity (demand gt supply)
Solution automation
Jan 6 2014 ISI Kolkata ML for MT 4
Taxonomy of MT systems
MT Approaches
Knowledge Based Rule Based MT
Data driven Machine Learning Based
Example Based MT (EBMT)
Statistical MT
Interlingua Based Transfer Based
Jan 6 2014 ISI Kolkata ML for MT 5
Why is MT difficult
Language divergence
Jan 6 2014 ISI Kolkata ML for MT 6
Why is MT difficult Language Divergence
One of the main complexities of MT Language Divergence
Languages have different ways of expressing meaning
Lexico-Semantic Divergence
Structural Divergence
Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)
Jan 6 2014 ISI Kolkata ML for MT 7
Languages differ in expressing thoughts Agglutination
Finnish ldquoistahtaisinkohanrdquo
English I wonder if I should sit down for a whileldquo
Analysis
ist + sit verb stem
ahta + verb derivation morpheme to do something for a while
isi + conditional affix
n + 1st person singular suffix
ko + question particle
han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)
Jan 6 2014 ISI Kolkata ML for MT 8
Language Divergence Theory Lexico-
Semantic Divergences (few examples)
Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan
Categorial divergence
Change is in POS category
The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)
Jan 6 2014 ISI Kolkata ML for MT 9
Language Divergence Theory Structural Divergences
SVOSOV E Peter plays basketball H piitar basketball kheltaa haai
Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime
Minister)
Jan 6 2014 ISI Kolkata ML for MT 10
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Motivation for MT
MT NLP Complete
NLP AI complete
AI CS complete
How will the world be different when the language barrier disappears
Volume of text required to be translated currently exceeds translatorsrsquo capacity (demand gt supply)
Solution automation
Jan 6 2014 ISI Kolkata ML for MT 4
Taxonomy of MT systems
MT Approaches
Knowledge Based Rule Based MT
Data driven Machine Learning Based
Example Based MT (EBMT)
Statistical MT
Interlingua Based Transfer Based
Jan 6 2014 ISI Kolkata ML for MT 5
Why is MT difficult
Language divergence
Jan 6 2014 ISI Kolkata ML for MT 6
Why is MT difficult Language Divergence
One of the main complexities of MT Language Divergence
Languages have different ways of expressing meaning
Lexico-Semantic Divergence
Structural Divergence
Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)
Jan 6 2014 ISI Kolkata ML for MT 7
Languages differ in expressing thoughts Agglutination
Finnish ldquoistahtaisinkohanrdquo
English I wonder if I should sit down for a whileldquo
Analysis
ist + sit verb stem
ahta + verb derivation morpheme to do something for a while
isi + conditional affix
n + 1st person singular suffix
ko + question particle
han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)
Jan 6 2014 ISI Kolkata ML for MT 8
Language Divergence Theory Lexico-
Semantic Divergences (few examples)
Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan
Categorial divergence
Change is in POS category
The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)
Jan 6 2014 ISI Kolkata ML for MT 9
Language Divergence Theory Structural Divergences
SVOSOV E Peter plays basketball H piitar basketball kheltaa haai
Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime
Minister)
Jan 6 2014 ISI Kolkata ML for MT 10
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Taxonomy of MT systems
MT Approaches
Knowledge Based Rule Based MT
Data driven Machine Learning Based
Example Based MT (EBMT)
Statistical MT
Interlingua Based Transfer Based
Jan 6 2014 ISI Kolkata ML for MT 5
Why is MT difficult
Language divergence
Jan 6 2014 ISI Kolkata ML for MT 6
Why is MT difficult Language Divergence
One of the main complexities of MT Language Divergence
Languages have different ways of expressing meaning
Lexico-Semantic Divergence
Structural Divergence
Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)
Jan 6 2014 ISI Kolkata ML for MT 7
Languages differ in expressing thoughts Agglutination
Finnish ldquoistahtaisinkohanrdquo
English I wonder if I should sit down for a whileldquo
Analysis
ist + sit verb stem
ahta + verb derivation morpheme to do something for a while
isi + conditional affix
n + 1st person singular suffix
ko + question particle
han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)
Jan 6 2014 ISI Kolkata ML for MT 8
Language Divergence Theory Lexico-
Semantic Divergences (few examples)
Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan
Categorial divergence
Change is in POS category
The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)
Jan 6 2014 ISI Kolkata ML for MT 9
Language Divergence Theory Structural Divergences
SVOSOV E Peter plays basketball H piitar basketball kheltaa haai
Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime
Minister)
Jan 6 2014 ISI Kolkata ML for MT 10
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Why is MT difficult
Language divergence
Jan 6 2014 ISI Kolkata ML for MT 6
Why is MT difficult Language Divergence
One of the main complexities of MT Language Divergence
Languages have different ways of expressing meaning
Lexico-Semantic Divergence
Structural Divergence
Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)
Jan 6 2014 ISI Kolkata ML for MT 7
Languages differ in expressing thoughts Agglutination
Finnish ldquoistahtaisinkohanrdquo
English I wonder if I should sit down for a whileldquo
Analysis
ist + sit verb stem
ahta + verb derivation morpheme to do something for a while
isi + conditional affix
n + 1st person singular suffix
ko + question particle
han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)
Jan 6 2014 ISI Kolkata ML for MT 8
Language Divergence Theory Lexico-
Semantic Divergences (few examples)
Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan
Categorial divergence
Change is in POS category
The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)
Jan 6 2014 ISI Kolkata ML for MT 9
Language Divergence Theory Structural Divergences
SVOSOV E Peter plays basketball H piitar basketball kheltaa haai
Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime
Minister)
Jan 6 2014 ISI Kolkata ML for MT 10
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Why is MT difficult Language Divergence
One of the main complexities of MT Language Divergence
Languages have different ways of expressing meaning
Lexico-Semantic Divergence
Structural Divergence
Our work on English-IL Language Divergence with illustrations from Hindi (Dave Parikh Bhattacharyya Journal of MT 2002)
Jan 6 2014 ISI Kolkata ML for MT 7
Languages differ in expressing thoughts Agglutination
Finnish ldquoistahtaisinkohanrdquo
English I wonder if I should sit down for a whileldquo
Analysis
ist + sit verb stem
ahta + verb derivation morpheme to do something for a while
isi + conditional affix
n + 1st person singular suffix
ko + question particle
han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)
Jan 6 2014 ISI Kolkata ML for MT 8
Language Divergence Theory Lexico-
Semantic Divergences (few examples)
Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan
Categorial divergence
Change is in POS category
The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)
Jan 6 2014 ISI Kolkata ML for MT 9
Language Divergence Theory Structural Divergences
SVOSOV E Peter plays basketball H piitar basketball kheltaa haai
Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime
Minister)
Jan 6 2014 ISI Kolkata ML for MT 10
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Languages differ in expressing thoughts Agglutination
Finnish ldquoistahtaisinkohanrdquo
English I wonder if I should sit down for a whileldquo
Analysis
ist + sit verb stem
ahta + verb derivation morpheme to do something for a while
isi + conditional affix
n + 1st person singular suffix
ko + question particle
han a particle for things like reminder (with declaratives) or softening (with questions and imperatives)
Jan 6 2014 ISI Kolkata ML for MT 8
Language Divergence Theory Lexico-
Semantic Divergences (few examples)
Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan
Categorial divergence
Change is in POS category
The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)
Jan 6 2014 ISI Kolkata ML for MT 9
Language Divergence Theory Structural Divergences
SVOSOV E Peter plays basketball H piitar basketball kheltaa haai
Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime
Minister)
Jan 6 2014 ISI Kolkata ML for MT 10
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Language Divergence Theory Lexico-
Semantic Divergences (few examples)
Conflational divergence F vomir E to be sick E stab H chure se maaranaa (knife-with hit) S Utrymningsplan E escape plan
Categorial divergence
Change is in POS category
The play is on_PREP (vs The play is Sunday) Khel chal_rahaa_haai_VM (vs khel ravivaar ko haai)
Jan 6 2014 ISI Kolkata ML for MT 9
Language Divergence Theory Structural Divergences
SVOSOV E Peter plays basketball H piitar basketball kheltaa haai
Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime
Minister)
Jan 6 2014 ISI Kolkata ML for MT 10
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Language Divergence Theory Structural Divergences
SVOSOV E Peter plays basketball H piitar basketball kheltaa haai
Head swapping divergence E Prime Minister of India H bhaarat ke pradhaan mantrii (India-of Prime
Minister)
Jan 6 2014 ISI Kolkata ML for MT 10
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Language Divergence Theory Syntactic
Divergences (few examples)
Constituent Order divergence
E Singh the PM of India will address the nation today
H bhaarat ke pradhaan mantrii singh hellip (India-of PM Singhhellip)
Adjunction Divergence
E She will visit here in the summer
H vah yahaa garmii meM aayegii (she here summer-in will come)
Preposition-Stranding divergence
E Who do you want to go with
H kisake saath aap jaanaa chaahate ho (who withhellip)
Jan 6 2014 ISI Kolkata ML for MT 11
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Vauquois Triangle
Jan 6 2014 ISI Kolkata ML for MT 12
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Kinds of MT Systems (point of entry from source to the target text)
Jan 6 2014 ISI Kolkata ML for MT 13
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Illustration of transfer SVOSOV
S
NP VP
V N NP
N John eats
bread
S
NP VP
V N
John eats
NP
N
bread
(transfer svo sov)
Jan 6 2014 ISI Kolkata ML for MT 14
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Universality hypothesis
Universality hypothesis At the
level of ldquodeep meaningrdquo all texts
are the ldquosamerdquo whatever the
language
Jan 6 2014 ISI Kolkata ML for MT 15
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Understanding the Analysis-Transfer-Generation over Vauquois triangle (14)
H11 सरकार_न चनावो_क_बाद मबई म करो_क_माधयम_स
अपन राजसव_को बढ़ाया | T11 Sarkaar ne chunaawo ke baad Mumbai me karoM ke
maadhyam se apne raajaswa ko badhaayaa
G11 Government_(ergative) elections_after Mumbai_in
taxes_through its revenue_(accusative) increased
E11 The Government increased its revenue after the
elections through taxes in Mumbai
Jan 6 2014 ISI Kolkata ML for MT 16
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Interlingual representation complete disambiguation
Washington voted Washington to power Vote
past
Washington power Washington
emphasis
ltis-a gt action
ltis-a gt place
ltis-a gt capability ltis-a gt hellip
ltis-a gt person
goal
Jan 6 2014 ISI Kolkata ML for MT 17
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Kinds of disambiguation needed for a complete and correct interlingua graph
N Name
P POS
A Attachment
S Sense
C Co-reference
R Semantic Role
Jan 6 2014 ISI Kolkata ML for MT 18
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech Noun or Verb
Jan 6 2014 ISI Kolkata ML for MT 19
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
John is the name of a PERSON
Jan 6 2014 ISI Kolkata ML for MT 20
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD Financial
bank or River bank
Jan 6 2014 ISI Kolkata ML for MT 21
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
ldquoitrdquo ldquobankrdquo
Jan 6 2014 ISI Kolkata ML for MT 22
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Issues to handle
Sentence I went with my friend John to the bank to withdraw some money but was disappointed to find it closed
ISSUES Part Of Speech
NER
WSD
Co-reference
Subject Drop
Pro drop (subject ldquoIrdquo)
Jan 6 2014 ISI Kolkata ML for MT 23
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Typical NLP tools used
POS tagger
Stanford Named Entity Recognizer
Stanford Dependency Parser
XLE Dependency Parser
Lexical Resource
WordNet
Universal Word Dictionary (UW++)
Jan 6 2014 ISI Kolkata ML for MT 24
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
System Architecture Stanford
Dependency Parser
XLE Parser
Feature Generation
Attribute Generation
Relation Generation
Simple Sentence Analyser
NER
Stanford Dependency Parser
WSD
Clause Marker
Merger
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simple Enco
Simplifier
Jan 6 2014 ISI Kolkata ML for MT 25
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Target Sentence Generation from interlingua
Lexical Transfer
Target Sentence
Generation
Syntax
Planning
Morphological
Synthesis
(WordPhrase
Translation ) (Word form
Generation) (Sequence)
Jan 6 2014 ISI Kolkata ML for MT 26
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Generation Architecture Deconversion = Transfer + Generation
Jan 6 2014 ISI Kolkata ML for MT 27
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Transfer Based MT
Marathi-Hindi
Jan 6 2014 ISI Kolkata ML for MT 28
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Indian Language to Indian Language Machine Translation (ILILMT)
Bidirectional Machine Translation System
Developed for nine Indian language pairs
Approach
Transfer based
Modules developed using both rule based and statistical approach
Jan 6 2014 ISI Kolkata ML for MT 29
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Architecture of ILILMT System
Morphological Analyzer
Source Text
POS Tagger
Chunker
Vibhakti Computation
Name Entity Recognizer
Word Sense Disambiguatio
n
Lexical Transfer
Agreement Feature
Interchunk
Word Generator
Intrachunk
Target Text
Analysis
Transfer
Generation
Jan 6 2014 ISI Kolkata ML for MT 30
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
M-H MT system Evaluation
Subjective evaluation based on machine translation quality
Accuracy calculated based on score given by linguists
S5 Number of score 5 Sentences S4 Number of score 4 sentences
S3 Number of score 3 sentences N Total Number of sentences
Accuracy =
Score 5 Correct Translation
Score 4 Understandable with
minor errors
Score 3 Understandable with
major errors
Score 2 Not Understandable
Score 1 Non sense translation
Jan 6 2014 ISI Kolkata ML for MT 31
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Evaluation of Marathi to Hindi MT System
0
02
04
06
08
1
12
MorphAnalyzer
POS Tagger Chunker VibhaktiCompute
WSD LexicalTransfer
WordGenerator
Precision
Recall
Module-wise precision and recall
Jan 6 2014 ISI Kolkata ML for MT 32
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Evaluation of Marathi to Hindi MT System (cont)
Subjective evaluation on translation quality Evaluated on 500 web sentences Accuracy calculated based on score given according to the
translation quality Accuracy 6532
Result analysis
Morph POS tagger chunker gives more than 90 precision but Transfer WSD generator modules are below 80 hence degrades MT quality
Also morph disambiguation parsing transfer grammar and FW disambiguation modules are required to improve accuracy
Jan 6 2014 ISI Kolkata ML for MT 33
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Statistical Machine Translation
Jan 6 2014 ISI Kolkata ML for MT 34
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Czeck-English data
[nesu] ldquoI carryrdquo
[ponese] ldquoHe will carryrdquo
[nese] ldquoHe carriesrdquo
[nesou] ldquoThey carryrdquo
[yedu] ldquoI driverdquo
[plavou] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 35
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
To translate hellip
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 36
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Hindi-English data
[DhotA huM] ldquoI carryrdquo
[DhoegA] ldquoHe will carryrdquo
[DhotA hAi] ldquoHe carriesrdquo
[Dhote hAi] ldquoThey carryrdquo
[chalAtA huM] ldquoI driverdquo
[tErte hEM] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 37
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Bangla-English data
[bai] ldquoI carryrdquo
[baibe] ldquoHe will carryrdquo
[bay] ldquoHe carriesrdquo
[bay] ldquoThey carryrdquo
[chAlAi] ldquoI driverdquo
[sAMtrAy] ldquoThey swimrdquo
Jan 6 2014 ISI Kolkata ML for MT 38
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
To translate hellip (repeated)
I will carry
They drive
He swims
They will drive
Jan 6 2014 ISI Kolkata ML for MT 39
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Foundation
Data driven approach Goal is to find out the English sentence e
given foreign language sentence f whose p(e|f) is maximum
Translations are generated on the basis of statistical model
Parameters are estimated using bilingual parallel corpora
Jan 6 2014 ISI Kolkata ML for MT 40
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
SMT Language Model
To detect good English sentences
Probability of an English sentence w1w2 helliphellip wn can be written as
Pr(w1w2 helliphellip wn) = Pr(w1) Pr(w2|w1) Pr(wn|w1 w2 wn-1)
Here Pr(wn|w1 w2 wn-1) is the probability that word wn follows word string w1 w2 wn-1 N-gram model probability
Trigram model probability calculation
Jan 6 2014 ISI Kolkata ML for MT 41
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
SMT Translation Model
P(f|e) Probability of some f given hypothesis English translation e
How to assign the values to p(e|f)
Sentences are infinite not possible to find pair(ef) for all sentences
Introduce a hidden variable a that represents alignments between the individual words in the sentence pair
Sentence level
Word level
Jan 6 2014 ISI Kolkata ML for MT 42
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Alignment
If the string e= e1l= e1 e2 hellipel has l words and
the string f= f1m=f1f2fm has m words
then the alignment a can be represented by a series a1
m= a1a2am of m values each between 0 and l such that if the word in position j of the f-string is connected to the word in position i of the e-string then aj= i and
if it is not connected to any English word then aj= O
Jan 6 2014 ISI Kolkata ML for MT 43
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Example of alignment
English Ram went to school
Hindi Raama paathashaalaa gayaa
Ram went to school
ltNullgt Raama paathashaalaa gayaa
Jan 6 2014 ISI Kolkata ML for MT 44
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Translation Model Exact expression
Five models for estimating parameters in the expression [2]
Model-1 Model-2 Model-3 Model-4 Model-5
Choose alignment given e and m
Choose the identity of foreign word given e m a
Choose the length of foreign language string given e
Jan 6 2014 ISI Kolkata ML for MT 45
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
a
eafef )|Pr()|Pr(
m
emafeaf )|Pr()|Pr(
m
emafememaf )|Pr()|Pr()|Pr(
m
emafem )|Pr()|Pr(
m
m
j
jj
jj emfaafem1
1
1
1
1 )|Pr()|Pr(
m
j
jj
j
jj
j
m
emfafemfaaem1
1
11
1
1
1
1 )|Pr()|Pr()|Pr(
)|Pr( emaf )|Pr( em
m
j
jj
j
jj
j emfafemfaa1
1
11
1
1
1
1 )|Pr()|Pr(
Proof of Translation Model Exact
expression
m is fixed for a particular f hence
marginalization
marginalization
Jan 6 2014 ISI Kolkata ML for MT 46
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Alignment
Jan 6 2014 ISI Kolkata ML for MT 47
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Fundamental and ubiquitous
Spell checking
Translation
Transliteration
Speech to text
Text to speeh
Jan 6 2014 ISI Kolkata ML for MT 48
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
EM for word alignment from
sentence alignment example
English
(1) three rabbits
a b
(2) rabbits of Grenoble
b c d
French
(1) trois lapins
w x
(2) lapins de Grenoble
x y z
Jan 6 2014 ISI Kolkata ML for MT 49
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Initial Probabilities
each cell denotes t(a w) t(a x) etc
a b c d
w 14 14 14 14
x 14 14 14 14
y 14 14 14 14
z 14 14 14 14
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
The counts in IBM Model 1
Works by maximizing P(f|e) over the entire corpus
For IBM Model 1 we get the following relationship
c (w f |w e f e ) =t (w f |w e )
t (w f |we
0 ) +hellip + t (w f |wel )
c (w f |w e f e ) is the fractional count of the alignment of w f
with w e in f and e
t (w f |w e ) is the probability of w f being the translation of w e
is the count of w f in f
is the count of w e in e
Jan 6 2014 ISI Kolkata ML for MT 51
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Example of expected count
C[aw (a b)(w x)]
t(aw)
= ------------------------- X (a in lsquoa brsquo) X (w in lsquow xrsquo)
t(aw)+t(ax)
14
= ----------------- X 1 X 1= 12
14+14
Jan 6 2014 ISI Kolkata ML for MT 52
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
ldquocountsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 13 13 13
y 0 13 13 13
z 0 13 13 13
a b
w x
a b c d
w 12 12 0 0
x 12 12 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 53
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Revised probability example
trevised(a w)
12
= -------------------------------------------------------------------
(12+12 +0+0 )(a b)( w x) +(0+0+0+0 )(b c d) (x y z)
Jan 6 2014 ISI Kolkata ML for MT 54
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Revised probabilities table
a b c d
w 12 14 0 0
x 12 512 13 13
y 0 16 13 13
z 0 16 13 13
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
ldquorevised countsrdquo b c d
x y z
a b c d
w 0 0 0 0
x 0 59 13 13
y 0 29 13 13
z 0 29 13 13
a b
w x
a b c d
w 12 38 0 0
x 12 58 0 0
y 0 0 0 0
z 0 0 0 0
Jan 6 2014 ISI Kolkata ML for MT 56
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Re-Revised probabilities table a b c d
w 12 316 0 0
x 12 85144 13 13
y 0 19 13 13
z 0 19 13 13
Continue until convergence notice that (bx) binding gets progressively stronger
b=rabbits x=lapins
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Derivation of EM based Alignment Expressions
Hindi)(Say language ofy vocabular
English)(Say language ofry vocalbula
2
1
LV
LV
F
E
what is in a name नाम म कया ह naam meM kya hai name in what is what is in a name That which we call rose by any other name will smell as sweet
जजस हम गऱाब कहत ह और भी ककसी नाम स उसकी कशब सामान मीठा होगी Jise hum gulab kahte hai aur bhi kisi naam se uski khushbu samaan mitha hogii That which we rose say any other name by its smell as sweet That which we call rose by any other name will smell as sweet
E1
F1
E2
F2
Jan 6 2014 ISI Kolkata ML for MT 58
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Vocabulary mapping
Vocabulary VE VF
what is in a name that which we call rose by any other will smell as sweet
naam meM kya hai jise hum gulab kahte hai aur bhi kisi bhi uski khushbu saman mitha hogii
Jan 6 2014 ISI Kolkata ML for MT 59
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Key Notations English vocabulary 119881119864 French vocabulary 119881119865 No of observations sentence pairs 119878 Data 119863 which consists of 119878 observations looks like
11989011 11989012 hellip 119890
11198971 119891
11 11989112 hellip 119891
11198981
11989021 11989022 hellip 119890
21198972 119891
21 11989122 hellip 119891
21198982
1198901199041 119890
1199042 hellip 119890
119904119897119904 119891
1199041 1198911199042 hellip 119891
119904119898119904
1198901198781 119890
1198782 hellip 119890
119878119897119878 119891
1198781 1198911198782 hellip 119891
119878119898119878
No words on English side in 119904119905119893 sentence 119897119904 No words on French side in 119904119905119893 sentence 119898119904 119894119899119889119890119909119864 119890
119904119901 =Index of English word 119890119904119901in English vocabularydictionary
119894119899119889119890119909119865 119891119904119902 =Index of French word 119891119904119902in French vocabularydictionary
(Thanks to Sachin Pawar for helping with the maths formulae processing)
Jan 6 2014 ISI Kolkata ML for MT 60
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Hidden variables and parameters
Hidden Variables (Z) Total no of hidden variables = 119897119904 119898119904119878
119904=1 where each hidden variable is
as follows 119911119901119902119904 = 1 if in 119904119905119893 sentence 119901119905119893 English word is mapped to 119902119905119893 French
word 119911119901119902119904 = 0 otherwise
Parameters (Θ) Total no of parameters = 119881119864 times 119881119865 where each parameter is as follows 119875119894119895 = Probability that 119894119905119893 word in English vocabulary is mapped to 119895119905119893
word in French vocabulary
Jan 6 2014 ISI Kolkata ML for MT 61
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Likelihoods Data Likelihood L(D Θ)
Data Log-Likelihood LL(D Θ)
Expected value of Data Log-Likelihood E(LL(D Θ))
Jan 6 2014 ISI Kolkata ML for MT 62
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Constraint and Lagrangian
119875119894119895
119881119865
119895=1
= 1 forall119894
Jan 6 2014 ISI Kolkata ML for MT 63
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Differentiating wrt Pij
Jan 6 2014 ISI Kolkata ML for MT 64
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Final E and M steps
M-step
E-step
Jan 6 2014 ISI Kolkata ML for MT 65
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Combinatorial considerations
Jan 6 2014 ISI Kolkata ML for MT 66
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Example
ISI Kolkata ML for MT Jan 6 2014 67
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
All possible alignments
ISI Kolkata ML for MT Jan 6 2014 68
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
First fundamental requirement of SMT
Alignment requires evidence of
bull firstly a translation pair to introduce the
POSSIBILITY of a mapping
bull then another pair to establish with
CERTAINTY the mapping
ISI Kolkata ML for MT Jan 6 2014 69
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
For the ldquocertaintyrdquo
We have a translation pair containing alignment candidates and none of the other words in the translation pair
OR
We have a translation pair containing all words in the translation pair except the alignment candidates
ISI Kolkata ML for MT Jan 6 2014 70
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Thereforehellip
If M valid bilingual mappings exist in a translation pair then an additional M-1 pairs of translations will decide these mappings with certainty
ISI Kolkata ML for MT Jan 6 2014 71
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Rough estimate of data requirement
SMT system between two languages L1 and L2 Assume no a-priori linguistic or world knowledge
ie no meanings or grammatical properties of any words phrases or sentences
Each language has a vocabulary of 100000 words can give rise to about 500000 word forms
through various morphological processes assuming each word appearing in 5 different forms on the average For example the word bdquogo‟ appearing in bdquogo‟ bdquogoing‟
bdquowent‟ and bdquogone‟
ISI Kolkata ML for MT Jan 6 2014 72
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Reasons for mapping to multiple words
Synonymy on the target side (eg ldquoto gordquo in English translating to ldquojaanaardquo ldquogaman karnaardquo ldquochalnaardquo etc in Hindi) a phenomenon called lexical choice or register
polysemy on the source side (eg ldquoto gordquo translating to ldquoho jaanaardquo as in ldquoher face went red in angerrdquordquousakaa cheharaa gusse se laal ho gayaardquo)
syncretism (ldquowentrdquo translating to ldquogayaardquo ldquogayiirdquo or ldquogayerdquo) Masculine Gender 1st or 3rd person singular number past tense non-progressive aspect declarative mood
ISI Kolkata ML for MT Jan 6 2014 73
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Estimate of corpora requirement
Assume that on an average a sentence is 10 words long
an additional 9 translation pairs for getting at one of the 5 mappings
10 sentences per mapping per word
a first approximation puts the data requirement at 5 X 10 X 500000= 25 million parallel sentences
Estimate is not wide off the mark
Successful SMT systems like Google and Bing reportedly use 100s of millions of translation pairs
ISI Kolkata ML for MT Jan 6 2014 74
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Our work on factor based SMT
Ananthakrishnan Ramanathan Hansraj Choudhary Avishek Ghosh and Pushpak Bhattacharyya Case markers and Morphology Addressing the crux of the fluency problem in English-Hindi SMT ACL-IJCNLP 2009 Singapore August 2009
Jan 6 2014 ISI Kolkata ML for MT 75
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Case Marker and Morphology crucial in E-H MT
Order of magnitiude facelift in Fluency and fidelity
Determined by the combination of suffixes and semantic relations on the English side
Augment the aligned corpus of the two languages with the correspondence of English suffixes and semantic relations with Hindi suffixes and case markers
Jan 6 2014 ISI Kolkata ML for MT 76
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Semantic relations+SuffixesCase
Markers+inflections
I ate mangoes
I ltagt ate eatpast mangoes ltobj
I ltagt mangoes ltobjpl eatpast
mei_ne aam khaa_yaa
Jan 6 2014 ISI Kolkata ML for MT 77
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Our Approach
Factored model (Koehn and Hoang 2007) with the following translation factor
suffix + semantic relation case markersuffix
Experiments with the following relations
Dependency relations from the stanford parser
Deeper semantic roles from Universal Networking Language (UNL)
Jan 6 2014 ISI Kolkata ML for MT 78
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Our Factorization
Jan 6 2014 ISI Kolkata ML for MT 79
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Experiments
Jan 6 2014 ISI Kolkata ML for MT 80
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Corpus Statistics
Jan 6 2014 ISI Kolkata ML for MT 81
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Results The impact of suffix and semantic factors
Jan 6 2014 ISI Kolkata ML for MT 82
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Results The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 83
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Subjective Evaluation The impact of reordering and semantic relations
Jan 6 2014 ISI Kolkata ML for MT 84
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Impact of sentence length (F Fluency AAdequacy E Errors)
Jan 6 2014 ISI Kolkata ML for MT 85
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
A feel for the improvement-baseline
Jan 6 2014 ISI Kolkata ML for MT 86
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
A feel for the improvement-reorder
Jan 6 2014 ISI Kolkata ML for MT 87
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
A feel for the improvement-Semantic relation
Jan 6 2014 ISI Kolkata ML for MT 88
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
A recent study
PAN Indian SMT
Jan 6 2014 ISI Kolkata ML for MT 89
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Pan-Indian Language SMT httpwwwcfiltiitbacinindic-translator
SMT systems between 11 languages 7 Indo-Aryan Hindi Gujarati Bengali Oriya Punjabi
Marathi Konkani 3 Dravidian languages Malayalam Tamil Telugu English
Corpus Indian Language Corpora Initiative (ILCI) Corpus Tourism and Health Domains 50000 parallel sentences
Evaluation with BLEU METEOR scores also show high correlation with BLEU
Jan 6 2014 ISI Kolkata ML for MT 90
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
SMT Systems Trained
Phrase-based (PBSMT) baseline system (S1)
E-IL PBSMT with Source side reordering rules (Ramanathan et al 2008) (S2)
E-IL PBSMT with Source side reordering rules (Patel et al 2013) (S3)
IL-IL PBSMT with transliteration post-editing (S4)
Jan 6 2014 ISI Kolkata ML for MT 91
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Natural Partitioning of SMT systems
Clear partitioning of translation pairs by language family pairs based on translation accuracy
Shared characteristics within language families make translation simpler
Divergences among language families make translation difficult
Baseline PBSMT - BLEU scores (S1)
Jan 6 2014 ISI Kolkata ML for MT 92
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
The Challenge of Morphology Morphological complexity vs BLEU
Training Corpus size vs BLEU
Vocabulary size is a proxy for morphological complexity Note For Tamil a smaller corpus was used for computing vocab size
bull Translation accuracy decreases with increasing morphology
bull Even if training corpus is increased commensurate improvement in translation accuracy is not seen for morphologically rich languages
bull Handling morphology in SMT is critical Jan 6 2014 ISI Kolkata ML for MT 93
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Common Divergences Shared Solutions
All Indian languages have similar word order The same structural divergence between English and Indian
languages SOVlt-gtSVO etc Common source side reordering rules improve E-IL
translation by 114 (generic) and 186 (Hindi-adapted) Common divergences can be handled in a common
framework in SMT systems ( This idea has been used for knowledge based MT systems eg Anglabharati )
Comparison of source reordering methods for E-IL SMT - BLEU scores (S1S2S3)
Jan 6 2014 ISI Kolkata ML for MT 94
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Harnessing Shared Characteristics
Out of Vocabulary words are transliterated in a post-editing step Done using a simple transliteration scheme which harnesses the common
phonetic organization of Indic scripts Accuracy Improvements of 05 BLEU points with this simple approach
Harnessing common characteristics can improve SMT output
PBSMT+ transliteration post-editing for E-IL SMT - BLEU scores (S4)
Jan 6 2014 ISI Kolkata ML for MT 95
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Cognition and Translation Measuring Translation Difficulty
Abhijit Mishra and Pushpak Bhattacharyya Automatically Predicting Sentence Translation Difficulty ACL 2013 Sofia Bulgaria 4-9 August 2013
96 Jan 6 2014 ISI Kolkata ML for MT
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Scenario
Sentences
John ate jam
John ate jam made from apples
John is in a jam
Subjective notion of difficulty
Easy
Moderate
Difficult
Jan 6 2014 ISI Kolkata ML for MT 97
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Use behavioural data
Use behavioural data to decipher strong AI algorithms
Specifically
For WSD by humans see where the eye rests for clues
For the innate translation difficulty of sentences see how the eye moves back and forth over the sentences
Jan 6 2014 ISI Kolkata ML for MT 98
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Image Courtesy httpwwwsmashingmagazinecom2007100930-usability-issues-to-be-aware-of
Fixations
Saccades
Jan 6 2014 ISI Kolkata ML for MT 99
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Eye Tracking data Gaze points Position of eye-gaze on the
screen
Fixations A long stay of the gaze on a particular object on the screen
Fixations have both Spatial (coordinates) and Temporal (duration) properties
Saccade A very rapid movement of eye between the positions of rest
Scanpath A path connecting a series of fixations
Regression Revisiting a previously read segment
Jan 6 2014 ISI Kolkata ML for MT 100
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Controlling the experimental setup for eye-tracking
Eye movement patterns influenced by factors like age working proficiency environmental distractions etc
Guidelines for eye tracking Participants metadata (age expertise occupation)
etc
Performing a fresh calibration before each new experiment
Minimizing the head movement
Introduce adequate line spacing in the text and avoid scrolling
Carrying out the experiments in a relatively low light environment
Jan 6 2014 ISI Kolkata ML for MT 101
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Use of eye tracking
Used extensively in Psychology
Mainly to study reading processes
Seminal work Just MA and Carpenter PA (1980) A theory of reading from eye fixations to comprehension Psychological Review 87(4)329ndash354
Used in flight simulators for pilot training
Jan 6 2014 ISI Kolkata ML for MT 102
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
NLP and Eye Tracking research
Kliegl (2011)- Predict word frequency
and pattern from eye movements Doherty et al (2010)- Eye-tracking as an
automatic Machine Translation Evaluation Technique
Stymne et al (2012)- Eye-tracking as a tool for Machine Translation (MT) error analysis
Dragsted (2010)- Co-ordination of reading and writing process during translation
Relatively new and open research direction
Jan 6 2014 ISI Kolkata ML for MT 103
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Translation Difficulty Index (TDI)
Motivation route sentences to translators with right competence as per difficulty of translating
On a crowdsourcing platform eg
TDI is a function of
sentence length (l)
degree of polysemy of constituent words (p) and
structural complexity (s)
104 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Contributor to TDI length
What is more difficult to translate
John eats jam vs
John eats jam made from apples vs
John eats jam made from apples grown in orchards vs
John eats bread made from apples grown in orchards on black soil
105 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Contributor to TDI polysemy
What is more difficult to translate
John is in a jam vs
John is in difficulty
Jam has 4 diverse senses difficulty has 4 related senses
106 Jan 6 2014 ISI Kolkata ML for MT
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Contributor to TDI structural complexity
What is more difficult to translate
John is in a jam His debt is huge The lenders cause him to shy from them every moment he sees them vs
John is in a jam caused by his huge debt which forces him to shy from his lenders every moment he sees them
107 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Measuring translation through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
108 Jan 6 2014 ISI Kolkata ML for MT
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Measuring translation difficulty through Gaze data
Translation difficulty indicated by
staying of eye on segments
Jumping back and forth between segments
Example
The horse raced past the garden fell
बगीचा क पास स दौडाया गया घोड़ा गगर गया bagiichaa ke pas se doudaayaa gayaa ghodaa
gir gayaa
The translation process will complete the task till garden and then backtrack revise restart and translate in a different way 109 Jan 6 2014 ISI Kolkata ML for MT
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Scanpaths indicator of translation difficulty
(Malsburg et al 2007)
Sentence 2 is a clear case of ldquoGarden pathingrdquo
which imposes cognitive load on participants and the prefer syntactic re-analysis
Jan 6 2014 ISI Kolkata ML for MT 110
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Translog A tool for recording Translation Process Data Translog (Carl 2012) A Windows based program
Built with a purpose of recording gaze and key-stroke data during translation
Can be used for other reading and writing related studies
Using Translog one can Create and Customize translationreading and writing
experiments involving eye-tracking and keystroke logging
Calibrate the eye-tracker
Replay and analyze the recorded log files
Manually correct errors in gaze recording
Jan 6 2014 ISI Kolkata ML for MT 111
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
TPR Database The Translation Process Research (TPR) database
(Carl 2012) is a database containing behavioral data for translation activities
Contains Gaze and Keystroke information for more than 450 experiments
40 different paragraphs are translated into 7 different languages from English by multiple translators
At least 5 translators per language
Source and target paragraphs are annotated with POS tags lemmas dependency relations etc
Easy to use XML data format
Jan 6 2014 ISI Kolkata ML for MT 112
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Experimental setup (12)
Translators translate sentence by sentence typing to a text box
The display screen is attached with a remote eye-tracker which
constantly records the eye movement of the translator
113 Jan 6 2014 ISI Kolkata ML for MT
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Experimental setup (22)
Extracted 20 different text categories from the data
Each piece of text contains 5-10 sentences
For each category we had at least 10 participants who translated the text into different target languages
114 Jan 6 2014 ISI Kolkata ML for MT
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
A predictive framework for TDI
Direct annotation of TDI is fraught with subjectivity and ad-hocism
We use translator‟s gaze data as annotation to prepare training data
Training data
Regressor
Labeling through gaze analysis Features
Test Data
TDI
Jan 6 2014 ISI Kolkata ML for MT 115
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Annotation of TDI (14)
First approximation -gt TDI equivalent to ldquotime taken to translaterdquo
However time taken to translate may not be strongly related to translation difficulty
It is difficult to know what fraction of the total time is spent on translation related thinking
Sensitive to distractions from the environment
Jan 6 2014 ISI Kolkata ML for MT 116
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Annotation of TDI (24)
Instead of the ldquotime taken to translaterdquo consider ldquotime for which translation related processing is carried out by the brainrdquo
This is called Translation Processing Time given by
119879119901 = 119879119888119900119898119901+119879119892119890119899
Tcomp and Tgen are the comprehension of source text comprehension and target text generation respectively
Jan 6 2014 ISI Kolkata ML for MT 117
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Annotation of TDI (34)
Humans spend time on what they see and this ldquotimerdquo is correlated with the complexity of the information being processed
f- fixation s- saccade Fs- source Ft- target
119879119901 = 119889119906119903 119891 +
119891 isin 119865119904
119889119906119903 119904 +
119904 isin 119878119904
119889119906119903 119891 +
119891 isin 119865119905
119889119906119903 119904
119904 isin 119878119905
Jan 6 2014 ISI Kolkata ML for MT 118
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Annotation of TDI (44)
The measured TDI score is the Tp normalized over sentence length
119879119863119868119898119890119886119904119906119903119890119889 =119879119901
119904119890119899119905119890119899119888119890_119897119890119899119892119905119893
Jan 6 2014 ISI Kolkata ML for MT 119
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Features
Length Word count of the sentences
Degree of Polysemy Sum of number of senses of each word in the WordNet normalized by length
Structural Complexity If the attachment units lie far from each other the sentence has higher structural complexity Lin (1996) defines it as the total length of dependency links in the dependency structure of the sentence
Measured TDI for TPR database for 80 sentences
Jan 6 2014 ISI Kolkata ML for MT 120
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Experiment and results
Training data of 80 examples 10-fold cross validation
Features computed using Princeton WordNet and Stanford Dependency Parser
Support Vector Regression technique (Joachims et al 1999) along with different kernels
Error analysis was done by Mean Squared Error estimate
We also computed the correlation of the predicted TDI with the measured TDI
Jan 6 2014 ISI Kolkata ML for MT 121
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Examples from the dataset
Jan 6 2014 ISI Kolkata ML for MT 122
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Summary
Covered Interlingual based MT the oldest approach to MT
Covered SMT the newest approach to MT
Presented some recent study in the context of Indian Languages
Covered eye tracking bring cognitive aspects to MT
123 Jan 6 2014 ISI Kolkata ML for MT
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Summary
SMT is the ruling paradigm
But linguistic features can enhance performance especially the factored based SMT with factors coming from interlingua
Large scale effort sponsored by ministry of IT TDIL program to create MT systems
Parallel corpora creation is also going on in a consortium mode
Jan 6 2014 ISI Kolkata ML for MT 124
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Conclusions NLP has assumed great importance because
of large amount of text in e-form
Machine learning techniques are increasingly applied
Highly relevant for India where multilinguality is way of life
Machine Translation is more fundamental and ubiquitous than just mapping between two languages
Utterancethought
Speech to speech online translation Jan 6 2014 ISI Kolkata ML for MT 125
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126
Pubs httpwwcseiitbacin~pb
Resources and tools
httpwwwcfiltiitbacin
Jan 6 2014 ISI Kolkata ML for MT 126