Parallel Reverse Treebanks for the Discovery of Morpho-Syntactic Markings

Lori Levin, Robert Frederking, Alison Alvarez
Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Jeff Good
Department of Linguistics, Max Planck Institute for Evolutionary Anthropology
Reverse Treebank (RTB)
• What?
– Create the syntactic structures first
– Then add sentences
• Why?
– To elicit data from speakers of less commonly taught languages:
• Decide what meaning we want to elicit
• Represent the meaning in a feature structure
• Add an English or Spanish sentence (plus context notes) to express the meaning
• Ask the informant to translate it
Bengali Example
srcsent: The large bus to the post office broke down.
context:
tgtsent:

((actor ((modifier ((mod-role mod-descriptor) (mod-role role-loc-general-to)))
         (np-identifiability identifiable)
         (np-specificity specific)
         (np-biological-gender bio-gender-n/a)
         (np-animacy anim-inanimate)
         (np-person person-third)
         (np-function fn-actor)
         (np-general-type common-noun-type)
         (np-number num-sg)
         (np-pronoun-exclusivity inclusivity-n/a)
         (np-pronoun-antecedent antecedent-n/a)
         (np-distance distance-neutral)))
 (c-general-type declarative-clause)
 (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a)
 (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a)
 (c-comparator-function comparator-n/a)
 (c-causee-control control-n/a)
 (c-our-situations situations-n/a)
 (c-comparand-type comparand-n/a)
 (c-causation-directness directness-n/a)
 (c-source source-neutral)
 (c-causee-volitionality volition-n/a)
 (c-assertiveness assertiveness-neutral)
 (c-solidarity solidarity-neutral)
 (c-polarity polarity-positive)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-adjunct-clause-type adjunct-clause-type-n/a)
 (c-v-phase-aspect phase-aspect-neutral)
 (c-v-lexical-aspect activity-accomplishment)
 (c-secondary-type secondary-neutral)
 (c-event-modality event-modality-none)
 (c-function fn-main-clause)
 (c-minor-type minor-n/a)
 (c-copula-type copula-n/a)
 (c-v-absolute-tense past)
 (c-power-relationship power-peer)
 (c-our-shared-subject shared-subject-n/a)
 (c-question-gap gap-n/a))
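Read back into code, a fragment of such a feature structure can be handled with a small s-expression reader. This is an illustrative sketch, not the AVENUE tools' own parser:

```python
# Minimal s-expression reader for Lisp-style feature structures
# (illustrative sketch; not the AVENUE elicitation tool's parser).
import re

def tokenize(text):
    # Split the text into parentheses and bare atoms.
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens):
    """Recursively build nested lists from a token stream."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return node
    return token

fs_text = "((np-person person-third) (np-number num-sg))"
fs = parse(tokenize(fs_text))
print(fs)  # [['np-person', 'person-third'], ['np-number', 'num-sg']]
```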
Outline

• Background
– The AVENUE Machine Translation System
• Contents of the RTB
– An inventory of grammatical meanings
– Languages that have been elicited
• Tools for RTB creation
• Future work
– Evaluation
– Navigation
AVENUE Machine Translation System
Type information
Synchronous Context Free Rules
Alignments
x-side constraints
y-side constraints
xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
 ((X1 AGR) = *3-SING) ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING) ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF) ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING) ((Y2 GENDER) = (Y4 GENDER)))
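As a toy illustration (not the AVENUE transfer engine), the alignments in this rule amount to the following reordering, using the Hebrew words from the comment line:

```python
# Toy sketch of the reordering specified by the rule's alignments
# (X1::Y1)(X1::Y3)(X2::Y4)(X3::Y2): the determiner fills Y1 and Y3,
# the noun fills Y2, and the adjective fills Y4.
def apply_np_rule(det, adj, n):
    """Map [DET ADJ N] constituents to the target order [DET N DET ADJ]."""
    x = {1: det, 2: adj, 3: n}
    return [x[1], x[3], x[1], x[2]]

# "the old man" -> ha-ish ha-zaqen
print(apply_np_rule("ha", "zaqen", "ish"))  # ['ha', 'ish', 'ha', 'zaqen']
```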
Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI)
Rule learning: Katharina Probst
AVENUE
• Rules can be written by hand or learned automatically.
• Hybrid
– Rule-based transfer
– Statistical decoder
– Multi-engine combinations with SMT and EBMT
AVENUE systems
(Small and experimental, but tested on unseen data)

• Hebrew-to-English
– Alon Lavie, Shuly Wintner, Katharina Probst
– Hand-written and automatically learned
– Automatic rules trained on 120 sentences perform slightly better than about 20 hand-written rules.
• Hindi-to-English
– Lavie, Peterson, Probst, Levin, Font, Cohen, Monson
– Automatically learned
– Performs better than SMT when training data is limited to 50K words
AVENUE systems
(Small and experimental, but tested on unseen data)

• English-to-Spanish
– Ariadna Font Llitjos
– Hand-written, automatically corrected
• Mapudungun-to-Spanish
– Roberto Aranovich and Christian Monson
– Hand-written
• Dutch-to-English
– Simon Zwarts
– Hand-written
Elicitation
• Get data from someone who is
– Bilingual
– Literate
– Not experienced with linguistics
English-Hindi Example
Elicitation Tool: Erik Peterson
English-Chinese Example
English-Arabic Example
Elicitation
srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell

srcsent: Tú estás cayendo
tgtsent: eymi petu ütrünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling

srcsent: Tú caíste
tgtsent: eymi ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
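A record like these can be split into its labeled fields. This sketch assumes only the field names shown above (srcsent, tgtsent, aligned, context, comment):

```python
# Sketch of splitting a one-line elicitation record into its fields,
# assuming the five field labels used in the examples above.
import re

FIELDS = ("srcsent", "tgtsent", "aligned", "context", "comment")

def parse_record(text):
    # Split on the field labels; the capturing group keeps each label.
    pattern = r"(%s):\s*" % "|".join(FIELDS)
    parts = re.split(pattern, text)[1:]  # drop the leading empty string
    return dict(zip(parts[0::2], [p.strip() for p in parts[1::2]]))

rec = parse_record(
    "srcsent: Tú caíste tgtsent: eymi ütrünagimi "
    "aligned: ((1,1),(2,2)) context: tú = Juan comment: You (John) fell"
)
print(rec["tgtsent"])  # eymi ütrünagimi
```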
Outline

• Background
– The AVENUE Machine Translation System
• Contents of the RTB
– An inventory of grammatical meanings
– Languages that have been elicited
• Tools for RTB creation
• Future work
– Evaluation
– Navigation
Size of RTB
• Around 3200 sentences
• 20K words
Languages
• The set of feature structures with English sentences has been delivered to the Linguistic Data Consortium as part of the Reflex program.
• Translated (by LDC) into:
– Thai
– Bengali
• Plans to translate into:
– Seven “strategic” languages per year for five years
• As one small part of a language pack (BLARK) for each language.
Languages
• Feature structures are being reverse annotated in Spanish at New Mexico State University (Helmreich and Cowie)
– Plans to translate into Guarani
• Reverse annotation into Portuguese in Brazil (Marcello Modesto)
– Plans to translate into Karitiana (200 speakers)
• Plans to translate into Inupiaq (Kaplan and MacLean)
Previous Elicitation Work
• Pilot corpus
– Around 900 sentences
– No feature structures
• Mapudungun
– Two partial translations
• Quechua
– Three translations
• Aymara
– Seven translations
• Hebrew
• Hindi
– Several translations
• Dutch
Sample: clause level

• Mary is writing a book for John.
• Who let him eat the sandwich?
• Who had the machine crush the car?
• They did not make the policeman run.
• Mary had not blinked.
• The policewoman was willing to chase the boy.
• Our brothers did not destroy files.
• He said that there is not a manual.
• The teacher who wrote a textbook left.
• The policeman chased the man who was a thief.
• Mary began to work.

Grammatical meanings illustrated:
• Tense, aspect, transitivity
• Questions, causation and permission
• Interaction of lexical and grammatical aspect
• Volitionality
• Embedded clauses and sequence of tense
• Relative clauses
• Phase aspect
Sample: noun phrase level

• The man quit in November.
• The man works in the afternoon.
• The balloon floated over the library.
• The man walked over the platform.
• The man came out from among the group of boys.
• The long weekly meeting ended.
• The large bus to the post office broke down.
• The second man laughed.
• All five boys laughed.

Grammatical meanings illustrated:
• Temporal and locative meanings
• Quantifiers
• Numbers
• Combinations of different types of modifiers
– My book (possession, definiteness)
– A book of mine (possession, indefiniteness)
Example
srcsent: The large bus to the post office broke down.
(same feature structure as in the Bengali example above)
Grammatical meanings vs syntactic categories
• Features and values are based on a collection of grammatical meanings
– Many of which are similar to the grammatemes of the Prague Treebanks
Grammatical Meanings
YES
• Semantic Roles
• Identifiability
• Specificity
• Time
– Before, after, or during time of speech
• Modality

NO
• Case
• Voice
• Determiners
• Auxiliary verbs
Grammatical Meanings
YES
• How is identifiability expressed?
– Determiner
– Word order
– Optional case marker
– Optional verb agreement
• How is specificity expressed?
• How are generics expressed?
• How are predicate nominals marked?

NO
• How are English determiners translated?
– The boy cried.
– The lion is a fierce beast.
– I ate a sandwich.
– He is a soldier.
• Il est soldat. (French: “He is a soldier”, with no article)
Argument Roles
• Actor
– Roughly, deep subject
• Undergoer
– Roughly, deep object
• Predicate and predicatee
– The woman is the manager.
• Recipient
– I gave a book to the students.
• Beneficiary
– I made a phone call for Sam.
Why not subject and object?
• Languages use their voice systems for different purposes.
• Mapudungun obligatorily uses an inverse-marked verb when a third person acts on a first or second person.
– Verb agrees with the undergoer
– Undergoer exhibits other subjecthood properties
– Actor may be an object
• Yes: How are actor and undergoer encoded in combination with other semantic features like adversity (Japanese) and person (Mapudungun)?
• No: How is English voice translated into another language?
Argument Roles
• Accompaniment
– With someone
– With pleasure
• Material
– (out) of wood
• About 20 more roles
– From the Lingua checklist; Comrie & Smith (1977)
– Many also found in tectogrammatical representations
• Around 80 locative relations
– From the Lingua checklist
• Many temporal relations
Noun Phrase Features
• Person
• Number
• Biological gender
• Animacy
• Distance (for deictics)
• Identifiability
• Specificity
• Possession
• Other semantic roles
– Accompaniment, material, location, time, etc.
• Type
– Proper, common, pronoun
• Cardinals
• Ordinals
• Quantifiers
• Given and new information
– Not used yet because of limited context in the elicitation tool.
Clause level features
• Tense
• Aspect
– Lexical, grammatical, phase
• Type
– Declarative, open-q, yes-no-q
• Function
– Main, argument, adjunct, relative
• Source
– Hearsay, first-hand, sensory, assumed
• Assertedness
– Asserted, presupposed, wanted
• Modality
– Permission, obligation
– Internal, external
Other clause types (Constructions)

• Causative
– Make/let/have someone do something
• Predication
– May be expressed with or without an overt copula.
• Existential
– There is a problem.
• Impersonal
– One doesn’t smoke in restaurants in the US.
• Lament
– If only I had read the paper.
• Conditional
• Comparative
• Etc.
Outline

• Background
– The AVENUE Machine Translation System
• Contents of the RTB
– An inventory of grammatical meanings
– Languages that have been elicited
• Tools for RTB creation
• Future work
– Evaluation
– Navigation
Tools for RTB Creation
• Change the inventory of grammatical meanings
• Make new RTBs for other purposes
Mar 1, 2006
The Process

[Workflow diagram:]
Feature Specification: list of semantic features and values
→ Feature Maps: which combinations of features and values are of interest
(Clause-Level, Noun-Phrase, Tense & Aspect, Modality, …)
→ Feature Structure Sets
→ Reverse Annotated Feature Structure Sets: add English sentences
→ The Corpus
→ Sampling → Smaller Corpus
Feature Specification
• XML Schema
• XSLT Script
• Human-readable form
– Feature: Causer intentionality
• Values: intentional, unintentional
– Feature: Causee control
• Values: in control, not in control
– Feature: Causee volitionality
• Values: willing, unwilling
– Feature: Causation type
• Values: direct, indirect
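A hedged sketch of how such a specification might look in XML and be rendered into the human-readable form; the element and attribute names here are invented for illustration, since the actual schema is not shown:

```python
# Illustrative XML encoding of one feature specification entry and its
# rendering as the human-readable "Feature: ... / Values: ..." form.
# The element/attribute names are assumptions, not the real schema.
import xml.etree.ElementTree as ET

spec = ET.fromstring("""
<feature name="Causee volitionality">
  <value>willing</value>
  <value>unwilling</value>
</feature>
""")

print("Feature: %s" % spec.get("name"))
print("Values: %s" % ", ".join(v.text for v in spec.findall("value")))
```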
Feature Combination
• Person and number interact with tense in many fusional languages.
• In English, tense interacts with questions:
– Will you go?
Feature Combination Template

((predicatee ((np-general-type pronoun-type common-noun-type)
              (np-person person-first person-second person-third)
              (np-number num-sg num-pl)
              (np-biological-gender bio-gender-male bio-gender-female)))
 {[(predicate ((np-general-type common-noun-type) (np-person person-third)))
   (c-copula-type role)]
  [(predicate ((adj-general-type quality-type) (c-copula-type attributive)))]
  [(predicate ((np-general-type common-noun-type) (np-person person-third)
    (c-copula-type identity)))]}
 (c-secondary-type secondary-copula)
 (c-polarity #all)
 (c-general-type declarative)
 (c-speech-act sp-act-state)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-v-lexical-aspect state)
 (c-v-absolute-tense past present future)
 (c-v-phase-aspect durative))
Summarizes 288 feature structures, which are automatically generated.
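The expansion from a template with multi-valued slots to a set of feature structures is a cross-product. This sketch covers only a small subset of the features above (and assumes two values for c-polarity, since #all is not spelled out here), so it generates 36 structures rather than the full 288:

```python
# Sketch of expanding a feature-combination template: every multi-valued
# slot multiplies out via a cross-product. Subset of the template above;
# the two c-polarity values are assumed from "#all".
import itertools

template = {
    "np-person": ["person-first", "person-second", "person-third"],
    "np-number": ["num-sg", "num-pl"],
    "c-polarity": ["polarity-positive", "polarity-negative"],
    "c-v-absolute-tense": ["past", "present", "future"],
}

structures = [dict(zip(template, vals))
              for vals in itertools.product(*template.values())]
print(len(structures))  # 3 * 2 * 2 * 3 = 36
```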
Annotation Tool
• Feature structure viewer
– Various views of the feature structure:
• Omit features whose value is not-applicable
• Group related features together
– Aspect
– Causation
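The "omit not-applicable features" view can be sketched as a simple filter on value suffixes; the feature names are taken from the example structures earlier, and the suffix convention is an assumption from those examples:

```python
# Sketch of the viewer option that hides not-applicable features:
# drop every pair whose value ends in "-n/a" (suffix convention assumed
# from the example feature structures in these slides).
fs = {
    "np-person": "person-third",
    "np-animacy": "anim-inanimate",
    "np-pronoun-antecedent": "antecedent-n/a",
    "np-pronoun-exclusivity": "inclusivity-n/a",
}

visible = {k: v for k, v in fs.items() if not v.endswith("-n/a")}
print(visible)  # {'np-person': 'person-third', 'np-animacy': 'anim-inanimate'}
```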
Outline
• Background
– The AVENUE Machine Translation System
• Contents of the RTB
– An inventory of grammatical meanings
– Languages that have been elicited
• Tools for RTB creation
• Future work
– Evaluation
– Navigation
Evaluation
• Current funding has not covered evaluation of the RTB.
– Except for informal observations as it was translated into several languages.
• Does it elicit the meanings it was intended to elicit?
– Informal observation: usually
• Is it useful for machine translation?
Hard Problems
• Reverse annotating meanings that are not grammaticalized in English.
– Evidentiality:
• He stole the bread.
• Context: Translate this as if you do not have first-hand knowledge. In English, we might say, “They say that he stole the bread” or “I hear that he stole the bread.”
Hard Problems
• Reverse annotating things that can be said in several ways in English.
– Impersonals:
• One doesn’t smoke here.
• You don’t smoke here.
• They don’t smoke here.
• Credit cards aren’t accepted.
– A problem in the Reflex corpus because space was limited.
Navigation
• Currently, feature combinations are specified by a human.
• Plan to work in active learning mode:
– Build a seed RTB
– Translate some data
– Do some learning
– Identify the most valuable pieces of information to get next
– Generate an RTB for those pieces of information
– Translate more
– Learn more
– Generate more, etc.
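The steps above can be sketched as a loop; every function here is a hypothetical placeholder standing in for a real component (informant translation, rule learning, feature selection), not an existing AVENUE API:

```python
# Skeleton of the planned active-learning loop. All functions are
# placeholders illustrating the flow of data, not real components.
def build_seed_rtb():
    return ["The man fell."]  # placeholder seed RTB

def translate(rtb):
    # Placeholder for informants translating each RTB sentence.
    return [(s, "<target-language translation>") for s in rtb]

def learn(model, pairs):
    # Placeholder learner: accumulate the parallel data seen so far.
    return (model or []) + pairs

def select_informative_features(model):
    # Placeholder: pick the most valuable information to elicit next.
    return ["tense", "evidentiality"]

def generate_rtb(features):
    # Placeholder: new sentences probing exactly those features.
    return ["sentence probing %s" % f for f in features]

def active_learning_loop(rounds=2):
    rtb = build_seed_rtb()                           # build seed RTB
    model = None
    for _ in range(rounds):
        pairs = translate(rtb)                       # translate some data
        model = learn(model, pairs)                  # do some learning
        wanted = select_informative_features(model)  # most valuable info next
        rtb = generate_rtb(wanted)                   # RTB for that info
    return model

print(len(active_learning_loop()))  # 1 seed pair + 2 probe pairs = 3
```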