
Page 1: Parallel Reverse Treebanks for the Discovery of Morpho-Syntactic Markings

Lori Levin
Robert Frederking
Alison Alvarez
Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Jeff Good
Department of Linguistics
Max Planck Institute for Evolutionary Anthropology

Page 2

Reverse Treebank (RTB)

• What?
  – Create the syntactic structures first
  – Then add sentences

• Why?
  – To elicit data from speakers of less commonly taught languages:
    • Decide what meaning we want to elicit
    • Represent the meaning in a feature structure
    • Add an English or Spanish sentence (plus context notes) to express the meaning
    • Ask the informant to translate it

Page 3

Bengali Example

srcsent: The large bus to the post office broke down.
context:
tgtsent:

((actor ((modifier ((mod-role mod-descriptor)
                    (mod-role role-loc-general-to)))
         (np-identifiability identifiable)
         (np-specificity specific)
         (np-biological-gender bio-gender-n/a)
         (np-animacy anim-inanimate)
         (np-person person-third)
         (np-function fn-actor)
         (np-general-type common-noun-type)
         (np-number num-sg)
         (np-pronoun-exclusivity inclusivity-n/a)
         (np-pronoun-antecedent antecedent-n/a)
         (np-distance distance-neutral)))
 (c-general-type declarative-clause)
 (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a)
 (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a)
 (c-comparator-function comparator-n/a)
 (c-causee-control control-n/a)
 (c-our-situations situations-n/a)
 (c-comparand-type comparand-n/a)
 (c-causation-directness directness-n/a)
 (c-source source-neutral)
 (c-causee-volitionality volition-n/a)
 (c-assertiveness assertiveness-neutral)
 (c-solidarity solidarity-neutral)
 (c-polarity polarity-positive)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-adjunct-clause-type adjunct-clause-type-n/a)
 (c-v-phase-aspect phase-aspect-neutral)
 (c-v-lexical-aspect activity-accomplishment)
 (c-secondary-type secondary-neutral)
 (c-event-modality event-modality-none)
 (c-function fn-main-clause)
 (c-minor-type minor-n/a)
 (c-copula-type copula-n/a)
 (c-v-absolute-tense past)
 (c-power-relationship power-peer)
 (c-our-shared-subject shared-subject-n/a)
 (c-question-gap gap-n/a))

Page 4

Outline

• Background
  – The AVENUE Machine Translation System
• Contents of the RTB
  – An inventory of grammatical meanings
  – Languages that have been elicited
• Tools for RTB creation
• Future work
  – Evaluation
  – Navigation

Page 5

AVENUE Machine Translation System

;SL: the old man, TL: ha-ish ha-zaqen
; type information and synchronous context-free rule
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
 ; alignments
 (X1::Y1)
 (X1::Y3)
 (X2::Y4)
 (X3::Y2)
 ; x-side constraints
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ; y-side constraints
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER))
 ; xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
)

Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI)

Rule learning: Katharina Probst
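
To make the alignment part of the rule concrete, here is a minimal Python sketch. It only applies the X::Y alignments to reorder and duplicate words and ignores all the agreement and definiteness constraints. The function name `apply_alignments` and the word-level `LEXICON` are assumptions made for illustration, not part of the AVENUE system.

```python
# Alignments copied from the rule on the slide: (x index, y index), 1-based.
ALIGNMENTS = [(1, 1), (1, 3), (2, 4), (3, 2)]

# Hypothetical word-level lexicon, used only for this illustration.
LEXICON = {"the": "ha", "old": "zaqen", "man": "ish"}


def apply_alignments(source_tokens, alignments, lexicon, target_length):
    """Fill each target slot with the translation of its aligned source word."""
    target = [None] * target_length
    for x, y in alignments:
        target[y - 1] = lexicon[source_tokens[x - 1]]
    return target


target = apply_alignments(["the", "old", "man"], ALIGNMENTS, LEXICON, 4)
print(" ".join(target))  # ha ish ha zaqen
```

Because X1 is aligned to both Y1 and Y3, the source determiner is emitted twice, which matches the target side [DET N DET ADJ].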

Page 6

AVENUE

• Rules can be written by hand or learned automatically.
• Hybrid
  – Rule-based transfer
  – Statistical decoder
  – Multi-engine combinations with SMT and EBMT

Page 7

AVENUE systems
(Small and experimental, but tested on unseen data)

• Hebrew-to-English
  – Alon Lavie, Shuly Wintner, Katharina Probst
  – Hand-written and automatically learned
  – Automatic rules trained on 120 sentences perform slightly better than about 20 hand-written rules.
• Hindi-to-English
  – Lavie, Peterson, Probst, Levin, Font, Cohen, Monson
  – Automatically learned
  – Performs better than SMT when training data is limited to 50K words

Page 8

AVENUE systems
(Small and experimental, but tested on unseen data)

• English-to-Spanish
  – Ariadna Font Llitjos
  – Hand-written, automatically corrected
• Mapudungun-to-Spanish
  – Roberto Aranovich and Christian Monson
  – Hand-written
• Dutch-to-English
  – Simon Zwarts
  – Hand-written

Page 9

Elicitation

• Get data from someone who is
  – Bilingual
  – Literate
  – Not experienced with linguistics

Page 10

English-Hindi Example

Elicitation Tool: Erik Peterson

Page 11

English-Chinese Example

Page 12

English-Arabic Example

Page 13

Elicitation

srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell

srcsent: Tú estás cayendo
tgtsent: eymi petu ütrünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling

srcsent: Tú caíste
tgtsent: eymi ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
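
The `aligned:` field in these records can be read as a list of pairs of 1-based word positions, with spaces separating the positions inside a multi-word group. The following Python sketch decodes it under that reading. The function names are hypothetical, and the parsing rule is inferred from the three records above, not from the elicitation tool itself.

```python
import re


def parse_alignment(text):
    """Turn '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", text):
        pairs.append(([int(i) for i in src.split()],
                      [int(j) for j in tgt.split()]))
    return pairs


def aligned_phrases(srcsent, tgtsent, aligned):
    """Pair up the source and target word groups named by the alignment."""
    src_words = srcsent.split()
    tgt_words = tgtsent.split()
    return [
        (" ".join(src_words[i - 1] for i in src),
         " ".join(tgt_words[j - 1] for j in tgt))
        for src, tgt in parse_alignment(aligned)
    ]


phrases = aligned_phrases("Tú estás cayendo",
                          "eymi petu ütrünagimi",
                          "((1,1),(2 3,2 3))")
print(phrases)
```

For the second record this pairs "Tú" with "eymi" and "estás cayendo" with "petu ütrünagimi".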

Page 14

Outline

• Background
  – The AVENUE Machine Translation System
• Contents of the RTB
  – An inventory of grammatical meanings
  – Languages that have been elicited
• Tools for RTB creation
• Future work
  – Evaluation
  – Navigation

Page 15

Size of RTB

• Around 3200 sentences

• 20K words

Page 16

Languages

• The set of feature structures with English sentences has been delivered to the Linguistic Data Consortium as part of the Reflex program.
• Translated (by LDC) into:
  – Thai
  – Bengali
• Plans to translate into:
  – Seven “strategic” languages per year for five years.
    • As one small part of a language pack (BLARK) for each language.

Page 17

Languages

• Feature structures are being reverse annotated in Spanish at New Mexico State University (Helmreich and Cowie)
  – Plans to translate into Guarani
• Reverse annotation into Portuguese in Brazil (Marcello Modesto)
  – Plans to translate into Karitiana
    • 200 speakers
• Plans to translate into Inupiaq (Kaplan and MacLean)

Page 18

Previous Elicitation Work

• Pilot corpus
  – Around 900 sentences
  – No feature structures
• Mapudungun
  – Two partial translations
• Quechua
  – Three translations
• Aymara
  – Seven translations
• Hebrew
• Hindi
  – Several translations
• Dutch

Page 19

Sample: clause level

• Mary is writing a book for John.
• Who let him eat the sandwich?
• Who had the machine crush the car?
• They did not make the policeman run.
• Mary had not blinked.
• The policewoman was willing to chase the boy.
• Our brothers did not destroy files.
• He said that there is not a manual.
• The teacher who wrote a textbook left.
• The policeman chased the man who was a thief.
• Mary began to work.

• Tense, aspect, transitivity
• Questions, causation and permission
• Interaction of lexical and grammatical aspect
• Volitionality
• Embedded clauses and sequence of tense
• Relative clauses
• Phase aspect

Page 20

Sample: noun phrase level

• The man quit in November.
• The man works in the afternoon.
• The balloon floated over the library.
• The man walked over the platform.
• The man came out from among the group of boys.
• The long weekly meeting ended.
• The large bus to the post office broke down.
• The second man laughed.
• All five boys laughed.

• Temporal and locative meanings
• Quantifiers
• Numbers
• Combinations of different types of modifiers
  – My book
    • Possession, definiteness
  – A book of mine
    • Possession, indefiniteness

Page 21

Example

srcsent: The large bus to the post office broke down.

((actor ((modifier ((mod-role mod-descriptor)
                    (mod-role role-loc-general-to)))
         (np-identifiability identifiable)
         (np-specificity specific)
         (np-biological-gender bio-gender-n/a)
         (np-animacy anim-inanimate)
         (np-person person-third)
         (np-function fn-actor)
         (np-general-type common-noun-type)
         (np-number num-sg)
         (np-pronoun-exclusivity inclusivity-n/a)
         (np-pronoun-antecedent antecedent-n/a)
         (np-distance distance-neutral)))
 (c-general-type declarative-clause)
 (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a)
 (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a)
 (c-comparator-function comparator-n/a)
 (c-causee-control control-n/a)
 (c-our-situations situations-n/a)
 (c-comparand-type comparand-n/a)
 (c-causation-directness directness-n/a)
 (c-source source-neutral)
 (c-causee-volitionality volition-n/a)
 (c-assertiveness assertiveness-neutral)
 (c-solidarity solidarity-neutral)
 (c-polarity polarity-positive)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-adjunct-clause-type adjunct-clause-type-n/a)
 (c-v-phase-aspect phase-aspect-neutral)
 (c-v-lexical-aspect activity-accomplishment)
 (c-secondary-type secondary-neutral)
 (c-event-modality event-modality-none)
 (c-function fn-main-clause)
 (c-minor-type minor-n/a)
 (c-copula-type copula-n/a)
 (c-v-absolute-tense past)
 (c-power-relationship power-peer)
 (c-our-shared-subject shared-subject-n/a)
 (c-question-gap gap-n/a))
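
A feature structure in this parenthesized notation can be read mechanically. The Python sketch below is a small recursive reader, assuming the text is a balanced list of `(feature value)` pairs in which a value may itself be a list of pairs. The function names and the shortened example string are illustrative assumptions; this is not the project's own reader.

```python
import re


def tokenize(text):
    """Split into parentheses and atoms such as 'np-number' or 'gap-n/a'."""
    return re.findall(r"[()]|[^\s()]+", text)


def parse(tokens):
    """Consume tokens from the front and return an atom or a nested list."""
    token = tokens.pop(0)
    if token != "(":
        return token
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # drop the closing parenthesis
    return items


def flat_features(tree):
    """Collect the pairs at this level whose value is a plain atom."""
    return {pair[0]: pair[1] for pair in tree
            if len(pair) == 2 and isinstance(pair[1], str)}


# Shortened, hypothetical structure in the same notation as the example.
text = ("((actor ((np-number num-sg)(np-person person-third)))"
        "(c-polarity polarity-positive)(c-v-absolute-tense past))")
tree = parse(tokenize(text))
clause = flat_features(tree)
actor = flat_features(tree[0][1])
print(clause)
print(actor)
```

Here `clause` holds the clause-level features and `actor` the noun-phrase features nested under the actor role.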

Page 22

Grammatical meanings vs syntactic categories

• Features and values are based on a collection of grammatical meanings
  – Many of which are similar to the grammatemes of the Prague Treebanks

Page 23

Grammatical Meanings

YES
• Semantic Roles
• Identifiability
• Specificity
• Time
  – Before, after, or during time of speech
• Modality

NO
• Case
• Voice
• Determiners
• Auxiliary verbs

Page 24

Grammatical Meanings

YES
• How is identifiability expressed?
  – Determiner
  – Word order
  – Optional case marker
  – Optional verb agreement
• How is specificity expressed?
• How are generics expressed?
• How are predicate nominals marked?

NO
• How are English determiners translated?
  – The boy cried.
  – The lion is a fierce beast.
  – I ate a sandwich.
  – He is a soldier.
    • Il est soldat.

Page 25

Argument Roles

• Actor
  – Roughly, deep subject
• Undergoer
  – Roughly, deep object
• Predicate and predicatee
  – The woman is the manager.
• Recipient
  – I gave a book to the students.
• Beneficiary
  – I made a phone call for Sam.

Page 26

Why not subject and object?

• Languages use their voice systems for different purposes.
• Mapudungun obligatorily uses an inverse-marked verb when third person acts on first or second person.
  – Verb agrees with undergoer
  – Undergoer exhibits other subjecthood properties
  – Actor may be object.
• Yes: How are actor and undergoer encoded in combination with other semantic features like adversity (Japanese) and person (Mapudungun)?
• No: How is English voice translated into another language?

Page 27

Argument Roles

• Accompaniment
  – With someone
  – With pleasure
• Material
  – (out) of wood
• About 20 more roles
  – From the Lingua checklist; Comrie & Smith (1977)
  – Many also found in tectogrammatical representations
• Around 80 locative relations
  – From Lingua checklist
• Many temporal relations

Page 28

Noun Phrase Features

• Person
• Number
• Biological gender
• Animacy
• Distance (for deictics)
• Identifiability
• Specificity
• Possession
• Other semantic roles
  – Accompaniment, material, location, time, etc.
• Type
  – Proper, common, pronoun
• Cardinals
• Ordinals
• Quantifiers
• Given and new information
  – Not used yet because of limited context in the elicitation tool.

Page 29

Clause level features

• Tense
• Aspect
  – Lexical, grammatical, phase
• Type
  – Declarative, open-q, yes-no-q
• Function
  – Main, argument, adjunct, relative
• Source
  – Hearsay, first-hand, sensory, assumed
• Assertedness
  – Asserted, presupposed, wanted
• Modality
  – Permission, obligation
  – Internal, external

Page 30

Other clause types
(Constructions)

• Causative
  – Make/let/have someone do something
• Predication
  – May be expressed with or without an overt copula.
• Existential
  – There is a problem.
• Impersonal
  – One doesn’t smoke in restaurants in the US.
• Lament
  – If only I had read the paper.
• Conditional
• Comparative
• Etc.

Page 31

Outline

• Background
  – The AVENUE Machine Translation System
• Contents of the RTB
  – An inventory of grammatical meanings
  – Languages that have been elicited
• Tools for RTB creation
• Future work
  – Evaluation
  – Navigation

Page 32

Tools for RTB Creation

• Change the inventory of grammatical meanings

• Make new RTBs for other purposes

Page 33

The Process

Mar 1, 2006

[Flow diagram. Labels: Feature Specification (list of semantic features and values); Feature Maps (which combinations of features and values are of interest: Clause-Level, Noun-Phrase, Tense & Aspect, Modality, …); Feature Structure Sets; Reverse Annotated Feature Structure Sets (add English sentences); The Corpus; Sampling; Smaller Corpus]

Page 34

Feature Specification

• XML Schema
• XSLT Script
• Human readable form
  – Feature: Causer intentionality
    • Values: intentional, unintentional
  – Feature: Causee control
    • Values: in control, not in control
  – Feature: Causee volitionality
    • Values: willing, unwilling
  – Feature: Causation type
    • Values: direct, indirect

Page 35

Feature Combination

• Person and number interact with tense in many fusional languages.
• In English, tense interacts with questions:
  – Will you go?

Page 36

Feature Combination Template

((predicatee ((np-general-type pronoun-type common-noun-type)
              (np-person person-first person-second person-third)
              (np-number num-sg num-pl)
              (np-biological-gender bio-gender-male bio-gender-female)))
 {[(predicate ((np-general-type common-noun-type)
               (np-person person-third)))
   (c-copula-type role)]
  [(predicate ((adj-general-type quality-type)
               (c-copula-type attributive)))]
  [(predicate ((np-general-type common-noun-type)
               (np-person person-third)
               (c-copula-type identity)))]}
 (c-secondary-type secondary-copula)
 (c-polarity #all)
 (c-general-type declarative)
 (c-speech-act sp-act-state)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-v-lexical-aspect state)
 (c-v-absolute-tense past present future)
 (c-v-phase-aspect durative))

Summarizes 288 feature structures, which are automatically generated.
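
One way to picture how a template summarizes many feature structures is as a cross product: every feature that lists several values contributes one choice, and each bracketed alternative contributes one more. The Python sketch below illustrates that idea on a reduced, hypothetical template. The dictionary layout and the `expand` function are assumptions, and the sketch does not try to reproduce the count of 288, since the values behind `#all` are not shown on the slide.

```python
from itertools import product


def expand(template, alternatives):
    """Yield one flat feature structure per combination of listed values."""
    names = list(template)
    structures = []
    for block in alternatives:
        for values in product(*(template[name] for name in names)):
            structure = dict(zip(names, values))
            structure.update(block)
            structures.append(structure)
    return structures


# Reduced template: each feature maps to the values it may take.
template = {
    "np-person": ["person-first", "person-second", "person-third"],
    "np-number": ["num-sg", "num-pl"],
    "c-v-absolute-tense": ["past", "present", "future"],
    "c-v-lexical-aspect": ["state"],
}
# Stand-ins for the three bracketed alternatives in the template.
copula_blocks = [
    {"c-copula-type": "role"},
    {"c-copula-type": "attributive"},
    {"c-copula-type": "identity"},
]

structures = expand(template, copula_blocks)
print(len(structures))  # 3 * 2 * 3 * 1 values, times 3 alternatives = 54
```

Adding one more multi-valued feature multiplies the total, which is why a short template can stand for hundreds of structures.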

Page 37

Annotation Tool

• Feature structure viewer
  – Various views of the feature structure
    • Omit features whose value is not-applicable
    • Group related features together
      – Aspect
      – Causation
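
The two views mentioned above can be sketched in a few lines of Python. This assumes that not-applicable values end in `n/a`, as they do in the example feature structures, and that related features can be found by a keyword in the feature name. The function names and the keyword rule are assumptions, not the annotation tool's actual behavior.

```python
def visible_features(features):
    """Drop features whose value marks them as not applicable."""
    return {name: value for name, value in features.items()
            if not value.endswith("n/a")}


def group_features(features, groups):
    """Sort features into labelled groups by a keyword in the feature name."""
    grouped = {label: {} for label in groups}
    grouped["other"] = {}
    for name, value in features.items():
        for label, keyword in groups.items():
            if keyword in name:
                grouped[label][name] = value
                break
        else:
            grouped["other"][name] = value
    return grouped


features = {
    "c-v-lexical-aspect": "activity-accomplishment",
    "c-v-phase-aspect": "phase-aspect-neutral",
    "c-causee-control": "control-n/a",
    "c-causation-directness": "directness-n/a",
    "c-polarity": "polarity-positive",
}
shown = visible_features(features)
grouped = group_features(shown, {"aspect": "aspect", "causation": "caus"})
print(grouped)
```

With this input both causation features are hidden because they are not applicable, and the two aspect features are shown together.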

Page 38

Outline

• Background
  – The AVENUE Machine Translation System
• Contents of the RTB
  – An inventory of grammatical meanings
  – Languages that have been elicited
• Tools for RTB creation
• Future work
  – Evaluation
  – Navigation

Page 39

Evaluation

• Current funding has not covered evaluation of the RTB.
  – Except for informal observations as it was translated into several languages.
• Does it elicit the meanings it was intended to elicit?
  – Informal observation: usually
• Is it useful for machine translation?

Page 40

Hard Problems

• Reverse annotating meanings that are not grammaticalized in English.
  – Evidentiality:
    • He stole the bread.
    • Context: Translate this as if you do not have first hand knowledge. In English, we might say, “They say that he stole the bread” or “I hear that he stole the bread.”

Page 41

Hard Problems

• Reverse annotating things that can be said in several ways in English.
  – Impersonals:
    • One doesn’t smoke here.
    • You don’t smoke here.
    • They don’t smoke here.
    • Credit cards aren’t accepted.
  – Problem in the Reflex corpus because space was limited.

Page 42

Navigation

• Currently, feature combinations are specified by a human.
• Plan to work in active learning mode.
  – Build seed RTB
  – Translate some data
  – Do some learning
  – Identify most valuable pieces of information to get next
  – Generate an RTB for those pieces of information
  – Translate more
  – Learn more
  – Generate more, etc.
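
The planned loop can be sketched as a generic control structure. The slide does not say how the "most valuable" items would be identified or what the learner is, so the Python sketch below takes the translator, the learner, and the scoring function as parameters. All names and the toy stand-ins in the usage lines are hypothetical.

```python
def active_elicitation(candidates, translate, learn, score, batch_size, rounds):
    """Repeatedly pick the highest-scoring items, translate them, and relearn."""
    collected = []
    model = None
    remaining = list(candidates)
    for _ in range(rounds):
        if not remaining:
            break
        # Rank the untranslated items by their value under the current model.
        remaining.sort(key=lambda item: score(model, item), reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        collected.extend((item, translate(item)) for item in batch)
        model = learn(collected)
    return model, collected


# Toy stand-ins: longer items count as more valuable, the "translation" is
# upper-casing, and the "model" is just the amount of data seen so far.
model, collected = active_elicitation(
    candidates=["a", "bb", "ccc", "dddd"],
    translate=str.upper,
    learn=len,
    score=lambda current_model, item: len(item),
    batch_size=2,
    rounds=2,
)
print(model, collected)
```

In a real setting the first round would correspond to the seed RTB, and `score` would depend on what the learner still needs to find out.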