Post on 19-Dec-2015

Treebanks are Not Naturally Occurring Data

Choices in Treebank Design and What They Mean for Natural Language Processing

Owen Rambow
Columbia University – CCLS

rambow@ccls.columbia.edu

Goal of this Talk

• In natural language processing, we all use syntactic representations (treebanks)

• In theoretical syntax, they use syntactic representations

• Problems in both communities:

– Little reflection on what these syntactic representations mean

– Representation confused with phenomenon which is the object of scientific investigation: natural language syntax

• Goal: increase meta-scientific understanding of syntactic representations for NLP

Overview

• What is a Syntactic Representation?
• Dependency and Phrase Structure
– Definitions
– Syntactic and representational constituency and dependency
– Intermediate projections
• What does this Mean for NLP?
• Takehome lessons

Acknowledgements

• Rajesh Bhatt and Fei Xia
– Bhatt, Rambow and Xia: Linguistic Phenomena, Analyses, and Representations: Understanding Conversion between Treebanks, IJCNLP 2011
• Hindi-Urdu Treebank Project
– Dipti Misra Sharma (IIIT Hyderabad)
– Martha Palmer (Colorado)
– NSF Grant

• All errors and infelicitous choices in this presentation are my own

What is a Syntactic Representation?

1. Syntactic phenomena, e.g.:
– Subject of a verb
– Relative clause
– Small clause
Linguists tend to agree on what phenomena exist

2. Mathematical representation type, e.g.:
– Phrase structure tree
– Dependency tree
– Or something more complicated: LFG, TAG, …

3. Formal syntactic description:
a. Mapping from phenomena to representations (of a particular type)
b. The representation chosen for a specific phenomenon is also called its analysis
c. The phenomena extracted from a representation are its interpretation
d. A formal description is a syntactic theory if it makes predictions

Example: Small Clauses

• Hindi
– आतिफ ने सीमा को बेवकूफ समझा
– Atif ne Seema ko bewakuuf samjhaa
– Atif Erg Seema Acc stupid consider.Pfv
– ‘Atif considered Seema stupid.’
• English
– Atif considered Seema stupid
– Atif considered her stupid

What is the Phenomenon?

• Syntactically and semantically, consider takes a clausal complement
– Atif considered [clause that she is stupid]
– Atif considered [clause her stupid]
• But two problems:
– No verb
– her is semantically subject of stupid but has accusative case, which is unusual (subjects are usually nominative)
• So:
– Atif considered [small clause her stupid]

What is the Representation Type?

• For this example, we will show dependency trees and phrase structure trees

Analysis 1a for Small Clauses: No Accusative Case Marking

• Structure represents her as subject but not accusative case marking of her

DS:
considers
  Subj: Atif
  Obj: stupid
    Subj: her

PS:
(S (NP Atif) (VP considers (S her (VP (AdjP stupid)))))

Analysis 1b for Small Clauses: Exceptional Case Marking

• Structure represents her as subject and accusative case marking through node label

DS:
considers
  Subj: Atif
  Obj-ECM: stupid
    Subj: her

PS:
(S (NP Atif) (VP considers (SC her (VP (AdjP stupid)))))

Close to analysis adopted in Chomsky (1981)

Note on DS and PS

• These analyses are intuitively very similar
• Formal notion: “consistency” (Fei Xia, see Bhatt, Rambow & Xia 2011)
– Intuition: a very simple and general algorithm can transform consistent DS to PS and vice versa

Analysis 2a for Small Clauses: General Monoclausal Analysis

• Structure represents accusative case marking of her (as object of matrix verb) but not her as semantic subject

DS:
considers
  Subj: Atif
  Obj: her
  Obj2: stupid

PS:
(S (NP Atif) (VP considers her (AdjP stupid)))

Analysis 2b for Small Clauses: Syntactic Monoclausal Analysis

• Structure represents accusative case marking of her (as object of matrix verb) and her as semantic subject using node label

DS:
considers
  Subj: Atif
  Obj: her
  ObjPred: stupid

DS with Neo-Paninian labels:
considers
  k1: Atif
  k2: her
  k2s: stupid

DS for the Hindi example:
समझा
  k1: आतिफ ने
  k2: सीमा को
  k2s: बेवकूफ

[Figure: PS tree for “Atif considers her stupid” with node labels S, NP, VP, SC, AdjP]

Neo-Paninian analysis from IIIT Hyderabad, used for DS in Hindi-Urdu Treebank

Analysis 3 for Small Clauses: Raising to Object

• Structure represents accusative case marking of her and her as semantic subject but requires empty category

DS:
considers
  Subj: Atif
  Obj: her1
  Obj-Pred: stupid
    Subj: e1

PS:
(S (NP Atif) (VP considers her1 (S e1 (VP (AdjP stupid)))))

Analysis used for PS in Hindi-Urdu Treebank

Comparison of Representations

• Less information: Tree 1a, Tree 2a
• Same information: Tree 1b, Tree 2b, Tree 3

[Figure: the five dependency trees for analyses 1a, 2a, 1b, 2b and 3, shown side by side]

Initial Summary: Syntactic Phenomena, Representation Types, Analyses

• Syntactic phenomena are the empirical data of syntax as part of the science of language
– Can be very similar across languages
• There can be several possible analyses
– Some have less information
– But there can be different analyses that represent the same information differently
• The analyses can be similar in DS and PS
• Lots of choices in treebank design!
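The point that different analyses can carry the same information can be sketched in a few lines of Python. The arc lists below transcribe trees 1a, 1b and 2b from the slides; the two `interpret_*` functions and the fact names (`semantic_subject`, `accusative`) are hypothetical stand-ins for what treebank guidelines would specify, not part of any treebank.

```python
# Each analysis of "Atif considers her stupid" as (head, label, dependent) arcs.
TREE_1A = [("considers", "Subj", "Atif"),
           ("considers", "Obj", "stupid"),
           ("stupid", "Subj", "her")]
TREE_1B = [("considers", "Subj", "Atif"),
           ("considers", "Obj-ECM", "stupid"),
           ("stupid", "Subj", "her")]
TREE_2B = [("considers", "Subj", "Atif"),
           ("considers", "Obj", "her"),
           ("considers", "ObjPred", "stupid")]


def interpret_clausal(arcs):
    """Assumed guideline for analyses 1a/1b: every Subj arc is a semantic
    subject; an Obj-ECM arc marks the embedded subject as accusative."""
    subject_of = {head: dep for head, label, dep in arcs if label == "Subj"}
    facts = set()
    for head, label, dep in arcs:
        if label == "Subj":
            facts.add(("semantic_subject", head, dep))
        elif label == "Obj-ECM" and dep in subject_of:
            facts.add(("accusative", subject_of[dep]))
    return facts


def interpret_monoclausal(arcs):
    """Assumed guideline for analysis 2b: Obj is accusative; ObjPred is
    predicated of the Obj that shares its head."""
    object_of = {head: dep for head, label, dep in arcs if label == "Obj"}
    facts = set()
    for head, label, dep in arcs:
        if label == "Subj":
            facts.add(("semantic_subject", head, dep))
        elif label == "Obj":
            facts.add(("accusative", dep))
        elif label == "ObjPred" and head in object_of:
            facts.add(("semantic_subject", dep, object_of[head]))
    return facts


# 1b and 2b encode the same phenomena in different shapes; 1a encodes fewer.
print(interpret_clausal(TREE_1B) == interpret_monoclausal(TREE_2B))
print(interpret_clausal(TREE_1A) < interpret_clausal(TREE_1B))
```

The trees differ, but once each is read through its own interpretation the extracted phenomena coincide for 1b and 2b, while 1a yields a proper subset.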

Representation Types: Dependency and Phrase Structure

• Dependency Tree (DS):
– One label alphabet, words (= words in a sentence)
– All nodes labeled with words or empty strings
• Phrase Structure Tree (PS):
– Two disjoint label alphabets, terminals (= words in sentence) and nonterminals
– All and only interior nodes are labeled with nonterminals
– Leaves are labeled with terminals or empty strings
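These two definitions can be written as small well-formedness checks. This is a minimal sketch: the nested-tuple encoding, the `NONTERMINALS` set and the function names are assumptions made for illustration.

```python
NONTERMINALS = {"S", "NP", "VP", "AdjP"}  # assumed nonterminal alphabet
EMPTY = ""                                # label of an empty category


def is_ps_tree(node, words):
    """PS: a leaf is a str (terminal or empty string); an interior node is
    (label, children) with a nonterminal label and at least one child."""
    if NONTERMINALS & set(words):
        return False  # the two alphabets must be disjoint
    if isinstance(node, str):
        return node == EMPTY or node in words
    label, children = node
    return (label in NONTERMINALS
            and len(children) > 0
            and all(is_ps_tree(child, words) for child in children))


def is_ds_tree(node, words):
    """DS: every node is (word, children), labeled with a word or the
    empty string, whether it is a leaf or an interior node."""
    word, children = node
    return ((word == EMPTY or word in words)
            and all(is_ds_tree(child, words) for child in children))


WORDS = {"Atif", "considers", "her", "stupid"}
PS = ("S", [("NP", ["Atif"]),
            ("VP", ["considers", ("S", ["her", ("VP", [("AdjP", ["stupid"])])])])])
DS = ("considers", [("Atif", []), ("stupid", [("her", [])])])

print(is_ps_tree(PS, WORDS), is_ds_tree(DS, WORDS))
```

Note that nothing in either check mentions word order or forbids empty strings, which is why those properties do not distinguish the two tree types.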

Non-Differences Between DS and PS

• WRONG: “DS can’t have empty categories”

considers
  Subj: Atif
  Obj: her1
  Obj-Pred: stupid
    Subj: e1

Non-Differences Between DS and PS

• WRONG: “DS must be unordered, PS must be ordered”

that the dancer a flower the violinist gives

Non-Differences Between DS and PS

• WRONG: “PS can’t be non-projective/have crossing arcs”

The Double Life of Dependency and of Constituency

• There are two meanings to “dependency”:
– A type of tree (no nonterminal labels)
– A type of syntactic phenomenon, namely a relation between words (e.g., “subjecthood”)
• There are two meanings to “constituency”:
– A type of tree (terminal labels on leaf nodes; = PS tree)
– A type of syntactic phenomenon, namely a grouping of words into phrases
• Tree types are mathematical objects = tools for linguists to represent data
• Syntactic phenomena are empirical facts = data for linguists
• Don’t confuse data with its representation!

Constituency in PS Representation but not in Syntax

• VP premodifiers in PTB can be either sister to VP or in VP; there is no difference in meaning
– (S (NP-SBJ Sandy) (ADVP-TMP often) (VP throws (NP curves)))
– (S (NP-SBJ Sandy) (VP (ADVP-TMP often) throws (NP curves)))

• “The variation is in general free” (PTB Guidelines p. 137)
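As a small illustration, the two bracketings can be parsed and compared. The sketch assumes the bracket strings are balanced, with the closing bracket after Sandy that the PTB notation requires; `parse` and `leaves` are hypothetical helper names, not a treebank tool.

```python
def parse(text):
    """Parse a PTB-style bracketing into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = node()
    assert pos == len(tokens), "trailing material after the tree"
    return tree


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]


SISTER = parse("(S (NP-SBJ Sandy) (ADVP-TMP often) (VP throws (NP curves)))")
INSIDE = parse("(S (NP-SBJ Sandy) (VP (ADVP-TMP often) throws (NP curves)))")

print(SISTER != INSIDE)                  # the representations differ
print(leaves(SISTER) == leaves(INSIDE))  # the sentence is the same
```

The two trees are different mathematical objects over the same sentence, and according to the guidelines the difference corresponds to no syntactic or semantic distinction.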

Constituency in Syntax but not in PS Representation

• Flat NP in PTB
– Syntax of Adj-N-N sequences:
  • (green (baby bassinet))
  • ((small business) administration)
– Representation in Penn Treebank:
  • (green baby bassinet)
  • (small business administration)
• Conscious decision not to represent NP-internal pre-nominal structure

Dependency in DS Representation but not in Syntax

• --- links in CATiB dependency treebank of Arabic
– Signal that connected words are not analyzed by annotators for dependency
• Pof link in DS part of Hindi-Urdu Treebank
– Used for multiword expressions which form a unit syntactically (and thus have no structure) but may occur apart in sentence

Dependency in Syntax but not in DS Representation

• Small clauses in DS for Hindi-Urdu Treebank: dependency between embedded predicate and its semantic subject not marked as dependency link in tree

considers
  k1: Atif
  k2: her
  k2s: stupid

Aren’t DS and PS Representations Complementary? NO!

• Syntactic dependency can be encoded in PS, and typically is

• Usual convention: attachment in projection shows type of dependency

Aren’t DS and PS Representations Complementary? NO!

• Syntactic constituency is represented in DS
• Usual convention: each node is the word, and the head of the phrase containing it and all descendants
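Under that convention, the constituent headed by a word can be read off a dependency tree as the word plus all of its descendants. A minimal sketch follows; the sentence “Rajesh eats sweet laddus” and the head-index encoding are assumptions chosen for illustration.

```python
WORDS = ["Rajesh", "eats", "sweet", "laddus"]
# HEADS[i] is the index of the head of word i; None marks the root.
HEADS = [1, None, 3, 1]


def constituent(i, heads):
    """Indices of word i and all of its descendants, in sentence order."""
    members = {i}
    changed = True
    while changed:  # keep adding dependents until nothing new is found
        changed = False
        for dep, head in enumerate(heads):
            if head in members and dep not in members:
                members.add(dep)
                changed = True
    return sorted(members)


def phrase(i):
    """The words of the constituent headed by word i."""
    return " ".join(WORDS[j] for j in constituent(i, HEADS))


print(phrase(3))  # phrase headed by "laddus"
print(phrase(1))  # phrase headed by "eats": the whole sentence
```

Every node of the DS tree thus stands for a phrase as well as a word, which is the sense in which syntactic constituency is represented in DS.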

But What About Intermediate Projections?

• If PS:
– only has preterminals (eg V) and maximal projections (eg S);
– and no attachment happens at preterminal;
then we have trivial correspondence to DS (consistency in the formal sense of Xia)

• Then PS and DS must have the exact same expressive power

No Intermediate Projections: DS and PS Representationally Equivalent

PS:
(S (N Rajesh) (Adv happily) (V eats) (N laddus))

DS:
eats
  Rajesh
  happily
  laddus

What about the POS tags in the DS?

DS:
eats_V
  Rajesh_N
  happily_Adv
  laddus_N

What about grammatical functions?

PS:
(S (N_Subj Rajesh) (Adv_Adj happily) (V eats) (N_Obj laddus))

DS:
eats_V
  Subj: Rajesh_N
  Adj: happily_Adv
  Obj: laddus_N
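The trivial correspondence can be sketched as a pair of conversion functions. This is only an illustration under stated assumptions: the projection table `MAXPROJ`, the tuple encodings and the rule that a word without dependents stays a bare preterminal are choices made here, not the algorithm of Bhatt, Rambow and Xia.

```python
MAXPROJ = {"V": "S", "N": "NP", "Adv": "AdvP"}  # assumed projection table


def ds_to_ps(node):
    """DS node = (position, word, tag, dependents) -> PS (label, children)."""
    position, word, tag, deps = node
    preterminal = (tag, [word])
    if not deps:
        return preterminal
    parts = [(position, preterminal)] + [(d[0], ds_to_ps(d)) for d in deps]
    ordered = [tree for _, tree in sorted(parts, key=lambda p: p[0])]
    return (MAXPROJ[tag], ordered)


def ps_to_ds(tree, counter=None):
    """Inverse direction: the head is the preterminal child whose tag
    projects to the label of the phrase."""
    counter = counter if counter is not None else [0]
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):  # preterminal
        node = (counter[0], children[0], label, [])
        counter[0] += 1
        return node
    converted = [ps_to_ds(child, counter) for child in children]
    head = next(n for n, c in zip(converted, children)
                if len(c[1]) == 1 and isinstance(c[1][0], str)
                and MAXPROJ.get(c[0]) == label)
    deps = [n for n in converted if n is not head]
    return (head[0], head[1], head[2], deps)


DS = (2, "eats", "V", [(0, "Rajesh", "N", []),
                       (1, "happily", "Adv", []),
                       (3, "laddus", "N", [])])
PS = ("S", [("N", ["Rajesh"]), ("Adv", ["happily"]),
            ("V", ["eats"]), ("N", ["laddus"])])

print(ds_to_ps(DS) == PS, ps_to_ds(PS) == DS)
```

Because each maximal projection has exactly one preterminal head and all attachment happens at the maximal projection, the round trip loses nothing in either direction.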

Intermediate Projections

• If PS only has preterminals (eg V) and maximal projections (eg S), we have trivial correspondence to DS

• No direct equivalent of intermediate projections (eg VP) in DS

Intermediate Projections: DS and PS Different!

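[Figure: PS tree for “Rajesh happily eats laddus” with two VP nodes under S (node labels: S, NP, VP, VP, AdvP, NP, N, Adv, V), shown next to the DS tree below]

eats_V
  Subj: Rajesh_N
  Adj: happily_Adv
  Obj: laddus_N

Note: X – X’ – XP = maximal projection
EXCEPT: V – VP – S: VP = intermediate projection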

Intermediate Projections

• If PS only has preterminals (eg V) and maximal projections (eg S), we have trivial correspondence to DS

• No direct equivalent of intermediate projections (eg VP) in DS – we have a problem?

• What is important: how intermediate projections are used in the formal syntactic description

• A node by itself is not information about the syntax; only a node together with its interpretation is

Why Have Intermediate Projections?

• Evidence for IntProj as a constituent
– VP fronting in English
• Use of a VP to represent semantic facts
– Interpretation of adverbs in English
• Use of IntProj to represent syntactic facts
– Subject and object of verbs in English
– Genitive versus adjectival modification in Arabic
– Also:
  • Arguments of verbs in Hindi
  • Verb second in German

Intermediate Projection as Constituent: VP Fronting in English

• VP can be moved as a constituent
– Rajesh said he would [VP eat laddus] and [VP eat laddus] Rajesh did

VP Fronting in English

[Figure: PS tree for “eat laddus Rajesh did”; node labels: Sbar, S, VP, VP, NP, NP, N, V, Aux]

What can Dependency Do?

[Figure: four dependency trees over the words Rajesh, did, eat and laddus, with arc labels Subj, Obj, Aux, Comp and Top; one tree uses an empty node ε1 coindexed with eat1]

• If auxiliary did is head (basically, we have a VP!)
• If eat is head
• Issue: two possible analyses of auxiliary-verb structures in DS

Intermediate Projections and Semantic Scope

• Possible claim:
– VP adverbs characterize manner of event
  • John quickly/happily played chess
– S adverbs characterize speaker attitude towards event:
  • Happily/fortunately (enough), John lost against me at chess
• Problem: interpretation independent of word order
– Quickly/happily John played chess
– John happily/oddly (enough) lost against me at chess
• Conflict between using VP to identify subject and using VP to describe semantics of adverbs
– Would need traces to show scope
• In fact, Penn Treebank for English does not use VPs for scope

Use of VP to Indicate Subject

DS:
considers
  Subj: Atif
  Obj: her1
  Obj-Pred: stupid
    Subj: e1

PS:
(S (NP Atif) (VP considers her1 (S e1 (VP (AdjP stupid)))))

Projections: Noun Modification in Arabic Treebank

• Idafa (genitive) construction:
– (NP (N بيت/byt/house) (NP (N المصري/AlmSry/the-Egyptian)))
  the house of the Egyptian
• Modification construction:
– (NP (N الوزير/Alwzyr/the-minister) (Adj المصري/AlmSry/the-Egyptian))
  the Egyptian minister
• Syntactic difference signaled by projection to X or XP
• Can’t do in DS – need labels

Conclusion from Intermediate Projections

• Intermediate projections (eg VP, N’) are specific to PS

• They can be used to encode syntactic or semantic information

• But: DS can express the same information!
• Conclusion: the existence of intermediate projections in PS is not a reason to say that PS is more expressive than DS
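A toy round trip makes the conclusion concrete for one use of VP, namely identifying the subject. The sketch assumes a simple transitive clause and the tuple encodings shown; the function names are hypothetical. In PS the subject is the NP outside the VP; in DS the same fact is carried by the arc label Subj.

```python
def ds_to_ps(verb, arcs):
    """arcs maps the labels 'Subj' and 'Obj' to words. The Subj dependent is
    placed outside the VP, the Obj dependent inside it."""
    return ("S", [("NP", [arcs["Subj"]]),
                  ("VP", [verb, ("NP", [arcs["Obj"]])])])


def ps_to_ds(ps):
    """Read the labels back from position: the NP sister of VP is the
    subject, the NP inside VP is the object."""
    _, (subject_np, vp) = ps
    verb, object_np = vp[1]
    return verb, {"Subj": subject_np[1][0], "Obj": object_np[1][0]}


DS = ("eat", {"Subj": "Rajesh", "Obj": "laddus"})
PS = ds_to_ps(*DS)

print(PS)
print(ps_to_ds(PS) == DS)
```

The VP node and the Subj/Obj labels are two encodings of the same distinction, so converting in either direction preserves the interpretation.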

What Does This Mean for NLP?

• Treebanks are not naturally occurring data

• The guidelines are painstakingly produced by linguists and represent a formal description of the language

• Annotators understand a sentence, determine what syntactic phenomena exist, and use the guidelines to choose an analysis for the sentence (a structure)

• Users of the treebank can use the guidelines to interpret the structures and get back the syntactic phenomena present

• These phenomena, and not their representation in the treebank, can be used for NLP in whatever representation chosen by the researcher!

• There is already lots of linguistics in our resources, we just need to make use of that linguistic information!

Conclusion

• 4 takehome lessons

Lesson 1: There are All Sorts of Trees

Fallacy: DS trees are unordered, have no empty categories, … and PS trees are ordered and …

Truth: DS and PS trees are graphs that can have all sorts of properties

Lesson 2: Trees Need an Interpretation

Fallacy: Trees have intrinsic syntactic meaning

Truth: Trees only gain syntactic meaning through interpretation; in treebanks: guidelines

Lesson 3: Conversion between Formats Depends on Interpretation

Fallacy: It is easier/harder to convert DS to PS than vice versa

Truth: conversion means preserving interpretation; thus it depends on the two representations between which conversion happens

Lesson 4: There are two Meanings to “Constituency” and “Dependency”

Fallacy: DS trees only show dependency and PS trees only show constituency

Truth: there are distinct mathematical (data structure) notions and linguistic notions of “dependency” and “phrase structure”; each tree type can show each

Thank you!
