Treebanks and MWEs (Part 1) Jan Hajič, Pavel Straňák, Jiří Mírovský Institute of Formal and Applied Linguistics & LINDAT/CLARIN School of Computer Science Faculty of Mathematics and Physics Charles University in Prague Czech Republic


19.1.2015 PARSEME Training School Prague 2

Outline

• Treebanks
  – Phrase- (constituency-) based: The Penn Treebank
  – Dependency: The Prague Dependency Treebanks
• The Penn Treebank (basics)
• The Prague Dependency Treebank
  – Layers of Annotation
    • Morphology
    • Syntax
    • Semantics
    • Valency

THE PENN TREEBANK


Phrase- vs. Dependency-Based Treebanks

• The original: The Penn Treebank
  – Phrase-based style; good for parsing with context-free grammars (CFGs)
• Followers
  – Almost all Penn-based treebanks
    • Chinese, Arabic, Korean, …
  – Negra (German), many others
• Now: dependency parsing prevails
• Conversion from phrase-based treebanks
  – Might lose information, heads added “ad hoc”
• “Native” dependency treebanks: annotated as such
  – Considered “better”
  – Hindi/Urdu, TIGER (sort of); both styles manually annotated
  – PDT (of course) and similar ones
    » PDT-style treebanks: Danish, Croatian, Slovene, Greek, Latin

The Penn Treebank

• Published (first) in 1993, now LDC99T42 (www.ldc.upenn.edu)
  – First the Wall Street Journal part (1 mil. words, 2312 documents)
• Added other text types
  – ATIS corpus (dialogs, travel reservations)
  – Brown corpus annotated for syntax
  – Switchboard (spoken language, tel. conversations)

Penn Treebank Format

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

• POS tag, e.g. NNS (noun, plural); the node carrying it is a “preterminal”
• Phrase label, e.g. NP (noun phrase)
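A minimal sketch of how the bracketed format can be read programmatically, assuming one tree per string and plain whitespace-separated tokens. The function names `parse_ptb` and `tagged_words` are illustrative and not part of any Penn Treebank tooling.

```python
def parse_ptb(text):
    """Parse one Penn Treebank bracketing into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = ""  # the outermost wrapper bracket carries no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word under a preterminal
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def tagged_words(tree):
    """Yield (word, POS tag) pairs from the preterminal nodes, left to right."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_words(child)


SAMPLE = (
    "( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) "
    "(ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) "
    "(VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) "
    "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) "
    "(NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))"
)
TREE = parse_ptb(SAMPLE)
PAIRS = list(tagged_words(TREE))
```

Here `PAIRS` holds the word/tag sequence of the example sentence, and phrase labels such as NP-SBJ remain available in `TREE` together with their function-tag suffixes.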

The Penn Treebank(s)

• Extensions
  – Annotation of named entities, co-reference (BBN)
  – Function labels (SBJ, OBJ, TMP, ...)
    • cf. also previous slides
  – PropBank
    • Penn Treebank syntax + predicate-argument relations, added “frame files” (predicate dictionary)

      (S (NP-SBJ (PRP I Arg0)) (VP (VBD gave Pred) (NP-DOBJ (PRP him Arg1)) (NP-IOBJ (DET the) (NN book Arg2))) ... )

  – NomBank
    • Like PropBank, but for nouns and their “arguments”
• Other languages (Chinese, Arabic, ...)

THE PRAGUE DEPENDENCY TREEBANK


The Prague Dependency Treebanks: the Basics

• Original Treebank: PDT 1.0, 2001 (morph., dep. syntax)
• First full release: PDT 2.0
  – http://ufal.mff.cuni.cz/pdt2.0
    • LDC2006T01, see http://www.ldc.upenn.edu
  – Now: PDT 3.0: http://ufal.mff.cuni.cz/pdt3.0
• Basic general features
  – Multilayered annotation, interlinked layers
  – Dependency-based syntax (both surface and deep)
  – Information structure of the sentence (topic/focus)
  – Grammatical and basic textual coreference
  – New: discourse relations, MWEs
• Languages: Czech, English (also parallel), Arabic
  – Student work on “samples”: Indonesian, Urdu, Russian, …
  – Spoken: work started on Czech and English (non-parallel, dialogs)

The Prague Dependency Treebank

• Three basic layers of annotation
  – Morphemic layer
  – Surface syntax (“analytical”) layer
  – “Tectogrammatical” layer: underlying syntax, semantic roles (valency), information structure, coreference
• Size
  – 830,000 words (tokens) = 50,000 sentences in 3,165 full documents (texts)
• Format
  – Prague Markup Language (XML-based)
  – Now also: .treex format
    • For smooth use in the Treex platform
    • http://ufal.mff.cuni.cz/treex

PDT (Czech) Data

• 4 sources:
  – Lidové noviny (daily newspaper, incl. extra sections)
  – DNES (Mladá fronta Dnes) (daily newspaper)
  – Vesmír (popular science magazine, monthly)
  – Českomoravský Profit (economic journal, weekly)
• Full articles selected
  – article ~ DOCUMENT (basic corpus unit)
• Time period: 1990-1995
• 1.8 million tokens (~110,000 sentences total)

PDT Annotation Layers

• L0 (w) Words (tokens)
  – automatic segmentation and markup only
• L1 (m) Morphology
  – Tag (full morphology, 13 categories), lemma
• L2 (a) Analytical layer (surface syntax)
  – Dependency, analytical dependency function
• L3 (t) Tectogrammatical layer (“deep” syntax)
  – Dependency, “functor”, grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
  – PDT 3.0: mass, clauses, formemes, discourse, ...

[Side labels on the slide: PDT 1.0 (2001), PDT 2.0 (2006)]


Morphological Attributes

Tag: 13 categories
Example: AAFP3----3N----

  Pos.  Char  Meaning
  1     A     Adjective
  2     A     Regular
  3     F     Feminine
  4     P     Plural
  5     3     Dative
  6     -     no poss. gender
  7     -     no poss. number
  8     -     no person
  9     -     no tense
  10    3     superlative
  11    N     negated
  12    -     no voice
  13    -     reserve1
  14    -     reserve2
  15    -     base var.

Lemma: POS-unique identifier
  Books/verb -> book-1, went -> go, to/prep. -> to-1

Ex.: nejnezajímavějším “(to) the most uninteresting”
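The positional tag can be unpacked with a few lines of code. The sketch below assumes the 15-character layout of the example tag (13 categories plus two reserve positions); the position names in `POSITIONS` are labels chosen here for illustration, derived from the slide's glosses.

```python
# Illustrative names for the 15 tag positions (13 categories + 2 reserves).
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Var",
]


def decode_tag(tag):
    """Map a positional tag to {position name: value}, skipping '-' (not applicable)."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: ch for name, ch in zip(POSITIONS, tag) if ch != "-"}


DECODED = decode_tag("AAFP3----3N----")
```

For the example tag, only seven positions carry a value; the rest are "-" and are dropped from the result.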


Layer 2 (a-layer): Analytical Syntax

• Dependency + Analytical Function

Example: The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.

[Figure: analytical dependency tree of the example, with callouts “dependent” and “governor” on one edge]

Analytical Syntax: Functions

• Main (for [main] semantic lexemes):
  – Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom
  – “Double” dependency: AtrAdv, AtrObj, AtrAtr
• Special (function words, punctuation, ...):
  – Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY
  – Prepositions/Conjunctions: AuxP, AuxC
  – Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK
• Structural
  – Ellipsis: ExD; Coordination etc.: Coord, Apos
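A small sketch of how an analytical tree can be stored: every word node keeps a pointer to its governor and one analytical function. The class `ANode`, the helper `dependents`, and the toy analysis of “Baker bakes rolls.” are assumptions made for illustration; they are not taken from the PDT data or its tools.

```python
from dataclasses import dataclass


@dataclass
class ANode:
    position: int  # 1-based word order; 0 is reserved for the technical root
    form: str
    afun: str      # analytical function, e.g. Pred, Sb, Obj, AuxK
    head: int      # position of the governor (0 = technical root)


def dependents(nodes, head_position):
    """Return the nodes governed by the node at head_position, in word order."""
    return [n for n in nodes if n.head == head_position]


# Toy analysis (an assumption, not corpus data): the predicate and the
# sentence-final punctuation both hang on the technical root.
SENTENCE = [
    ANode(1, "Baker", "Sb", 2),
    ANode(2, "bakes", "Pred", 0),
    ANode(3, "rolls", "Obj", 2),
    ANode(4, ".", "AuxK", 0),
]
```

Calling `dependents(SENTENCE, 2)` returns the subject and object of the verb, which is the “dependency + analytical function” pairing the layer is built on.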


Tectogrammatical Annotation

• Underlying (deep) syntax
• 5 sublayers (integrated and/or standoff annotation):
  – dependency structure, (detailed) functors
    • valency annotation
  – topic/focus and deep word order
  – coreference (mostly grammatical only)
  – discourse
  – all the rest (grammatemes):
    • detailed functors
    • underlying gender, number, mass nouns, ...
• Total: 39 attributes (vs. 5 at m-layer, 2 at a-layer)

Tectogrammatical vs. analytical syntax

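• TR (tectogrammatical representation): no function words
• AR (analytical representation): all words

Example: In practice, that procedure will require making of certified copies.

[Figure: both trees for the example; callouts mark the predicate verb, the “Location” node, and the re-inserted elided actor of “making”]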

Dependency Structure

• Similar to the surface (Analytical) layer... ...but:
  – certain nodes deleted
    • auxiliaries, non-autosemantic words, punctuation
    • (some) multiword expressions -> 1 node
  – some nodes added
    • based on word (mostly verb, noun) valency
    • some ellipsis resolution
  – detailed dependency relation labels (functors)

Tectogrammatical Functors

• “Actants”: ACT, PAT, EFF, ADDR, ORIG
  – modify: verbs, nouns, adjectives
  – cannot repeat in a clause, usually obligatory
• Free modifications (~ 50), semantically defined
  – can repeat; optional, sometimes obligatory
  – Ex.: LOC, DIR1, ...; TWHEN, TTILL, ...; RSTR; BEN, ATT, ACMP, INTT, MANN; MAT, APP; ID, DPHR, CPHR, ...
• Special
  – Coordination, Rhematizers, Foreign phrases (#Forn), ...

[Callouts on the slide: “syntactic”, semantic, MWEs]

Deep Word Order Topic/Focus

• Example:
  – Baker bakes rolls. vs. Baker(IC) bakes rolls. (IC: intonation center)

[Figure: analytical dep. tree of the example]

Deep Word Order Topic/Focus

• Deep word order:
  – from “old” information to the “new” one (left-to-right) at every level (head included)
  – projectivity by definition (almost...)
    • i.e., partial level-based order -> total d.w.o.
• Topic/focus/contrastive topic
  – attribute of every node (t, f, c)
  – restricted by d.w.o. and other constraints
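Projectivity can be checked mechanically. The sketch below uses the usual definition (every node lying between a governor and its dependent in the linear order must be a descendant of that governor); the list-of-heads encoding and the name `is_projective` are assumptions for illustration, not the PDT tooling.

```python
def is_projective(heads):
    """heads[i] is the index of node i's governor, or -1 for the root."""

    def is_descendant(node, ancestor):
        # Walk up the tree from node; succeed if we pass through ancestor.
        while node != -1:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep, head in enumerate(heads):
        if head == -1:
            continue
        lo, hi = sorted((dep, head))
        # Every node strictly between the two ends of the edge must be
        # dominated by the governor of that edge.
        for between in range(lo + 1, hi):
            if not is_descendant(between, head):
                return False
    return True


# "Baker bakes rolls": both nouns depend on the verb.
PROJECTIVE = is_projective([1, -1, 1])
# Node 1 depends on node 3, but the root (node 2) lies between them.
CROSSING = is_projective([2, 3, -1, 2])
```

A deep word order that is projective by definition should pass this check for every tree; the second call shows a crossing configuration that fails it.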


Coreference

• Grammatical (easy)
  – relative clauses
    • which, who
      – Peter and Paul, who ...
  – control
    • infinitival constructions
      – John promised to go home
  – reflexive pronouns
    • {him,her,them}self(-ves)
      – Mary saw herself in ...

[Figure: tectogrammatical tree for “John promised to go home”: promise (PRED) governs John (ACT) and go (PAT); go governs he (ACT) and home (DIR3)]

Coreference

• Textual
  – Ex.: Peter moved to Iowa after he finished his PhD.

[Figure: tectogrammatical tree for the example: move (PRED) governs Peter (ACT), Iowa (DIR1) and finish (TWHEN); finish governs he (ACT) and PhD (PAT); PhD governs he (APP)]

Grammatemes

• Detailed functors (“subfunctors”)
  – needed for some functors:
    • TWHEN: before/after
    • LOC: next-to, behind, in-front-of, ...
    • also: ACMP, BEN, CPR, DIR1, DIR2, DIR3, EXT
• Lexical (underlying)
  – number (Sg/Pl), tense, modality, degree of comparison, mass-noun?; is_person_name, is_dsp_root, ...

[Callout on the slide: MWEs]

VALENCY IN PDT


Prague Dependency Treebank & Valency

• Valency in the PDT
  – Valency lexicon for PDT
  – General valency lexicon
• Valency in deep vs. surface syntax
  – Links between the layers w.r.t. valency
• Valency and word sense
  – Sense-disambiguated occurrences:
    • Links from data to the lexicon
• Valency in translation, text generation

Definition of Valency

• Ability (“desire”) of words (verbs, nouns, adjectives) to combine with other units of meaning
• Properties of valency:
  – Specific for every word meaning (in general)
    • leave: sb left sth for sb vs. sb left from somewhere
    • similar to PropBank leave.02 vs. leave.01
  – Typically strongly correlates with surface form (Czech)
    • morphological case (~ ending), preposition+case, ...
  – Semantic constraints

Structure of Valency

• word (lemma)
  – word sense group 1
    • valency frame:
      – slot1 slot2 slot3
        • surface expression
  – word sense group 2
    • ...

Example: vyměnit (to replace)
  vyměnit1:  ACT    PAT    EFF
             Nom.   Acc.   za+Acc.
  vyměnit2:  ...
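The lemma / sense group / frame / slot hierarchy maps naturally onto nested records. The following is a sketch only: the class names, the `LEXICON` dictionary and the `surface_form` helper are hypothetical and do not reflect the actual PDT-Vallex file format; the single frame is the vyměnit1 example from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    functor: str  # e.g. "ACT"
    surface: str  # surface expression, e.g. "Nom." or "za+Acc."


@dataclass
class Frame:
    frame_id: str  # one word sense group, e.g. "vyměnit1"
    slots: list


@dataclass
class Entry:
    lemma: str
    gloss: str
    frames: list = field(default_factory=list)


LEXICON = {
    "vyměnit": Entry(
        "vyměnit",
        "to replace",
        [Frame("vyměnit1", [Slot("ACT", "Nom."), Slot("PAT", "Acc."), Slot("EFF", "za+Acc.")])],
    )
}


def surface_form(lexicon, lemma, frame_id, functor):
    """Look up how a functor is expressed on the surface; None if unknown."""
    entry = lexicon.get(lemma)
    if entry is None:
        return None
    for frame in entry.frames:
        if frame.frame_id == frame_id:
            for slot in frame.slots:
                if slot.functor == functor:
                    return slot.surface
    return None
```

A corpus node that stores a `frame_id` can then be resolved against the lexicon, which is the kind of data-to-lexicon link the valency annotation relies on.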

PDT-Vallex Entry

• dosáhnout: “to reach”, “to get [sb to do sth]”

• browser/user-formatted example:

[Figure: browser view of the entry]

MWEs in PDT-Vallex

• Types included:
  – Reflexive particle (se, si) -> smát_se (t_lemma)
    • smát se – to laugh
    • všimnout si – to notice
  – Idiomatic constructions -> DPHR (argument)
    • dosáhnout svého – to achieve one’s goals
    • běhá mi mráz po zádech – to give me the shivers
  – Light verb constructions (and similar) -> CPHR (argument)
    • uzavřít dohodu – to agree [on sth], strike an agreement, ...
    • vzbuzovat pochybnosti – to doubt, to raise doubts

Corpus ↔ Valency Lexicon

• Corpus:
  – Sentence 2035:
  – Sentence 15345:
  – Sentence 51042:
• Lexicon:
  ENTRY: uzavřít (to close)
    vf1: ACT(.1) CPHR({smlouva}.4)
      ex.: u. dohodu (close a contract)
    vf2: ACT(.1) PAT(.4)
      ex.: u. pokoj (close a room, house)

Valency & Text Generation

• Using valency for...
  – ...getting the correct (lemma, tag) of verb arguments
• Example:
  – Tectogrammatical tree: starat_se (PRED, “to take care of”) governs Martin (ACT) and tygr (PAT, “tiger”)
  – VALLEX entry: starat (se) ACT(.1) PAT(o.[.4])
  – Resulting (lemma, tag) pairs:
      Martin  ....1..........
      se      ...............
      starat  V..............
      o       ...............
      tygr    ....4..........
  – Martin se stará o tygry. “Martin takes care of tigers.”
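The lookup in this example can be sketched in a few lines. The code assumes the frame notation shown on the slide, where `.1` means “case 1” and `o.[.4]` means “preposition o governing case 4”; the regular expression and the function names are illustrative, and word order, the verb and the reflexive particle are deliberately left out.

```python
import re

# functor "(" optional "prep.[" then "." + case digit, optional "]" and ")"
SLOT = re.compile(r"(\w+)\((?:(\w+)\.\[)?\.(\d)\]?\)")


def parse_frame(frame):
    """'ACT(.1) PAT(o.[.4])' -> {functor: (preposition or None, case digit)}"""
    return {m.group(1): (m.group(2), m.group(3)) for m in SLOT.finditer(frame)}


def case_tag(case):
    """15-position tag pattern with only the case (position 5) filled in."""
    return "...." + case + "." * 10


def realize(args, frame):
    """args maps functor -> lemma; returns (lemma, tag pattern) pairs per argument."""
    out = []
    for functor, (prep, case) in parse_frame(frame).items():
        if prep:
            out.append((prep, "." * 15))  # the preposition itself is left unconstrained
        out.append((args[functor], case_tag(case)))
    return out


PAIRS = realize({"ACT": "Martin", "PAT": "tygr"}, "ACT(.1) PAT(o.[.4])")
```

`PAIRS` reproduces the argument rows of the slide: Martin constrained to case 1, the preposition “o”, and tygr constrained to case 4. A morphological generator would then pick the concrete word forms.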

PARALLEL TREEBANK CZ-EN


Parallel Czech-English Annotation

• English text → Czech text (human translation)
• Czech side (goal): all layers manual annotation
• English side (goal):
  – Morphology and surface syntax: technical conversion
    • Penn Treebank style -> PDT Analytic layer
  – Tectogrammatical annotation: manual annotation
    • (Slightly) different rules needed for English
• Alignment
  – Natural, sentence level only (now)

English Annotation POS and Syntax

• Automatic conversion from Penn Treebank
  – PDT morphological layer
    • From POS tags
  – PDT analytic layer
    • From:
      – Penn Treebank Syntactic Structure
      – Non-terminal labels
      – Function tags (non-terminal “suffixes”)
• 2-step process
  – Head determination rules
  – Conversion to dependency + analytic function
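The two steps can be illustrated on a toy tree. Everything specific in this sketch is an assumption: the `HEAD_RULES` table, the `AFUN_FROM_TAG` mapping and the rightmost-child fallback are simplified stand-ins, not the rules actually used for the conversion described here.

```python
# Step 1 input: for each phrase label, child labels to prefer as head (hypothetical rules).
HEAD_RULES = {
    "S": ["VP"],
    "VP": ["VB", "VBD", "VP", "MD"],
    "NP": ["NN", "NNS", "NNP", "PRP"],
    "PP": ["IN"],
}
# Hypothetical mapping from Penn function tags to analytical functions.
AFUN_FROM_TAG = {"SBJ": "Sb", "TMP": "Adv"}


def to_dependencies(tree):
    """Convert a (label, children) tree into parallel lists: words, heads, afuns."""
    words, heads, afuns = [], [], []

    def visit(node):
        label, body = node
        if isinstance(body, str):  # preterminal: (POS tag, word)
            words.append(body)
            heads.append(-1)
            afuns.append(None)
            return len(words) - 1
        child_heads = [visit(child) for child in body]
        parts = label.split("-")  # e.g. "NP-SBJ" -> ["NP", "SBJ"]
        head_pos = len(body) - 1  # fallback: rightmost child
        for wanted in HEAD_RULES.get(parts[0], []):
            found = [i for i, c in enumerate(body) if c[0].split("-")[0] == wanted]
            if found:
                head_pos = found[0]
                break
        lexical_head = child_heads[head_pos]
        # Step 2: attach the other children's head words to this head word.
        for i, h in enumerate(child_heads):
            if i != head_pos:
                heads[h] = lexical_head
        for tag in parts[1:]:
            if tag in AFUN_FROM_TAG:
                afuns[lexical_head] = AFUN_FROM_TAG[tag]
        return lexical_head

    root = visit(tree)
    afuns[root] = "Pred"
    return words, heads, afuns


TREE = ("S", [
    ("NP-SBJ", [("PRP", "I")]),
    ("VP", [("VBD", "gave"),
            ("NP", [("PRP", "him")]),
            ("NP", [("DT", "the"), ("NN", "book")])]),
])
WORDS, HEADS, AFUNS = to_dependencies(TREE)
```

Words whose phrase carries no function tag stay without an analytical function in this sketch, which hints at why information can be lost or has to be supplied ad hoc when converting from phrase structure.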

Czech-English Example

Czech: Dicku Darmane, zavolejte do své kanceláře!
English: Dick Darman, call your office!

SUMMARY OF PART 1/1


PDT Treebanks at UFAL (written language)

• Czech
  – Prague Dependency Treebank
    • Complex annotation, all levels, additional annotation
  – Translation of Penn Treebank
    • Tectogrammatical layer only, no t/f
      – Analytical, morphology: automatic tool
• English
  – Re-annotation of Penn Treebank
• Other languages
  – Arabic (own annotation)
  – Other: by conversion (HamleDT – 30 treebanks)

Prague Dependency Treebanks

• Annotation:
  – 4 layers:
    • Words, lemmas/tags, surface dep. syntax, tectogrammatics
  – Tectogrammatical layer:
    • No function words, semantic relations
    • Valency/verb arguments (some MWE features)
      – Separate valency lexicon, fully linked from PDT nodes
    • Coreference, Topic/focus, Discourse
    • Links back to analytical layer (parsing!)

Now also in the Universal Dependencies format!

https://github.com/UniversalDependencies

Pointers

• PDT 2.0 (the “Original”), newest version: PDT 3.0
  – http://ufal.mff.cuni.cz/pdt2.0
  – http://ufal.mff.cuni.cz/pdt3.0
• PCEDT
  – http://ufal.mff.cuni.cz/pcedt2.0/
• PEDT
  – English side of PCEDT, additional: NE, coreference
  – http://ufal.mff.cuni.cz/pedt2.0/
• PADT (Arabic, morphology + surface syntax)
  – http://ufal.mff.cuni.cz/padt
• Other corpora, PDT-Vallex, EngVallex:
  – Search at http://lindat.cz
• LDC catalog numbers:
  – LDC2006T01 (PDT 2.0), LDC2004T23 (PADT 1.0), LDC2004T25 (PEDT 1.0)
• CoNLL 2009 shared task (7 languages, surface syntax + predicate arguments only)
  – http://ufal.mff.cuni.cz/conll2009-st
• HamleDT 2.0 (30 treebanks in unified format)
  – http://ufal.mff.cuni.cz/hamledt