treebanks, parsing, etc. · – grammar engineering • lovingly hand-crafted decades-long efforts...

Treebanks, parsing, etc.

Post on 02-Aug-2020


Page 1: Treebanks, parsing, etc. · – Grammar engineering • Lovingly hand-crafted decades-long efforts by humans to write grammars (typically in some particular grammar formalism of interest

Treebanks, parsing, etc.

Syntax and computers

• Parsing: input is sentence, output is tree (or equivalent representation)

• Browsing:
  – Finding particular syntactic structures within a corpus of sentences
  – Finding sentences that match a particular syntactic construction

• Information retrieval, machine translation, speech recognition, etc.

Why parsing is difficult: Newspaper headlines

• Iraqi Head Seeks Arms

• Juvenile Court to Try Shooting Defendant

• Teacher Strikes Idle Kids

• Stolen Painting Found by Tree

• Local High School Dropouts Cut in Half

• Red Tape Holds Up New Bridges

• Clinton Wins on Budget, but More Lies Ahead

• Hospitals Are Sued by 7 Foot Doctors

• Kids Make Nutritious Snacks

Ambiguous headlines

POLICE BEGIN CAMPAIGN TO RUN DOWN JAYWALKERS
SAFETY EXPERTS SAY SCHOOL BUS PASSENGERS SHOULD BE BELTED
DRUNK GETS NINE MONTHS IN VIOLIN CASE
FARMER BILL DIES IN HOUSE
IRAQI HEAD SEEKS ARMS
PROSTITUTES APPEAL TO POPE
BRITISH LEFT WAFFLES ON FALKLAND ISLANDS
LUNG CANCER IN WOMEN MUSHROOMS
TEACHER STRIKES IDLE KIDS
ENRAGED COW INJURES FARMER WITH AXE
JUVENILE COURT TO TRY SHOOTING DEFENDANT
TWO SOVIET SHIPS COLLIDE, ONE DIES

WordNet subcat frames

1 Something ----s
2 Somebody ----s
3 It is ----ing
4 Something is ----ing PP
5 Something ----s something Adjective/Noun
6 Something ----s Adjective/Noun
7 Somebody ----s Adjective
8 Somebody ----s something
9 Somebody ----s somebody
10 Something ----s somebody
11 Something ----s something
12 Something ----s to somebody
13 Somebody ----s on something
14 Somebody ----s somebody something
15 Somebody ----s something to somebody
16 Somebody ----s something from somebody
17 Somebody ----s somebody with something
18 Somebody ----s somebody of something
19 Somebody ----s something on somebody
20 Somebody ----s somebody PP
21 Somebody ----s something PP
22 Somebody ----s PP
23 Somebody's (body part) ----s
24 Somebody ----s somebody to INFINITIVE
25 Somebody ----s somebody INFINITIVE
26 Somebody ----s that CLAUSE
27 Somebody ----s to somebody
28 Somebody ----s to INFINITIVE
29 Somebody ----s whether INFINITIVE
30 Somebody ----s somebody into V-ing something
31 Somebody ----s something with something
32 Somebody ----s INFINITIVE
33 Somebody ----s VERB-ing
34 It ----s that CLAUSE
35 Something ----s INFINITIVE

English LCS lexicon

• Theta-grid information for verbs

• Derive ucat features
  – used to build syntactic structure

• Co-referenced with WordNet 2.0
  – theta-grids are aligned with ucat features and word sense information

English LCS lexicon data

10.6.a#1#_ag_th,mod-poss(of)#exonerate#exonerate#exonerate#exonerate+ed# (2.0,00874318_exonerate%2:32:00::)

"10.6.a"
:NAME "Verbs of Possessional Deprivation: Cheat Verbs / -of"
WORDS (absolve acquit balk bereave bilk bleed burgle cheat cleanse con cull cure defraud denude deplete depopulate deprive despoil disabuse disarm disencumber dispossess divest drain ease exonerate fleece free gull milk mulct pardon plunder purge purify ransack relieve render rid rifle rob sap strip swindle unburden void wean)
THETA_ROLES ((1 "_ag_th,mod-poss()") (1 "_ag_th,mod-poss(from)") (1 "_ag_th,mod-poss(of)"))
SENTENCES "He !!+ed the people (of their rights); He !!+ed him of his sins"

Doing syntax with computers

• To do this you need a grammar

• So where do grammars come from?
  – Grammar engineering
    • Lovingly hand-crafted decades-long efforts by humans to write grammars (typically in some particular grammar formalism of interest to the linguists developing the grammar).
  – TreeBanks
    • Semi-automatically generated sets of parse trees for the sentences in some corpus. Typically in a generic lowest common denominator formalism (of no particular interest to any modern linguist).

TreeBanks

• TreeBanks provide a grammar (of a sort).

• Hence they provide the training data for various computer applications that use syntax.

• But they can also provide useful data for more purely linguistic pursuits.
  – You might have a theory about whether or not something can happen in a particular language.
  – Or a theory about the contexts in which something can happen.
  – TreeBanks can give you the means to explore those theories, if you can formulate the questions in the right way and get the data you need.

A Penn Treebank sentence

( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by)
              (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in)
              (NP (DT that) (NN market)))))))
    (. .)))
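
One way to see how a treebank "provides a grammar (of a sort)" is to read the phrase-structure rules straight off a bracketed tree. The following is a minimal sketch, not a standard tool: the names `parse_brackets` and `rules` are made up for this illustration, and lexical rules such as DT → The are skipped on purpose.

```python
from collections import Counter

# The Penn Treebank sentence from the slide, re-indented for readability.
SENTENCE = """
( (S (NP-SBJ (DT The) (NN move))
     (VP (VBD followed)
         (NP (NP (DT a) (NN round))
             (PP (IN of)
                 (NP (NP (JJ similar) (NNS increases))
                     (PP (IN by) (NP (JJ other) (NNS lenders)))
                     (PP (IN against)
                         (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
         (, ,)
         (S-ADV (NP-SBJ (-NONE- *))
                (VP (VBG reflecting)
                    (NP (NP (DT a) (VBG continuing) (NN decline))
                        (PP-LOC (IN in) (NP (DT that) (NN market)))))))
     (. .)))
"""


def parse_brackets(text):
    """Turn a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                          # skip "("
        label = None
        if tokens[pos] != "(":            # the outermost wrapper has no label
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                          # skip ")"
        return (label, children)

    tree = read_node()
    if tree[0] is None and len(tree[1]) == 1:
        tree = tree[1][0]                 # unwrap the empty outer bracket
    return tree


def rules(node, counts=None):
    """Count the phrasal rules (parent -> child labels) used in a tree."""
    counts = Counter() if counts is None else counts
    label, children = node
    if all(isinstance(child, str) for child in children):
        return counts                     # preterminal such as (DT The): skip
    counts[label + " -> " + " ".join(child[0] for child in children)] += 1
    for child in children:
        rules(child, counts)
    return counts


tree = parse_brackets(SENTENCE)
counts = rules(tree)
for rule, n in counts.most_common():
    print(n, rule)
```

On this single sentence the rule PP -> IN NP is used three times and NP -> DT NN twice. Summing such counts over a whole corpus and normalising per left-hand side is one common way to get rule probabilities of the kind shown on the "A simple grammar" slide.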

Equivalent representations

• PS tree (phrase-markers)

• Bracketed labeling

• Automaton

• F-structure

Bracketed labeling

[IP [NP [Det The] [N dog]] [VP [V barked] [PP [P at] [NP [Det the] [N boy]]]]].

An automaton

F-structure

Time to be flexible!

• We have learned a way to diagram parse trees; it involves certain assumptions

• Not everybody agrees with all of these assumptions

• In fact, very few people agree on very many specifics at all

• Syntax resources reflect this diversity

• Hence the need to be flexible

Classical grammar engineering

• Write rules with associated lexicon
  – S → NP VP          NN → interest
  – NP → (DT) NN       NNS → rates
  – NP → NN NNS        NNS → raises
  – NP → NNP           VBP → interest
  – VP → V NP          VBZ → rates
  – Simple 10 rule grammar: 592 parses for some ambiguous sentences
  – Real-size broad-coverage grammar: millions of parses for a complicated sentence
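
A rough feel for why parse counts explode: even before any labels are assigned, the number of distinct binary bracketings of a sentence grows as the Catalan numbers. The sketch below only counts unlabeled binary tree shapes. It is an illustration of the growth rate, not a reconstruction of the 592 figure above, which depends on the particular grammar. The function name `binary_bracketings` is made up for this example.

```python
from math import comb


def binary_bracketings(n_words):
    """Number of binary tree shapes over n_words leaves: Catalan(n_words - 1)."""
    n = n_words - 1
    return comb(2 * n, n) // (n + 1)


for length in (3, 5, 10, 16, 20):
    print(length, "words:", binary_bracketings(length), "binary bracketings")
```

A real grammar rules out most of these shapes, but it also multiplies the remaining ones by the possible category labels, so long sentences can still end up with a very large number of parses.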

A simple grammar

S → NP VP     1.0        NP → NP PP          0.4
VP → V NP     0.7        NP → astronomers    0.1
VP → VP PP    0.3        NP → ears           0.18
PP → P NP     1.0        NP → saw            0.04
P → with      1.0        NP → stars          0.18
V → saw       1.0        NP → telescope      0.1
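
The grammar above is small enough to parse with by hand, or with a few lines of code. The following sketch is a plain chart parser (CKY style) that enumerates every parse and multiplies the rule probabilities along the way. The test sentence "astronomers saw stars with ears" is an assumption: it is simply built from the words in the lexicon, and the function name `parses` is made up for this example.

```python
from collections import defaultdict

# Binary rules: (parent, left child, right child) -> probability
BINARY = {
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7,
    ("VP", "VP", "PP"): 0.3,
    ("PP", "P", "NP"): 1.0,
    ("NP", "NP", "PP"): 0.4,
}

# Lexical rules: (category, word) -> probability
LEXICAL = {
    ("P", "with"): 1.0,
    ("V", "saw"): 1.0,
    ("NP", "astronomers"): 0.1,
    ("NP", "ears"): 0.18,
    ("NP", "saw"): 0.04,
    ("NP", "stars"): 0.18,
    ("NP", "telescope"): 0.1,
}


def parses(words):
    """Return all (probability, bracketed tree) pairs for an S over words."""
    n = len(words)
    chart = defaultdict(list)   # (start, end, category) -> [(prob, tree), ...]
    for i, word in enumerate(words):
        for (cat, w), p in LEXICAL.items():
            if w == word:
                chart[i, i + 1, cat].append((p, f"({cat} {word})"))
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for (parent, left, right), p in BINARY.items():
                    for lp, lt in chart.get((start, mid, left), []):
                        for rp, rt in chart.get((mid, end, right), []):
                            chart[start, end, parent].append(
                                (p * lp * rp, f"({parent} {lt} {rt})")
                            )
    return sorted(chart.get((0, n, "S"), []), reverse=True)


results = parses("astronomers saw stars with ears".split())
for prob, tree in results:
    print(f"{prob:.7f}  {tree}")
print("sentence probability:", sum(prob for prob, _ in results))
```

The sentence gets two parses: one attaches the PP to the noun phrase "stars", the other to the verb phrase. The grammar alone cannot say which is correct, but the probabilities give a preference, here for the noun-phrase attachment.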

Ambiguity

Ambiguity

• Tree for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 5/17/00)

Local V/N ambiguities

Ambiguity

• Local ambiguity means that we have to deal with multiple plausible choices during the parsing process.

• Global ambiguity means that the grammar can’t tell us which of several (many?) possible parses is the correct one.

Two possible PP attachments

Sample treebank parse

Sample treebank sentence

Sample NP rules

Example

How many rules?

A sample parsed sentence

PP attachment ambiguity (German)

PP attachment in Chinese

Sample trees

Searching treebank corpora

• Online
  – The Treebank Tool Suite
  – The VISL website
  – The NCLT website

• Offline
  – Treebank corpus
  – Search utilities: tgrep, tregex, etc.

tgrep
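
Search utilities of this kind let you describe a tree configuration and get back every matching subtree. As a rough illustration of the idea, the sketch below finds every node with a given label that immediately dominates a node with another label, which is what a tgrep-style pattern such as `PP < NP` asks for. This is not tgrep itself: the helper names are invented, and the sample tree is an illustrative Penn-style bracketing of "The dog barked at the boy", not a sentence taken from a corpus.

```python
def parse_brackets(text):
    """Parse "(LABEL child ...)" bracketing into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                          # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                          # skip ")"
        return (label, children)

    return read_node()


def subtrees(node):
    """Yield the node and all of its descendants, top-down, left to right."""
    yield node
    for child in node[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def leaves(node):
    """Return the words under a node, left to right."""
    out = []
    for child in node[1]:
        out.extend(leaves(child) if isinstance(child, tuple) else [child])
    return out


def immediately_dominates(tree, parent, child):
    """All nodes labeled `parent` that have a daughter labeled `child`."""
    return [
        node for node in subtrees(tree)
        if node[0] == parent
        and any(isinstance(c, tuple) and c[0] == child for c in node[1])
    ]


TREE = parse_brackets(
    "(S (NP (DT The) (NN dog)) (VP (VBD barked) (PP (IN at) (NP (DT the) (NN boy)))))"
)

pp_hits = [" ".join(leaves(n)) for n in immediately_dominates(TREE, "PP", "NP")]
np_hits = [" ".join(leaves(n)) for n in immediately_dominates(TREE, "NP", "DT")]
print(pp_hits)
print(np_hits)
```

Running the same function over every tree in a corpus turns a linguistic question ("where does this configuration occur?") into a list of matching phrases.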

TiGer Treebank

Example sentence: Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen . ("Next year the government wants to implement its reform plans.")

• Annotation on word level: part-of-speech, morphology, lemmata

  Word           POS        Morphology         Lemma
  Im             APPRART    Dat                in
  nächsten       ADJA       Sup.Dat.Sg.Neut    nahe
  Jahr           NN         Dat.Pl.Neut        Jahr
  will           VMFIN      3.Sg.Pres.Ind      wollen
  die            ART        Nom.Sg.Fem         die
  Regierung      NN         Nom.Sg.Fem         Regierung
  ihre           PPOSAT     Acc.Pl.Masc        ihr
  Reformpläne    NN         Acc.Pl.Masc        Plan
  umsetzen       VVINF      Inf                umsetzen
  .              $.

• Node labels: phrase categories (S, VP, NP, PP)

• Edge labels: syntactic functions (HD, SB, OC, OA, MO, AC, NK)

• Crossing branches for discontinuous constituency types

Parallel treebanks

• Translation training and studies

• Machine translation (MT) research & development

Aligning parses