Treebanks, parsing, etc.
Syntax and computers
• Parsing: input is sentence, output is tree (or equivalent representation)
• Browsing:
– Finding particular syntactic structures within a corpus of sentences
– Finding sentences that match a particular syntactic construction
• Information retrieval, machine translation, speech recognition, etc.
Why parsing is difficult: Newspaper headlines
• Iraqi Head Seeks Arms
• Juvenile Court to Try Shooting Defendant
• Teacher Strikes Idle Kids
• Stolen Painting Found by Tree
• Local High School Dropouts Cut in Half
• Red Tape Holds Up New Bridges
• Clinton Wins on Budget, but More Lies Ahead
• Hospitals Are Sued by 7 Foot Doctors
• Kids Make Nutritious Snacks
Ambiguous headlines
• POLICE BEGIN CAMPAIGN TO RUN DOWN JAYWALKERS
• SAFETY EXPERTS SAY SCHOOL BUS PASSENGERS SHOULD BE BELTED
• DRUNK GETS NINE MONTHS IN VIOLIN CASE
• FARMER BILL DIES IN HOUSE
• IRAQI HEAD SEEKS ARMS
• PROSTITUTES APPEAL TO POPE
• BRITISH LEFT WAFFLES ON FALKLAND ISLANDS
• LUNG CANCER IN WOMEN MUSHROOMS
• TEACHER STRIKES IDLE KIDS
• ENRAGED COW INJURES FARMER WITH AXE
• JUVENILE COURT TO TRY SHOOTING DEFENDANT
• TWO SOVIET SHIPS COLLIDE, ONE DIES
Soar 2003 Tutorial 5
WordNet subcat frames
1 Something ----s
2 Somebody ----s
3 It is ----ing
4 Something is ----ing PP
5 Something ----s something Adjective/Noun
6 Something ----s Adjective/Noun
7 Somebody ----s Adjective
8 Somebody ----s something
9 Somebody ----s somebody
10 Something ----s somebody
11 Something ----s something
12 Something ----s to somebody
13 Somebody ----s on something
14 Somebody ----s somebody something
15 Somebody ----s something to somebody
16 Somebody ----s something from somebody
17 Somebody ----s somebody with something
18 Somebody ----s somebody of something
19 Somebody ----s something on somebody
20 Somebody ----s somebody PP
21 Somebody ----s something PP
22 Somebody ----s PP
23 Somebody's (body part) ----s
24 Somebody ----s somebody to INFINITIVE
25 Somebody ----s somebody INFINITIVE
26 Somebody ----s that CLAUSE
27 Somebody ----s to somebody
28 Somebody ----s to INFINITIVE
29 Somebody ----s whether INFINITIVE
30 Somebody ----s somebody into V-ing something
31 Somebody ----s something with something
32 Somebody ----s INFINITIVE
33 Somebody ----s VERB-ing
34 It ----s that CLAUSE
35 Something ----s INFINITIVE
English LCS lexicon
• Theta-grid information for verbs
• Derive ucat features – used to build syntactic structure
• Co-referenced with WordNet2.0
– theta-grids are aligned with ucat features and word sense information
English LCS lexicon data
10.6.a#1#_ag_th,mod-poss(of)#exonerate#exonerate#exonerate#exonerate+ed# (2.0,00874318_exonerate%2:32:00::)
"10.6.a"
:NAME "Verbs of Possessional Deprivation: Cheat Verbs / -of"
WORDS (absolve acquit balk bereave bilk bleed burgle cheat cleanse con cull cure defraud denude deplete depopulate deprive despoil disabuse disarm disencumber dispossess divest drain ease exonerate fleece free gull milk mulct pardon plunder purge purify ransack relieve render rid rifle rob sap strip swindle unburden void wean)
THETA_ROLES ((1 "_ag_th,mod-poss()") (1 "_ag_th,mod-poss(from)") (1 "_ag_th,mod-poss(of)"))
SENTENCES "He !!+ed the people (of their rights); He !!+ed him of his sins"
Doing syntax with computers
• To do this you need a grammar
• So where do grammars come from?
– Grammar engineering
  • Lovingly hand-crafted, decades-long efforts by humans to write grammars (typically in some particular grammar formalism of interest to the linguists developing the grammar).
– TreeBanks
  • Semi-automatically generated sets of parse trees for the sentences in some corpus. Typically in a generic lowest-common-denominator formalism (of no particular interest to any modern linguist).
TreeBanks
• TreeBanks provide a grammar (of a sort).
• Hence they provide the training data for various computer applications that use syntax.
• But they can also provide useful data for more purely linguistic pursuits.
– You might have a theory about whether or not something can happen in a particular language.
– Or a theory about the contexts in which something can happen.
– TreeBanks can give you the means to explore those theories, if you can formulate the questions in the right way and get the data you need.
A Penn Treebank sentence
( (S
    (NP-SBJ (DT The) (NN move))
    (VP (VBD followed)
      (NP
        (NP (DT a) (NN round))
        (PP (IN of)
          (NP
            (NP (JJ similar) (NNS increases))
            (PP (IN by)
              (NP (JJ other) (NNS lenders)))
            (PP (IN against)
              (NP (NNP Arizona) (JJ real) (NN estate) (NNS loans))))))
      (, ,)
      (S-ADV
        (NP-SBJ (-NONE- *))
        (VP (VBG reflecting)
          (NP
            (NP (DT a) (VBG continuing) (NN decline))
            (PP-LOC (IN in)
              (NP (DT that) (NN market)))))))
    (. .)))
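Because the bracketing is just nested parentheses, a few lines of code can read it back into a tree. The following is a minimal sketch, not part of the original slides: the function names `parse_bracketed` and `leaves` are made up for illustration, and the tree is stored as plain `(label, children)` tuples rather than in any particular toolkit's format.

```python
# The Penn Treebank sentence from the slide, as one string.
SENTENCE = (
    "( (S (NP-SBJ (DT The) (NN move)) (VP (VBD followed) (NP (NP (DT a) "
    "(NN round)) (PP (IN of) (NP (NP (JJ similar) (NNS increases)) (PP (IN by) "
    "(NP (JJ other) (NNS lenders))) (PP (IN against) (NP (NNP Arizona) (JJ real) "
    "(NN estate) (NNS loans)))))) (, ,) (S-ADV (NP-SBJ (-NONE- *)) (VP "
    "(VBG reflecting) (NP (NP (DT a) (VBG continuing) (NN decline)) (PP-LOC "
    "(IN in) (NP (DT that) (NN market))))))) (. .)))"
)


def parse_bracketed(text):
    """Turn a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1  # skip the opening "("
        label = ""  # the outermost wrapper "( (S ...))" has no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    return read_node()


def leaves(tree):
    """Collect the words left to right, skipping empty elements (-NONE-)."""
    label, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            if label != "-NONE-":
                words.append(child)
        else:
            words.extend(leaves(child))
    return words


tree = parse_bracketed(SENTENCE)
print(" ".join(leaves(tree)))
```

Skipping `-NONE-` nodes drops the trace `*`, so the printed words are the surface sentence only.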
Equivalent representations
• PS tree (phrase-markers)
• Bracketed labeling
• Automaton
• F-structure
Bracketed labeling
[IP [NP [Det The] [N dog]] [VP [V barked] [PP [P at] [NP [Det the] [N boy]]]]].
An automaton
F-structure
Time to be flexible!
• We have learned a way to diagram parse trees; it involves certain assumptions
• Not everybody agrees with all of these assumptions
• In fact, very few people agree on very many specifics at all
• Syntax resources reflect this diversity
• Hence the need to be flexible
[Figures: slides labeled "flight"; images not included in the transcript]
Classical grammar engineering
• Write rules with associated lexicon
– S → NP VP          NN → interest
– NP → (DT) NN       NNS → rates
– NP → NN NNS        NNS → raises
– NP → NNP           VBP → interest
– VP → V NP          VBZ → rates
– Simple 10-rule grammar: 592 parses for some ambiguous sentences
– Real-size broad-coverage grammar: millions of parses for a complicated sentence
A simple grammar
S → NP VP    1.0        NP → NP PP         0.4
VP → V NP    0.7        NP → astronomers   0.1
VP → VP PP   0.3        NP → ears          0.18
PP → P NP    1.0        NP → saw           0.04
P → with     1.0        NP → stars         0.18
V → saw      1.0        NP → telescope     0.1
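To see what the rule probabilities buy us, the grammar can be run through a small chart parser. This is a sketch under stated assumptions: the test sentence "astronomers saw stars with ears" is our choice (built from the lexicon above, not given on the slide), and `cky` is an illustrative CKY-style routine written for this grammar only. For each span it tracks the number of analyses, their summed probability, and the probability of the best one.

```python
from collections import defaultdict

# Binary rules from the slide: (left child, right child) -> (parent, probability)
BINARY = {
    ("NP", "VP"): ("S", 1.0),
    ("V", "NP"): ("VP", 0.7),
    ("VP", "PP"): ("VP", 0.3),
    ("P", "NP"): ("PP", 1.0),
    ("NP", "PP"): ("NP", 0.4),
}

# Lexical rules from the slide: word -> [(category, probability)]
LEXICON = {
    "with": [("P", 1.0)],
    "saw": [("V", 1.0), ("NP", 0.04)],
    "astronomers": [("NP", 0.1)],
    "ears": [("NP", 0.18)],
    "stars": [("NP", 0.18)],
    "telescope": [("NP", 0.1)],
}


def cky(words, root="S"):
    """Return (number of parses, total probability, best-parse probability)."""
    n = len(words)
    count = defaultdict(int)    # (i, j, label) -> number of analyses
    total = defaultdict(float)  # (i, j, label) -> summed probability
    best = defaultdict(float)   # (i, j, label) -> probability of best analysis
    for i, word in enumerate(words):
        for label, p in LEXICON[word]:
            count[i, i + 1, label] = 1
            total[i, i + 1, label] = p
            best[i, i + 1, label] = p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for (left, right), (parent, p) in BINARY.items():
                    count[i, j, parent] += count[i, k, left] * count[k, j, right]
                    total[i, j, parent] += p * total[i, k, left] * total[k, j, right]
                    candidate = p * best[i, k, left] * best[k, j, right]
                    best[i, j, parent] = max(best[i, j, parent], candidate)
    return count[0, n, root], total[0, n, root], best[0, n, root]


n_parses, p_sentence, p_best = cky("astronomers saw stars with ears".split())
print(n_parses, p_sentence, p_best)
```

Under these rules the sentence has two parses: the PP attaches either to the noun phrase (0.1 × 0.7 × 0.4 × 0.18 × 0.18 = 0.0009072) or to the verb phrase (0.1 × 0.3 × 0.7 × 0.18 × 0.18 = 0.0006804). The probabilities give a way to prefer one of them, which the bare rules alone cannot do.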
Ambiguity
Ambiguity
• Tree for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 5/17/00)
Local V/N ambiguities
Ambiguity
• Local ambiguity means that we have to deal with multiple plausible choices during the parsing process.
• Global ambiguity means that the grammar can’t tell us which of several (many?) possible parses is the correct one.
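Global ambiguity grows quickly as a sentence gets longer. The sketch below counts parses for sentences with more and more prepositional phrases, using the rules of the "A simple grammar" slide with the probabilities dropped. The test sentences and the helper name `count_parses` are our own illustration, not from the slides.

```python
from collections import defaultdict

# Unweighted rules: (left child, right child) -> parent
RULES = {
    ("NP", "VP"): "S",
    ("V", "NP"): "VP",
    ("VP", "PP"): "VP",
    ("P", "NP"): "PP",
    ("NP", "PP"): "NP",
}

# Possible categories per word ("saw" is both a verb and a noun)
TAGS = {
    "astronomers": {"NP"},
    "stars": {"NP"},
    "ears": {"NP"},
    "saw": {"V", "NP"},
    "with": {"P"},
}


def count_parses(words, root="S"):
    """Count the distinct parse trees the grammar assigns to the words."""
    n = len(words)
    chart = defaultdict(int)  # (i, j, label) -> number of analyses
    for i, word in enumerate(words):
        for tag in TAGS[word]:
            chart[i, i + 1, tag] = 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (left, right), parent in RULES.items():
                    chart[i, j, parent] += chart[i, k, left] * chart[k, j, right]
    return chart[0, n, root]


counts = []
for k in range(5):
    sentence = ["astronomers", "saw", "stars"] + ["with", "ears"] * k
    counts.append(count_parses(sentence))
print(counts)
```

With zero to four trailing PPs the counts are 1, 2, 5, 14, and 42. Each added PP can attach to the verb phrase or to any noun phrase still open to its left, so the number of globally valid parses grows much faster than the sentence does.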
Two possible PP attachments
Sample treebank parse
Sample treebank sentence
Sample NP rules
11/2/2011 CSCI 5832 Spring 2006 32
Example
How many rules?
A sample parsed sentence
Not just newswire…
PP attachment ambiguity (German)
PP attachment in Chinese
Sample trees
Searching treebank corpora
• Online
– The Treebank Tool Suite
– The VISL website
– The NCLT website
• Offline
– Treebank corpus
– Search utilities: tgrep, tregex, etc.
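The core idea behind these search utilities is matching structural relations between nodes, for instance "an NP that immediately dominates a PP", which is roughly what a pattern such as `NP < PP` expresses in tgrep and Tregex. The sketch below hand-rolls that one query. It is not how either tool is implemented; the helper names and the one-tree sample "corpus" (a shortened version of the Penn Treebank sentence above) are assumptions for illustration.

```python
import re

# Illustrative one-tree corpus: a shortened version of the treebank sentence.
SAMPLE = (
    "(S (NP-SBJ (DT The) (NN move)) (VP (VBD followed) (NP (NP (DT a) "
    "(NN round)) (PP (IN of) (NP (JJ similar) (NNS increases))))) (. .))"
)


def read_tree(text):
    """Read a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [("ROOT", [])]
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "(":
            label = ""
            if tokens[i + 1] not in ("(", ")"):
                label = tokens[i + 1]
                i += 1
            node = (label, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif token == ")":
            stack.pop()
        else:
            stack[-1][1].append(token)  # a word
        i += 1
    return stack[0][1][0]


def subtrees(tree):
    """Yield the tree and every subtree below it."""
    yield tree
    for child in tree[1]:
        if not isinstance(child, str):
            yield from subtrees(child)


def base(label):
    """Strip function tags: NP-SBJ -> NP (labels like -NONE- are kept whole)."""
    return label if label.startswith("-") else label.split("-")[0]


def words(tree):
    """Return the words under a node, left to right."""
    out = []
    for child in tree[1]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out


def immediately_dominates(tree, parent, child):
    """Find nodes labeled `parent` that have a daughter labeled `child`."""
    return [
        node
        for node in subtrees(tree)
        if base(node[0]) == parent
        and any(not isinstance(c, str) and base(c[0]) == child for c in node[1])
    ]


matches = immediately_dominates(read_tree(SAMPLE), "NP", "PP")
for node in matches:
    print(" ".join(words(node)))
```

On the sample tree the only match is the object NP spanning "a round of similar increases". Real tools add many more relations (dominance at any depth, sisterhood, precedence) and run over whole corpora.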
tgrep
TiGer Treebank
[Figure: TiGer annotation of "Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen." ("Next year the government wants to implement its reform plans.")]
• Annotation on word level: part-of-speech, morphology, lemmata

Word          POS       Morphology        Lemma
Im            APPRART   Dat               in
nächsten      ADJA      Sup.Dat.Sg.Neut   nahe
Jahr          NN        Dat.Pl.Neut       Jahr
will          VMFIN     3.Sg.Pres.Ind     wollen
die           ART       Nom.Sg.Fem        die
Regierung     NN        Nom.Sg.Fem        Regierung
ihre          PPOSAT    Acc.Pl.Masc       ihr
Reformpläne   NN        Acc.Pl.Masc       Plan
umsetzen      VVINF     Inf               umsetzen
.             $.

• Node labels: phrase categories (S, VP, NP, NP, PP in the figure)
• Edge labels: syntactic functions (HD, SB, OC, OA, MO, AC, NK in the figure)
• Crossing branches for discontinuous constituency types
Parallel treebanks
• Translation training and studies
• Machine translation (MT) research & development
Aligning parses