CS 730: Text Mining for Social Media & Collaboratively Generated Content

Lecture 3: Parsing and Chunking

TRANSCRIPT

Page 1: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS 730: Text Mining for Social Media & Collaboratively Generated Content

Lecture 3: Parsing and Chunking

Page 2: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 2

“Big picture” of the course

Language models (word, n-gram, …)
Classification and sequence models
WSD, Part-of-speech tagging
Syntactic parsing and tagging
• Next week: semantics
• Next->Next week: Info Extraction + Text Mining Intro
– Fall break
• Part II: Social media (research papers start)

9/20/2010

Page 3: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 3

Today’s Lecture Plan

• Phrase Chunking

• Syntactic Parsing

• Machine Learning-based Syntactic Parsing

9/20/2010

Page 4: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 49/20/2010

Phrase Chunking

Page 5: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 59/20/2010

Why Chunking/Parsing?

Page 6: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 69/20/2010

Phrase Structure (continued)

Page 7: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 79/20/2010

Types of Phrases
• Phrases: classify by part of speech of main word or by syntactic role
– subject and predicate; noun phrase and verb phrase
In "The young cats drink milk.":
"The young cats" is a noun phrase and the subject; "drink milk" is a verb phrase and the predicate.
The main word is the head of the phrase: "cats" in "the young cats".
• Verb complements and modifiers
– types of complements: noun phrases, adjective phrases, prepositional phrases, particles
noun phrase: I served a brownie.
adjective phrase: I remained very rich.
prepositional phrase: I looked at Fred.
particle: He looked up the number.
– clauses; clausal complements
• I dreamt that I won a million brownies.
– tenses: simple past, present, future; progressive, perfect
simple present: John bakes cookies.
present progressive: John is baking cookies.
present perfect: John has baked cookies.
– active vs. passive
active: Bernie ate the banana.
passive: The banana was eaten by Bernie.

Page 8: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 89/20/2010

Noun Phrase Structure
• Left modifiers:
– determiner, quantifier, adjective, noun: the five shiny tin cans
• Right modifiers: prepositional phrases and apposition
– prepositional phrase: the man in the moon
– apposition: Scott, the Arctic explorer
• Relative clauses
the man who ate the popcorn
the popcorn which the man ate
the man who is eating the popcorn
the tourist who was eaten by a lion
• Reduced relative clauses
the man eating the popcorn
the man eaten by a lion

Page 9: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 99/20/2010

Attachment Ambiguities

Page 10: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 109/20/2010

Preliminaries: Constraint Grammars/CFGs

Page 11: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 119/20/2010

CFG (applying rewrite rules)

Page 12: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 129/20/2010

Preliminaries: CFG (continued)

Page 13: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 139/20/2010

Parsing

Page 14: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 149/20/2010

Human parsing

Page 15: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 159/20/2010

Chunking (from Abney 1994)
• “I begin with an intuition: when I read a sentence, I read it a chunk at a time.”
– Breaks up something like this:
• [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]
• Chunks correspond to prosodic patterns.
– Strongest stresses in the sentence fall one to a chunk
– Pauses are most likely to fall between chunks
• A typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template.
• A simple context-free grammar is often adequate to describe the structure of chunks.

Page 16: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 169/20/2010

Chunking (continued)
• Text chunking subsumes a range of tasks.
– The simplest is finding noun groups or base NPs:
• non-recursive noun phrases up to the head (for English).
– More ambitious systems may add additional chunk types, such as verb groups
– Seek a complete partitioning of the sentence into chunks of different types:
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only $1.8 billion ] [PP in ] [NP September ] .
(Steve Abney, Parsing by Chunks)
– The chunks are non-recursive structures which can be handled by finite-state methods (CFGs)
• Why do text chunking?
– Full parsing is expensive, and is not very robust.
– Partial parsing much faster, more robust, sufficient for many applications (IE, QA).
– Can also serve as a possible first step for full parsing

Page 17: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 179/20/2010

Chunking: Rule-based

• Quite high performance on NP chunking can be obtained with a small number of regular expressions (a minimal sketch follows below)
• With a larger rule set, using Constraint Grammar rules, Voutilainen reports recall of 98%+ with precision of 95-98% for noun chunks.
– Atro Voutilainen, NPtool, a Detector of English Noun Phrases, WVLC 93.
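A minimal sketch of the regular-expression idea (not Voutilainen's NPtool): match determiner-adjective-noun sequences over the POS tags of an already tagged sentence. The tag set and the single pattern here are illustrative assumptions, not the rule set reported above.

```python
import re

# Minimal regex-based NP chunking over POS tags (illustrative, not NPtool):
# an optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(DT )?(JJ )*(NN[SP]* ?)+")

def chunk_nps(tagged):
    """tagged: list of (word, POS) pairs; returns the NP word spans."""
    tags = " ".join(tag for _, tag in tagged) + " "
    chunks = []
    for m in NP_PATTERN.finditer(tags):
        start = tags[:m.start()].count(" ")          # token index of match start
        length = m.group(0).strip().count(" ") + 1   # number of tags matched
        chunks.append([w for w, _ in tagged[start:start + length]])
    return chunks

sent = [("The", "DT"), ("young", "JJ"), ("cats", "NNS"),
        ("drink", "VBP"), ("milk", "NN")]
print(chunk_nps(sent))   # [['The', 'young', 'cats'], ['milk']]
```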

Page 18: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 189/20/2010

Why can chunking be difficult?
• Two major sources of error (and these are also error sources for simple finite-state patterns for baseNP): participles and conjunction.
• Whether a participle is part of a noun phrase will depend on the particular choice of words:
He enjoys writing letters.
He sells writing paper.
– and sometimes is genuinely ambiguous ...
He enjoys baking potatoes.
He has broken bottles in the basement.
• The rules for conjoined NPs are complicated by the bracketing rules of the Penn Tree Bank.
– Conjoined prenominal nouns are generally treated as part of a single baseNP:
"brick and mortar university" (with "brick and mortar" modifying "university").
– Conjoined heads with shared modifiers are also to be treated as a single baseNP:
"ripe apples and bananas"
• If the modifier is not shared, there are two baseNPs:
"ripe apples and cinnamon".
• Modifier sharing, however, is hard for people to judge and is not consistently annotated.

Page 19: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 199/20/2010

Transformation-Based Learning for Chunking

• Adapted the TBL method from Brill's POS tagger. One-level NP chunking is restated as a word tagging task.
• Used 3 tags:
I (inside a baseNP)
O (outside a baseNP)
B (the start of a baseNP which immediately follows another baseNP)
(An encoding sketch follows below.)
• Initial tags assigned based on the most likely tag for a given part-of-speech.
• The contexts for TBL rules: words, part-of-speech assignments, and prior IOB tags.

Ramshaw & Marcus, Text Chunking using Transformation-Based Learning, WVLC 1995
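A small sketch of the I/O/B encoding itself (not Ramshaw & Marcus's implementation): bracketed baseNPs are turned into one tag per word, with B used only when a baseNP immediately follows another baseNP.

```python
# Sketch of the I/O/B encoding: I = inside a baseNP, O = outside,
# B = first word of a baseNP that immediately follows another baseNP.
def to_iob(chunked):
    """chunked: list of items, each either a word (outside any NP)
    or a list of words forming one baseNP."""
    tags, prev_was_np = [], False
    for item in chunked:
        if isinstance(item, list):                   # a baseNP
            first = "B" if prev_was_np else "I"
            tags.extend([(w, first if i == 0 else "I")
                         for i, w in enumerate(item)])
            prev_was_np = True
        else:                                        # a word outside any NP
            tags.append((item, "O"))
            prev_was_np = False
    return tags

# [NP He] reckons [NP the current account deficit] ...
print(to_iob([["He"], "reckons", ["the", "current", "account", "deficit"]]))
# [('He', 'I'), ('reckons', 'O'), ('the', 'I'), ('current', 'I'),
#  ('account', 'I'), ('deficit', 'I')]
```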

Page 20: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 209/20/2010

TBL-based Chunking (2)

• Results can be scored based on the correct assignment of tags, or on recall and precision of complete baseNPs.
– The latter is normally used as the metric, since it corresponds to the actual objective -- different tag sets can be used as an intermediate representation.
– Obtained about 92% recall and precision with their system for baseNPs, using 200K words of training.
– Without lexical information: 90.5% recall and precision.

Ramshaw & Marcus, Text Chunking using Transformation-Based Learning, WVLC 1995

Page 21: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 219/20/2010

Chunking: Classification-based

• Classification task:
– NP or not NP?
• Using classifiers for Chunking
– The best performance on the base NP and chunking tasks was obtained using a Support Vector Machine method.
– They obtained an accuracy of 94.22% with the small data set of Ramshaw and Marcus, and 95.77% by training on almost the entire Penn Treebank.

Taku Kudo & Yuji Matsumoto, Chunking with Support Vector Machines, Proc. NAACL 01.

Page 22: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 229/20/2010

Hand-tuning vs. Machine Learning
• BaseNP chunking is a task for which people (with some linguistics training) can write quite good rules quickly.
• This raises the practical question of whether we should be using machine learning at all.
– If there is already a large relevant resource, it makes sense to learn from it.
– However, if we have to develop a chunker for a new language, is it cheaper to annotate some data or to write the rules directly?
• Ngai and Yarowsky addressed this question.
– They also considered selecting the data to be annotated.
– Traditional training is based on sequential text annotation ... we just annotate a series of documents in sequence.
– Can we do better?
• Ngai, G. and D. Yarowsky, Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking. ACL 2000

Page 23: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 239/20/2010

Active Learning

• Instead of annotating training examples sequentially, choose good examples

• Usually, choose examples “on the boundary” – i.e., for which classifier has low confidence

• Very often allows training to converge much faster than sequential/batch learning.

• Drawback: requires user in the loop.

Ngai &Yarowsky ACL 2000

Page 24: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 249/20/2010

Active Learning (continued)
Ngai & Yarowsky, ACL 2000

Page 25: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 259/20/2010

Rule Writing vs. Active Learning
Ngai & Yarowsky, ACL 2000

Page 26: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 269/20/2010

Rule Writing vs. Annotation Learning

• Annotation:
– Can continue indefinitely
– Can combine efforts of multiple annotators
– More consistent results
– Accuracy can be improved by better learning algorithms
• Rule writing:
– Must keep in mind rule interactions
– Difficult to combine rules from different experts
– Requires more skill
– Accuracy limited by the set of rules (will not improve further)

Ngai & Yarowsky, ACL 2000

Page 27: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 279/20/2010

The parsing problem

[Diagram: a PARSER, given a grammar, parses test sentences; a scorer compares its output against the correct test trees to measure accuracy.]

Recent parsers are quite accurate (Eisner, Collins, Charniak, etc.)

Page 28: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 289/20/2010

Applications of parsing (1/2)

Machine translation (Alshawi 1996, Wu 1997, ...)
[Diagram: tree operations mapping an English tree to a Chinese tree.]

Speech synthesis from parses (Prevost 1996)
The government plans to raise income tax.
The government plans to raise income tax the imagination.

Speech recognition using parsing (Chelba et al. 1998)
Put the file in the folder.
Put the file and the folder.

Page 29: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 299/20/2010

Applications of parsing (2/2) Grammar checking (Microsoft)

Indexing for information retrieval (Woods 1997)

... washing a car with a hose ... vehicle maintenance

Information extraction (Hobbs 1996)

[Diagram: parsed NY Times archive feeding a database that answers queries.]

Page 30: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 309/20/2010

Parsing for the Turing Test

Most linguistic properties are defined over trees. One needs to parse to see subtle distinctions. E.g.:

Sara dislikes criticism of her. (her ≠ Sara)

Sara dislikes criticism of her by anyone. (her ≠ Sara)

Sara dislikes anyone’s criticism of her. (her = Sara or her ≠ Sara)

Page 31: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 319/20/2010

What makes a good grammar

• Conjunctions must match
– I ate a hamburger and on the stove.
– I ate a cold hot dog and well burned.
– I ate the hot dog slowly and a hamburger.

Page 32: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 329/20/2010

Vanilla CFG not sufficient for NL

• Number agreement
– a men
• DET selection
– a apple
• Tense, mood, etc. agreement
• For now, let’s see what it would take to parse English with a vanilla CFG

Page 33: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 339/20/2010

Parsing re-defined

Page 34: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 349/20/2010

Revised CFG

Page 35: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 359/20/2010

In: cats scratch people with claws

Page 36: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 369/20/2010

Soundness and Completeness in Parsing

Page 37: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 379/20/2010

Top-Down Parsing
• Top-down parsing is goal directed.
• A top-down parser starts with a list of constituents to be built.
• The top-down parser rewrites the goals in the goal list by matching one against the LHS of the grammar rules, and expanding it with the RHS, attempting to match the sentence to be derived.
• If a goal can be rewritten in several ways, then there is a choice of which rule to apply (search problem)
• Can use depth-first or breadth-first search, and goal ordering.

Page 38: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 389/20/2010

Simple Top-down parsing algorithm
1. Start with initial state ((S) 1) and no backup states.
2. Select current state: Take the first state off the possibilities list and call it C.
3. If the possibilities list is empty, then the algorithm fails (that is, no successful parse is possible).
4. If C consists of an empty symbol list and the word position is at the end of the sentence, then the algorithm succeeds. Otherwise, generate the next possible states.
5. If the first symbol on the symbol list of C is a lexical symbol, and the next word in the sentence can be in that class, then create a new state by removing the first symbol from the symbol list and updating the word position, and add it to the possibilities list. Otherwise, if the first symbol on the symbol list of C is a non-terminal, generate a new state for each rule in the grammar that can rewrite that nonterminal symbol and add them all to the possibilities list.
(A depth-first version is sketched in code below.)
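A minimal depth-first rendering of the algorithm above. States are (symbols-to-find, word-position) pairs and the possibilities list is a stack; the small grammar and lexicon follow the "The dogs cried" example, but the exact rule set of the original slides is not shown here, so treat them as illustrative.

```python
# Depth-first top-down recognizer: pop a state, expand its first symbol
# (non-terminal -> push one new state per rule; lexical -> consume a word).
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["ART", "N"], ["ART", "ADJ", "N"]],
    "VP": [["V"], ["V", "NP"]],
}
LEXICON = {"the": {"ART"}, "dogs": {"N", "V"}, "cried": {"V"}}

def top_down_parse(words):
    stack = [(("S",), 0)]                           # initial state ((S) 1)
    while stack:
        symbols, pos = stack.pop()                  # select current state
        if not symbols:
            if pos == len(words):
                return True                         # empty goals, all words used
            continue
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:                        # non-terminal: expand by each rule
            for rhs in reversed(GRAMMAR[first]):
                stack.append((tuple(rhs) + rest, pos))
        elif pos < len(words) and first in LEXICON.get(words[pos], set()):
            stack.append((rest, pos + 1))           # lexical symbol matched next word
    return False                                    # possibilities list exhausted

print(top_down_parse("the dogs cried".split()))     # True
```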

Page 39: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 399/20/2010

Top-down as search

• For a depth-first strategy, the possibilities list is a stack. In other words, step 1 always takes the first element off the list, and step 3 always puts the new states on the front of the list, yielding a last-in first-out (LIFO) strategy.
• In contrast, in a breadth-first strategy the possibilities list is manipulated as a queue. Step 3 adds the new positions onto the end of the list, rather than the beginning, yielding a first-in first-out (FIFO) strategy.

Page 40: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 409/20/2010

Top-down example
• Grammar: same CFG as before
• Lexicon:
– cried: V
– dogs: N, V
– the: ART
• Input: The/1 dogs/2 cried/3
• A typical parse state:
– ((N VP) 2)
– Parser needs to find N followed by VP, starting at position 2

Page 41: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 419/20/2010

Parsing “The dogs cried”

Step | Current State     | Backup States                    | Comment
1.   | ((S) 1)           |                                  | initial position
2.   | ((NP VP) 1)       |                                  | rewriting S by rule 1
3.   | ((ART N VP) 1)    | ((ART ADJ N VP) 1)               | rewriting NP by rules 2 & 3
4.   | ((N VP) 2)        | ((ART ADJ N VP) 1)               | matching ART with "the"
5.   | ((VP) 3)          | ((ART ADJ N VP) 1)               | matching N with "dogs"
6.   | ((V) 3)           | ((V NP) 3), ((ART ADJ N VP) 1)   | rewriting VP by rules 5-8
7.   |                   |                                  | the parse succeeds as V is matched to "cried", leaving an empty grammatical symbol list with an empty sentence

Page 42: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 429/20/2010

Problems with Top-down

• Left recursive rules

• A top-down parser will do badly if there are many different rules for the same LHS. Consider if there are 600 rules for S, 599 of which start with NP, but one of which starts with V, and the sentence starts with V.

• Useless work: expands things that are possible top-down but not there

• Top-down parsers do well if there is useful grammar-driven control: search is directed by the grammar

• Top-down is hopeless for rewriting parts of speech (preterminals) with words (terminals). In practice that is always done bottom-up as lexical lookup.

• Repeated work: anywhere there is common substructure

Page 43: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 439/20/2010

• Bottom-up parsing is data directed.
• The initial goal list of a bottom-up parser is the string to be parsed. If a sequence in the goal list matches the RHS of a rule, then this sequence may be replaced by the LHS of the rule.
• Parsing is finished when the goal list contains just the start category.
• If the RHS of several rules match the goal list, then there is a choice of which rule to apply (search problem)
• Can use depth-first or breadth-first search, and goal ordering.
• The standard presentation is as shift-reduce parsing (sketched below).
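A bare-bones shift-reduce sketch of the bottom-up idea: shift words onto a stack, replacing any stack suffix that matches the RHS of a rule with its LHS. This greedy version has no backtracking, so it only illustrates the data-directed control flow; the tiny grammar and lexicon are assumptions for the example.

```python
# Greedy shift-reduce recognizer (no search): shift, then reduce while possible.
RULES = [                       # (LHS, RHS) pairs
    ("NP", ("ART", "N")),
    ("VP", ("V",)),
    ("S",  ("NP", "VP")),
]
LEXICON = {"the": "ART", "dogs": "N", "cried": "V"}

def shift_reduce(words):
    stack = []
    for w in words:
        stack.append(LEXICON[w])                 # shift (with lexical lookup)
        reduced = True
        while reduced:                           # reduce as long as some RHS matches
            reduced = False
            for lhs, rhs in RULES:
                if tuple(stack[-len(rhs):]) == rhs:
                    stack[-len(rhs):] = [lhs]
                    reduced = True
                    break
    return stack                                 # ["S"] means the whole string parsed

print(shift_reduce("the dogs cried".split()))    # ['S']
```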

Page 44: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CS730: Text Mining for Social Media, F2010 449/20/2010

Problems with Bottom-up Parsing
• Unable to deal with empty categories: termination problem, unless rewriting empties as constituents is somehow restricted (but then it’s generally incomplete)
• Useless work: locally possible, but globally impossible.
• Inefficient when there is great lexical ambiguity (grammar-driven control might help here)
• Conversely, it is data-directed: it attempts to parse the words that are there.
• Repeated work: anywhere there is common substructure
• Both TD (LL) and BU (LR) parsers can (and frequently do) do work exponential in the sentence length on NLP problems.

Page 45: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Dynamic Programming for Parsing

• Systematically fill in tables of solutions to sub-problems.

• Store subtrees for each of the various constituents in the input as they are discovered

• Cocke-Kasami-Younger (CKY) algorithm, Earley’s algorithm, and chart parsing.

9/20/2010 45CS730: Text Mining for Social Media, F2010

Page 46: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CKY algorithm (BU), recognizer version

Input: string of n words
Output: yes/no (since it’s only a recognizer)
Data structure: n x n table
– rows labeled 0 to n-1
– columns labeled 1 to n
– cell [i,j] lists constituents found between i and j
(A minimal implementation is sketched below.)

9/20/2010 46CS730: Text Mining for Social Media, F2010
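A minimal CKY recognizer over that table layout, for a grammar in Chomsky Normal Form. The toy grammar and lexicon here are illustrative assumptions (the lecture's "miniature grammar" slide is a figure and is not reproduced in the transcript).

```python
# Minimal CKY recognizer: cell table[i][j] holds the constituents found
# between word positions i and j; wider spans are built from two smaller ones.
UNARY  = {"book": {"N", "V"}, "that": {"Det"}, "flight": {"N"}}
BINARY = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP", "S"}, ("NP", "VP"): {"S"}}

def cky_recognize(words, start="S"):
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):                     # lexical cells [i, i+1]
        table[i][i + 1] = set(UNARY.get(w, set()))
    for width in range(2, n + 1):                     # wider spans, bottom up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                 # split point
                for b in table[i][k]:
                    for c in table[k][j]:
                        table[i][j] |= BINARY.get((b, c), set())
    return start in table[0][n]

print(cky_recognize("book that flight".split()))      # True
```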

Page 47: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Miniature Grammar

9/20/2010 47CS730: Text Mining for Social Media, F2010

Page 48: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CKY Example

9/20/2010 48CS730: Text Mining for Social Media, F2010

Page 49: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CKY Algorithm

9/20/2010 49CS730: Text Mining for Social Media, F2010

Page 50: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

9/20/2010 50CS730: Text Mining for Social Media, F2010

Page 51: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CKY: Fill last column after “Houston”

[Slides 51-55 step through filling the last column of the CKY chart after reading “Houston”; only chart figures appear on these slides.]

Page 56: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

CKY Algorithm: Additional Information

• More formal algorithm analysis/description:http://www.cs.uiuc.edu/class/sp09/cs373/lectures/lect_15.pdf

• Online demo: http://homepages.uni-tuebingen.de/student/martin.lazarov/demos/cky.html

9/20/2010 CS730: Text Mining for Social Media, F2010 56

Page 57: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Feature-Augmented CFGs
• Motivation: Agreement
– Most verbs in English can appear in two forms in the present tense:
• the form used for third-person singular subjects (the flight does). Called 3sg. Has final -s.
• the form used for all other kinds of subjects (all the flights do, I do). Let’s call it non-3sg. Usually does not have final -s.
– Sentences in which the subject does not agree with the verb are ungrammatical:
• *[What flight] leave in the morning?
• *Does [NP you] have a flight from Boston to Fort Worth?
• *Do [NP this flight] stop in Dallas?

9/20/2010 57CS730: Text Mining for Social Media, F2010

Page 58: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Agreement (continued)
• Rule for yes-no-questions:
– S → Aux NP VP
• We could replace it with two rules:
– S → 3sgAux 3sgNP VP
– S → Non3sgAux Non3sgNP VP
• Also have to add rules for the lexicon:
– 3sgAux → does | has | can | . . .
– Non3sgAux → do | have | can | . . .
• Also need to add rules for 3sgNP and Non3sgNP:
– Make two copies of each rule for NP
• The problem with this method of dealing with number agreement is that it doubles the size of the grammar

9/20/2010 58CS730: Text Mining for Social Media, F2010

Page 59: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Other Agreement Issues

• Head nouns and determiners have to agree:
– this flight / *this flights
– those flights / *those flight
• Problems in languages like German or French, which have gender agreement
• Solutions:
– Proliferate rules (lots and lots of CFG rules)
– Or…

9/20/2010 59CS730: Text Mining for Social Media, F2010

Page 60: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Feature-Augmented CFGs
• Number agreement feature:
[NUMBER SG]
• Adding an additional feature-value pair to capture person:
[NUMBER SG, PERSON 3]
• Encode the grammatical category of the constituent:
[CAT NP, NUMBER SG, PERSON 3]
• Represents the 3sgNP category of noun phrases.
• Corresponding plural version of this structure:
[CAT NP, NUMBER PL, PERSON 3]

9/20/2010 60CS730: Text Mining for Social Media, F2010

Page 61: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Features (continued)

9/20/2010 61CS730: Text Mining for Social Media, F2010

Page 62: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Features in Grammar

• CFG rules with constraint features
– Example: Number agreement
S → NP VP
[NP NUMBER] = [VP NUMBER]
• Feature augmentation changes CFGs
– No longer blind concatenation of non-terminals
• Can be used as filters in Earley’s algorithm
(A small agreement-checking sketch follows below.)

9/20/2010 62CS730: Text Mining for Social Media, F2010
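A toy illustration of the [NP NUMBER] = [VP NUMBER] constraint, assuming constituents carry feature dictionaries; this is not how an Earley filter is implemented, just the agreement check itself.

```python
# S -> NP VP applies only if the NUMBER features of NP and VP agree
# (or one side leaves NUMBER unspecified).
def agree(left, right, feature):
    a, b = left.get(feature), right.get(feature)
    return a is None or b is None or a == b

def apply_s_rule(np, vp):
    if np["CAT"] == "NP" and vp["CAT"] == "VP" and agree(np, vp, "NUMBER"):
        return {"CAT": "S", "NUMBER": np.get("NUMBER") or vp.get("NUMBER")}
    return None                                      # constraint blocks the rule

flights = {"CAT": "NP", "NUMBER": "PL", "PERSON": 3}
leave   = {"CAT": "VP", "NUMBER": "PL"}
leaves  = {"CAT": "VP", "NUMBER": "SG"}
print(apply_s_rule(flights, leave))    # {'CAT': 'S', 'NUMBER': 'PL'}
print(apply_s_rule(flights, leaves))   # None -- number clash
```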

Page 63: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Feature Augmentation Key Ideas

• The elements of context-free grammar rules have feature-based constraints associated with them. Shift from atomic grammatical categories to more complex categories with properties.

• The constraints associated with individual rules can refer to, and manipulate, the feature structures associated with the parts of the rule to which they are attached.

9/20/2010 63CS730: Text Mining for Social Media, F2010

Page 64: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Dependency Grammars

– I gave him my address• All links between lexical (word) nodes• About 35 syntactic and semantic relations

9/20/2010 64CS730: Text Mining for Social Media, F2010

Page 65: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Dependency Grammars (continued)

• Advantages:– Free word order

• Implementations:– Link parser (Sleator). Freely available

9/20/2010 65CS730: Text Mining for Social Media, F2010

Page 66: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

LEARNING TO PARSE

9/20/2010 CS730: Text Mining for Social Media, F2010 66

Page 67: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Parsing in the early 1990s

• The parsers produced detailed, linguistically rich representations

• Parsers had uneven and usually rather poor coverage
– E.g., 30% of sentences received no analysis
• Even quite simple sentences had many possible analyses
– Parsers either had no method to choose between them or a very ad hoc treatment of parse preferences
• Parsers could not be learned from data
• Parser performance usually wasn’t or couldn’t be assessed quantitatively, and the performance of different parsers was often incommensurable

9/20/2010 67CS730: Text Mining for Social Media, F2010

Page 68: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Ambiguity

• John saw Mary
– Typhoid Mary
– Phillips screwdriver Mary
note how rare rules interact
• I see a bird
– is this 4 nouns – parsed like “city park scavenger bird”?
rare parts of speech, plus systematic ambiguity in noun sequences
• Time flies like an arrow
– Fruit flies like a banana
– Time reactions like this one
– Time reactions like a chemist
– or is it just an NP?

The official seat, center of authority, jurisdiction, or office of a bishop

9/20/2010 68CS730: Text Mining for Social Media, F2010

Page 69: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Our bane: Ambiguity

• John saw Mary
– Typhoid Mary
– Phillips screwdriver Mary
note how rare rules interact
• I see a bird
– is this 4 nouns – parsed like “city park scavenger bird”?
rare parts of speech, plus systematic ambiguity in noun sequences
• Time | flies like an arrow (NP VP)
– Fruit flies | like a banana (NP VP)
– Time | reactions like this one (V[stem] NP)
– Time reactions | like a chemist (S PP)
– or is it just an NP?

9/20/2010 69CS730: Text Mining for Social Media, F2010

Page 70: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

May 2007 example…

9/20/2010 CS730: Text Mining for Social Media, F2010 70

Page 71: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

How to solve this combinatorial explosion of ambiguity?

1. First try parsing without any weird rules, throwing them in only if needed.

2. Better: every rule has a weight. A tree’s weight is the total weight of all its rules. Pick the overall lightest parse of the sentence.

3. Can we pick the weights automatically? Yes: Statistical Parsing

9/20/2010 71CS730: Text Mining for Social Media, F2010

Page 72: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Statistical parsing

• Over the last 12 years statistical parsing has succeeded wonderfully!

• NLP researchers have produced a range of (often free, open source) statistical parsers, which can parse any sentence and often get most of it correct

• These parsers are now a commodity component

• The parsers are still improving.9/20/2010 72CS730: Text Mining for Social Media, F2010

Page 73: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Learning to Parse: A Taste
• Penn Treebank project (about 1M words)

9/20/2010 73CS730: Text Mining for Social Media, F2010

Page 74: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Using a Treebank as Grammar

Among the roughly 4,500 different rules for expanding VP are separate rules for PP sequences of any length, and for every possible arrangement of verb arguments:

9/20/2010 74CS730: Text Mining for Social Media, F2010

Page 75: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Naïve Treebank Grammar
• 17,500 distinct rule types.

9/20/2010 75CS730: Text Mining for Social Media, F2010

Page 76: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Spoken Language Syntax
• Utterances (vs. sentences)
• Much higher rate of pronouns
• Repair phenomena (~40% of sentences)
– use of the words uh and um, word repetitions, restarts, and word fragments (“uh” is the most common word)

9/20/2010 76CS730: Text Mining for Social Media, F2010

Page 77: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Treebanks for Speech: LDC Switchboard

9/20/2010 77CS730: Text Mining for Social Media, F2010

Page 78: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Statistical parsing applications

• High precision question answering systems (Pasca and Harabagiu SIGIR 2001)

• Improving biological named entity extraction (Finkel et al. JNLPBA 2004):

• Syntactically based sentence compression (Lin and Wilbur Inf. Retr. 2007)

• Extracting people’s opinions about products (Bloom et al. NAACL 2007)

• Improved interaction in computer games (Gorniak and Roy, AAAI 2005)

• Helping linguists find data (Resnik et al. BLS 2005)

9/20/2010 78CS730: Text Mining for Social Media, F2010

Page 79: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Probabilistic CKY

9/20/2010 79CS730: Text Mining for Social Media, F2010

Page 80: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Running example: time 1 flies 2 like 3 an 4 arrow 5

Weighted grammar (lower weight is better):
1  S → NP VP      1  VP → V NP      1  NP → Det N
6  S → Vst NP     2  VP → VP PP     2  NP → NP PP
2  S → S PP                         3  NP → NP NP
0  PP → P NP

Lexical entries filling the chart diagonal:
time: NP 3, Vst 3;  flies: NP 4, VP 4;  like: P 2, V 5;  an: Det 1;  arrow: N 8

9/20/2010 80 CS730: Text Mining for Social Media, F2010

[Pages 81-102 (chart figures): the chart is filled cell by cell. Each new cell [i,j] combines an entry from [i,k] with an entry from [k,j] using a grammar rule and adds the weights: cell [3,5] gets NP 10 (Det 1 + N 8 + 1 for NP → Det N), cell [2,5] gets PP 12 and VP 16, cell [1,5] gets NP 18, VP 18 and S 21, and the full-sentence cell [0,5] ends up with entries such as NP 24, S 22 and S 27.]

[Pages 103-107 (chart figures): follow backpointers from the best S in cell [0,5] to recover the parse: S → NP VP, the VP expands as VP PP, the PP as P NP, and the NP as Det N, yielding (S (NP time) (VP (VP flies) (PP (P like) (NP (Det an) (N arrow))))).]

[Pages 108-113 (chart figures): which entries do we need? Duplicate entries with worse weights are not worth keeping, since they just breed worse options. Keep only the best-in-class entry for each category in each cell (plus backpointers so you can recover the parse).]

Page 114: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Probabilistic Trees

• Instead of the lightest-weight tree, take the highest-probability tree.
• Given any tree, a generator should have some probability of producing it!
• Just like using n-grams to choose among strings …
• What is the probability of this tree?

(S (NP time) (VP (VP flies) (PP (P like) (NP (Det an) (N arrow)))))

9/20/2010 114 CS730: Text Mining for Social Media, F2010

Page 115: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Probabilistic or stochastic context-free grammars (PCFGs)
• G = (T, N, S, R, P)
– T is a set of terminals
– N is a set of nonterminals
• For NLP, we usually distinguish a set P ⊂ N of preterminals, which always rewrite as terminals
• S is the start symbol (one of the nonterminals)
• R is a set of rules/productions of the form X → γ, where X is a nonterminal and γ is a sequence of terminals and nonterminals (possibly an empty sequence)
• P(R) gives the probability of each rule:
for every X ∈ N:  Σ over rules X → γ in R of P(X → γ) = 1
• A grammar G generates a language model L:
Σ over strings γ ∈ T* of P(γ) = 1

9/20/2010 115 CS730: Text Mining for Social Media, F2010

Page 116: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

The probability of trees and strings

• P(t) -- the probability of a tree is the product of the probabilities of the rules used to generate it.
• P(w1n) -- the probability of the string is the sum of the probabilities of the trees which have that string as their yield:

P(w1n) = Σj P(w1n, tj)   where tj is a parse of w1n
       = Σj P(tj)

9/20/2010 116 CS730: Text Mining for Social Media, F2010

Page 117: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

A Simple PCFG (in CNF)

S → NP VP    1.0        NP → NP PP         0.4
VP → V NP    0.7        NP → astronomers   0.1
VP → VP PP   0.3        NP → ears          0.18
PP → P NP    1.0        NP → saw           0.04
P → with     1.0        NP → stars         0.18
V → saw      1.0        NP → telescope     0.1

9/20/2010 117CS730: Text Mining for Social Media, F2010

Page 118: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

9/20/2010 118

CS730: Text Mining for Social Media, F2010

Page 119: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

9/20/2010 119

CS730: Text Mining for Social Media, F2010

Page 120: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Tree and String Probabilities
• w15 = astronomers saw stars with ears
• P(t1) = 1.0 × 0.1 × 0.7 × 1.0 × 0.4 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0009072
• P(t2) = 1.0 × 0.1 × 0.3 × 0.7 × 1.0 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0006804
• P(w15) = P(t1) + P(t2) = 0.0009072 + 0.0006804 = 0.0015876
(The arithmetic is reproduced in the sketch below.)

9/20/2010 120CS730: Text Mining for Social Media, F2010
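A short check of the arithmetic above, using the rule probabilities from the "Simple PCFG (in CNF)" slide; the two rule lists t1 and t2 spell out the NP-attachment and VP-attachment parses of "astronomers saw stars with ears".

```python
import math

# Rule probabilities from the simple PCFG above; t1 attaches the PP to the
# object NP ("stars with ears"), t2 attaches it to the VP.
P = {
    ("S", "NP VP"): 1.0, ("VP", "V NP"): 0.7, ("VP", "VP PP"): 0.3,
    ("PP", "P NP"): 1.0, ("P", "with"): 1.0, ("V", "saw"): 1.0,
    ("NP", "NP PP"): 0.4, ("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18,
    ("NP", "saw"): 0.04, ("NP", "stars"): 0.18, ("NP", "telescope"): 0.1,
}

def tree_prob(rules):
    return math.prod(P[r] for r in rules)

t1 = [("S", "NP VP"), ("NP", "astronomers"), ("VP", "V NP"), ("V", "saw"),
      ("NP", "NP PP"), ("NP", "stars"), ("PP", "P NP"), ("P", "with"),
      ("NP", "ears")]
t2 = [("S", "NP VP"), ("NP", "astronomers"), ("VP", "VP PP"), ("VP", "V NP"),
      ("V", "saw"), ("NP", "stars"), ("PP", "P NP"), ("P", "with"),
      ("NP", "ears")]

print(tree_prob(t1))                  # ~0.0009072
print(tree_prob(t2))                  # ~0.0006804
print(tree_prob(t1) + tree_prob(t2))  # ~0.0015876 = P(w15)
```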

Page 121: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Chomsky Normal Form
• All rules are of the form X → Y Z or X → w.
• A transformation to this form doesn’t change the generative capacity of CFGs.
– With some extra book-keeping in symbol names, you can even reconstruct the same trees with a detransform
– Unaries/empties are removed recursively
– N-ary rules introduce new nonterminals:
• VP → V NP PP becomes VP → V @VP-V and @VP-V → NP PP (sketched below)
• In practice it’s a pain
– Reconstructing n-aries is easy
– Reconstructing unaries can be trickier
• But it makes parsing easier/more efficient

9/20/2010 121 CS730: Text Mining for Social Media, F2010
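A minimal sketch of the n-ary binarization step just described (unary removal not shown); the intermediate-symbol naming loosely follows the @VP->_V style used on the binarization slides.

```python
# Split an n-ary rule into a chain of binary rules via new "@" nonterminals
# that record the already-seen prefix of the RHS.
def binarize(lhs, rhs):
    rules, left, seen = [], lhs, []
    while len(rhs) > 2:
        seen.append(rhs[0])
        new = "@%s->_%s" % (lhs, "_".join(seen))
        rules.append((left, (rhs[0], new)))
        left, rhs = new, rhs[1:]
    rules.append((left, tuple(rhs)))
    return rules

for rule in binarize("VP", ("V", "NP", "PP")):
    print(rule)
# ('VP', ('V', '@VP->_V'))
# ('@VP->_V', ('NP', 'PP'))
```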

Page 122: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Treebank binarization

[Diagram: N-ary Trees in the Treebank are converted to Binary Trees (TreeAnnotations.annotateTree), from which the Lexicon and Grammar are read off for Parsing.]

9/20/2010 122 CS730: Text Mining for Social Media, F2010

Page 123: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

An example: before binarization…

(ROOT (S (NP (N cats)) (VP (V scratch) (NP (N people)) (PP (P with) (NP (N claws))))))

9/20/2010 123 CS730: Text Mining for Social Media, F2010

Page 124: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

After binarization…

(ROOT (S (NP (N cats)) (@S->_NP (VP (V scratch) (@VP->_V (NP (N people)) (@VP->_V_NP (PP (P with) (@PP->_P (NP (N claws))))))))))

9/20/2010 124 CS730: Text Mining for Social Media, F2010

Page 125: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Probabilistic Trees
• Instead of the lightest-weight tree, take the highest-probability tree.
• Given any tree, your assignment 1 generator would have some probability of producing it!
• Just like using n-grams to choose among strings …
• What is the probability of this tree?

(S (NP time) (VP (VP flies) (PP (P like) (NP (Det an) (N arrow)))))

9/20/2010 125 CS730: Text Mining for Social Media, F2010

Page 126: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Chain rule: One node at a time

9/20/2010 CS730: Text Mining for Social Media, F2010 126

The tree probability is built up by expanding one node at a time, conditioning each expansion on the entire partial tree generated so far:

p(tree | S) = p(S → NP VP | S)
            × p(NP expands to “time” | partial tree so far)
            × p(VP → VP PP | partial tree so far)
            × p(VP expands to “flies” | partial tree so far)
            × …

Page 127: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Chain rule + backoff

9/20/2010 CS730: Text Mining for Social Media, F2010 127

The same chain-rule decomposition, but each conditional probability is backed off: instead of conditioning on the entire partial tree generated so far, each expansion is conditioned on only part of that context.

Page 128: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Simplified notation

9/20/2010 CS730: Text Mining for Social Media, F2010 128

With the PCFG independence assumption, each factor is conditioned only on the nonterminal being rewritten:

p(tree | S) = p(S → NP VP | S) × p(NP → time | NP)
            × p(VP → VP PP | VP) × p(VP → flies | VP) × …

Page 129: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Already have a CKY alg for weights …

The same decomposition works additively with weights:

w(tree | S) = w(S → NP VP) + w(NP → time)
            + w(VP → VP PP) + w(VP → flies) + …

Just let w(X → Y Z) = -log p(X → Y Z | X). Then the lightest tree has the highest probability. (A weighted CKY sketch follows below.)

9/20/2010 129 CS730: Text Mining for Social Media, F2010
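A sketch of weighted ("Viterbi") CKY along those lines: weights add along a tree, and only the lightest entry per category is kept in each cell, together with a backpointer. The grammar and lexicon weights are the ones from the "time flies like an arrow" chart example earlier in the lecture.

```python
from collections import defaultdict

# Weighted CKY: keep only the best-in-class (lightest) entry per category
# per cell, plus a backpointer so the parse can be recovered.
LEX = {"time": {"NP": 3, "Vst": 3}, "flies": {"NP": 4, "VP": 4},
       "like": {"P": 2, "V": 5}, "an": {"Det": 1}, "arrow": {"N": 8}}
RULES = [("S", "NP", "VP", 1), ("S", "Vst", "NP", 6), ("S", "S", "PP", 2),
         ("VP", "V", "NP", 1), ("VP", "VP", "PP", 2),
         ("NP", "Det", "N", 1), ("NP", "NP", "PP", 2), ("NP", "NP", "NP", 3),
         ("PP", "P", "NP", 0)]

def viterbi_cky(words):
    n = len(words)
    best = defaultdict(dict)          # best[(i, j)][category] = (weight, backpointer)
    for i, w in enumerate(words):
        for cat, wt in LEX[w].items():
            best[(i, i + 1)][cat] = (wt, w)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lhs, b, c, wt in RULES:
                    if b in best[(i, k)] and c in best[(k, j)]:
                        total = wt + best[(i, k)][b][0] + best[(k, j)][c][0]
                        if lhs not in best[(i, j)] or total < best[(i, j)][lhs][0]:
                            best[(i, j)][lhs] = (total, (b, k, c))
    return best

chart = viterbi_cky("time flies like an arrow".split())
print(chart[(0, 5)]["S"][0])   # 22, the weight of the best S in the chart example
```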

Page 130: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Pruning for Speed

• Heuristically throw away constituents that probably won’t make it into the best complete parse.
• Use probabilities to decide which ones.
– So probs are useful for speed as well as accuracy!
• Both safe and unsafe methods exist:
– Throw x away if p(x) < 10^-200 (and lower this threshold if we don’t get a parse)
– Throw x away if p(x) < 100 * p(y) for some y that spans the same set of words
– Throw x away if p(x)*q(x) is small, where q(x) is an estimate of the probability of all rules needed to combine x with the other words in the sentence
(A tiny illustration of the first two thresholds follows below.)

9/20/2010 130CS730: Text Mining for Social Media, F2010
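A tiny illustration of the first two heuristics on one chart cell. The relative threshold is read here as "drop an entry that is more than 100 times less probable than the best entry over the same span"; the numbers are made up.

```python
# Prune a cell's entries by an absolute probability floor and a relative
# beam against the best entry spanning the same words.
def prune_cell(entries, floor=1e-200, beam=100.0):
    """entries: dict mapping constituent label -> probability for one span."""
    best = max(entries.values())
    return {label: p for label, p in entries.items()
            if p >= floor and p * beam >= best}

cell = {"NP": 3e-8, "S": 2e-12, "VP": 4e-10}
print(prune_cell(cell))   # {'NP': 3e-08, 'VP': 4e-10}; S is >100x worse, pruned
```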

Page 131: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Agenda (“Best-First”) Parsing

• Explore best options first
– Should get some good parses early on – grab one & go!
• Prioritize constits (and dotted constits)
– Whenever we build something, give it a priority
• How likely do we think it is to make it into the highest-prob parse?
– usually related to log prob. of that constit
– might also hack in the constit’s context, length, etc.
– if priorities are defined carefully, obtain an A* algorithm
• Put each constit on a priority queue (heap)
• Repeatedly pop and process the best constituent.
– CKY style: combine w/ previously popped neighbors.
– Earley style: scan/predict/attach as usual. What else?

9/20/2010 131CS730: Text Mining for Social Media, F2010

Page 132: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Preprocessing

• First “tag” the input with parts of speech:
– Guess the correct preterminal for each word, using HMMs
– Now only allow one part of speech per word
– This eliminates a lot of crazy constituents!
– But if you tagged wrong you could be hosed
• Raise the stakes:
– What if the tag says not just “verb” but “transitive verb”? Or “verb with a direct object and 2 PPs attached”? (“supertagging”)
• Safer to allow a few possible tags per word, not just one …

9/20/2010 132CS730: Text Mining for Social Media, F2010

Page 133: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

How good are PCFGs?

• Robust (usually admit everything, but with low probability)

• Partial solution for grammar ambiguity: a PCFG gives some idea of the plausibility of a sentence

• But not so good because the independence assumptions are too strong

• Gives a probabilistic language model
– But in a simple case it performs worse than a trigram model
• The problem seems to be that it lacks the lexicalization of a trigram model

9/20/2010 133CS730: Text Mining for Social Media, F2010

Page 134: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Lexicalization

• Lexical heads are important for certain classes of ambiguities (e.g., PP attachment):

• Lexicalizing the grammar creates a much larger grammar.
– Sophisticated smoothing needed
– Smarter parsing algorithms needed
– More data needed
• How necessary is lexicalization?
– Bilexical vs. monolexical selection
– Closed vs. open class lexicalization

9/20/2010 134CS730: Text Mining for Social Media, F2010

Page 135: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Lexicalized Parsing

• peel the apple on the towel
– ambiguous
• put the apple on the towel
– on attaches to put (is the other reading even possible?)
• put the apple on the towel in the box
• VP[head=put] → V[head=put] NP PP
• VP[head=put] → V[head=put] NP PP[head=on]
• study the apple on the towel
– study dislikes on (how can the PCFG express this?)
• VP[head=study] → VP[head=study] PP[head=on]
• study it on the towel
– it dislikes on even more – a PP can’t attach to a pronoun

9/20/2010 135CS730: Text Mining for Social Media, F2010

Page 136: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Lexicalized Parsing

• the plan that Natasha would swallow
– ambiguous between content of plan and relative clause
• the plan that Natasha would snooze
– snooze dislikes a direct object (plan)
• the plan that Natasha would make
– make likes a direct object (plan)
• the pill that Natasha would swallow
– pill can’t express a content-clause the way plan does
– pill is a probable direct object for swallow
• How to express these distinctions in a CFG or PCFG?

9/20/2010 136CS730: Text Mining for Social Media, F2010

Page 137: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Putting words into PCFGs

• A PCFG uses the actual words only to determine the probability of parts-of-speech (the preterminals)

• In many cases we need to know about words to choose a parse

• The head word of a phrase gives a good representation of the phrase’s structure and meaning– Attachment ambiguities

The astronomer saw the moon with the telescope

– Coordination: the dogs in the house and the cats

– Subcategorization frames: put versus like

9/20/2010 137CS730: Text Mining for Social Media, F2010

Page 138: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

(Head) Lexicalization
• put takes both an NP and a PP
– Sue put [ the book ]NP [ on the table ]PP
– * Sue put [ the book ]NP
– * Sue put [ on the table ]PP
• like usually takes an NP and not a PP
– Sue likes [ the book ]NP
– * Sue likes [ on the table ]PP
• We can’t tell this if we just have a VP with a verb, but we can if we know what verb it is

9/20/2010 138CS730: Text Mining for Social Media, F2010

Page 139: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

(Head) Lexicalization

• Collins 1997, Charniak 1997
• Puts the properties of words into a PCFG

[Lexicalized tree: (S[walked] (NP[Sue] Sue) (VP[walked] (V[walked] walked) (PP[into] (P[into] into) (NP[store] (DT[the] the) (NP[store] store)))))]

9/20/2010 139 CS730: Text Mining for Social Media, F2010

Page 140: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Lexicalization sharpens probabilities: rule expansion

• E.g., probability of different verbal complement frames (often called “subcategorizations”):

Local Tree        come    take    think   want
VP → V            9.5%    2.6%    4.6%    5.7%
VP → V NP         1.1%    32.1%   0.2%    13.9%
VP → V PP         34.5%   3.1%    7.1%    0.3%
VP → V SBAR       6.6%    0.3%    73.0%   0.2%
VP → V S          2.2%    1.3%    4.8%    70.8%
VP → V NP S       0.1%    5.7%    0.0%    0.3%
VP → V PRT NP     0.3%    5.8%    0.0%    0.0%
VP → V PRT PP     6.1%    1.5%    0.2%    0.0%

9/20/2010 140CS730: Text Mining for Social Media, F2010

Page 141: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Lexicalization sharpens probabilities: Predicting heads

“Bilexical probabilities”

• p(prices | n-plural) = .013
• p(prices | n-plural, NP) = .013
• p(prices | n-plural, NP, S) = .025
• p(prices | n-plural, NP, S, v-past) = .052
• p(prices | n-plural, NP, S, v-past, fell) = .146

9/20/2010 141CS730: Text Mining for Social Media, F2010

Page 142: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Naïve Lexicalized Parsing

• Can, in principle, use CKY on lexicalized PCFGs– O(Rn3) time and O(Sn2) memory– But R = rV2 and S = sV– Result is completely impractical (why?)– Memory: 10K rules * 50K words * (40 words)2 * 8 bytes ≈ 6TB

• Can modify CKY to exploit lexical sparsity– Lexicalized symbols are a base grammar symbol and a pointer

into the input sentence, not any arbitrary word– Result: O(rn5) time, O(sn3)– Memory: 10K rules * (40 words)3 * 8 bytes ≈ 5GB

9/20/2010 142CS730: Text Mining for Social Media, F2010
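The back-of-the-envelope memory estimates above, reproduced as a quick check (assuming 8-byte cells, a 10K-rule grammar, a 50K-word vocabulary, and a 40-word sentence).

```python
# Memory estimates from the slide: naive lexicalization vs. symbol+position.
rules, vocab, length, cell_bytes = 10_000, 50_000, 40, 8

naive  = rules * vocab * length**2 * cell_bytes   # lexicalized symbols range over all words
sparse = rules * length**3 * cell_bytes           # lexicalized symbols point into the sentence
print(naive / 1e12, "TB")    # ~6.4 TB
print(sparse / 1e9, "GB")    # ~5.1 GB
```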

Page 143: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Charniak (1997) linear interpolation/shrinkage

9/20/2010 CS730: Text Mining for Social Media, F2010 143

Page 144: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Charniak (1997) shrinkage example

9/20/2010 CS730: Text Mining for Social Media, F2010 144

Page 145: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Lexicalized Parsing was seen as the breakthrough of the late 90s

• Eugene Charniak, 2000 JHU workshop: “To do better, it is necessary to condition probabilities on the actual words of the sentence. This makes the probabilities much tighter:– p(VP V NP NP) = 0.00151– p(VP V NP NP | said) = 0.00001– p(VP V NP NP | gave) = 0.01980 ”

• Michael Collins, 2003 COLT tutorial: “Lexicalized Probabilistic Context-Free Grammars … perform vastly better than PCFGs (88% vs. 73% accuracy)”

9/20/2010 145CS730: Text Mining for Social Media, F2010

Page 146: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Lexicalized parsing results (Labeled Constituent Precision/Recall F1)

• Demo: http://nlp.stanford.edu:8080/parser/

9/20/2010 CS730: Text Mining for Social Media, F2010 146

Page 147: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Sparseness & the Penn Treebank

• The Penn Treebank – 1 million words of parsed English WSJ – has been a key resource (because of the widespread reliance on supervised learning)

• But 1 million words is like nothing:
– 965,000 constituents, but only 66 WHADJP, of which only 6 aren’t how much or how many; yet there is an infinite space of these

• How clever/original/incompetent (at risk assessment and evaluation) …

• Most of the probabilities that you would like to compute, you can’t compute

9/20/2010 147CS730: Text Mining for Social Media, F2010

Page 148: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Sparseness & the Penn Treebank (2)

• Many parse preferences depend on bilexical statistics: likelihoods of relationships between pairs of words (compound nouns, PP attachments, …)

• Extremely sparse, even on topics central to the WSJ:
– stocks plummeted: 2 occurrences
– stocks stabilized: 1 occurrence
– stocks skyrocketed: 0 occurrences
– #stocks discussed: 0 occurrences

• So far there has been very modest success in augmenting the Penn Treebank with extra unannotated materials or using semantic classes – once there is more than a little annotated training data. – Cf. Charniak 1997, Charniak 2000; but see McClosky et al. 2006

9/20/2010 148CS730: Text Mining for Social Media, F2010

Page 149: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Motivating discriminative parsing

• In discriminative models, it is easy to incorporate different kinds of features
– Often just about anything that seems linguistically interesting
• In generative models, it’s often difficult, and the model suffers because of false independence assumptions

• This ability to add informative features is the real power of discriminative models for NLP.

9/20/2010 149CS730: Text Mining for Social Media, F2010

Page 150: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Discriminative Parsers

• Discriminative Dependency Parsing
– Not as computationally hard (tiny grammar constant)
– Explored considerably recently. E.g. McDonald et al. 2005
• Make parser action decisions discriminatively
– E.g. with a shift-reduce parser
• Dynamic-programming Phrase Structure Parsing
– Resource intensive! Most work on sentences of length <= 15
– The need to be able to dynamic-program limits the feature types you can use
• Post-Processing: Parse reranking
– Just work with the output of a k-best generative parser
1. Distribution-free methods
2. Probabilistic model methods

9/20/2010 150CS730: Text Mining for Social Media, F2010

Page 151: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Charniak and Johnson (ACL 2005): Coarse-to-fine n-best parsing and MaxEnt discriminative reranking

• Builds a maxent discriminative reranker over parses produced by (a slightly bugfixed and improved version of) Charniak (2000).
• Gets the 50 best parses from the Charniak (2000) parser
– Doing this exploits the “coarse-to-fine” idea to heuristically find good candidates
• The maxent model for reranking uses heads, etc. as in the generative model, but also nice linguistic features:
– Conjunct parallelism
– Right branching preference
– Heaviness (length) of constituents factored in

• Gets 91% LP/LR F1 (on all sentences! – up to 80 wd)

9/20/2010 151CS730: Text Mining for Social Media, F2010

Page 152: CS 730:  Text Mining  for  Social Media  & Collaboratively Generated Content

Readings for Next Week

• FSNLP Chapters 11 and 12

9/20/2010 CS730: Text Mining for Social Media, F2010 152