LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 5 th

Upload: ruth-riley

Post on 29-Jan-2016


LING/C SC 581: Advanced Computational Linguistics

Lecture Notes
Feb 5th


Today's Topics

• A note on Java
• Tregex homework discussion
• Treebanks and statistical parsers
• Homework
  – One more exercise on tregex
  – Install the Bikel-Collins Parser


Java and tregex on OSX

• If you're running OS X Mavericks (10.9), Java 7 is the default.

• But the latest tregex (3.5.1) requires Java 8; under Java 7 it fails with:
  – Unsupported major.minor version 52.0

• Solution:
  – Install Java 8 (JRE) from Oracle directly
  – It installs in /Library/Internet\ Plug-Ins/
  – Modify the path to java in run-tregex-gui.command as follows:

#!/bin/sh
/Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -mx300m -cp `dirname $0`/stanford-tregex.jar edu.stanford.nlp.trees.tregex.gui.TregexGUI
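The number in the error message is the class-file format version: major version 52 corresponds to Java 8, and 51 to Java 7. As a rough illustration (not from the slides), the Python sketch below reads that version from the first eight bytes of a .class file. The function names are made up for this example.

```python
import struct

def class_file_version(header):
    """Return (major, minor) from the first 8 bytes of a Java .class file."""
    # Layout: 4-byte magic 0xCAFEBABE, 2-byte minor, 2-byte major (big-endian).
    magic, minor, major = struct.unpack(">IHH", header[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major, minor

def java_release(major):
    """Map a class-file major version to a Java release (Java 5 and later)."""
    return major - 44

# Header of a class compiled for Java 8: magic, minor 0, major 52 (0x34).
header = bytes.fromhex("cafebabe00000034")
major, minor = class_file_version(header)
print(f"{major}.{minor} -> Java {java_release(major)}")
```

A Java 7 runtime refuses any class whose major version is above 51, which is why the newer tregex jar needs the Java 8 binary.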


Homework Discussion

• Useful command-line tool
  – diff <file1> <file2>
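For readers without a Unix diff at hand, roughly the same comparison can be sketched with Python's standard difflib module. The two result lists below are invented for illustration (one match ID per line, as if saved from two searches); they are not actual homework output.

```python
import difflib

# Invented result lists from two searches, one match ID per line.
run1 = ["wsj_0267.mrg-30", "wsj_0591.mrg-21", "wsj_1154.mrg-4"]
run2 = ["wsj_0267.mrg-30", "wsj_1154.mrg-4"]

diff = list(difflib.unified_diff(run1, run2,
                                 fromfile="run1.txt", tofile="run2.txt",
                                 lineterm=""))
print("\n".join(diff))

# Lines present only in the first run (skip the '---' file header).
removed = [line[1:] for line in diff
           if line.startswith("-") and not line.startswith("---")]
print(removed)
```

Comparing the output of a broad search with that of a narrower one shows exactly which sentences the narrower pattern dropped.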


Homework Discussion

• page 268 of PRSGUID1.PDF

http://www.clips.ua.ac.be/pages/mbsp-tags

Functional tag -CLF indicates a true cleft


Homework Discussion

• Wish:
  – every construction was marked with its own tag

• So
  – -CLF looks easy …


Homework Discussion

• Types:

that is sometimes WHNP


Homework Discussion

• There are no SQ-CLF and SINV-CLF …

• Gapping?


Homework Discussion

• Search conservatively…
  – 62: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR))
  – 41: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WHNP/))

wsj_0267.mrg-30 It was also in law school that Mr. O'Kicki and his first wife had the first of seven daughters *T*-1 .


Homework Discussion

• Search conservatively…
  – 62: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR))
  – 57: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WH(ADV|NP)/))

wsj_0591.mrg-21 It is partly for this reason that the exchange last week began *-3 trading in its own stock `` basket '' product that *T*-2 allows big investors to buy or sell all 500 stocks in the Standard & Poor 's index in a single trade .

62: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /IN|(WH(ADV|NP))/))
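The three label patterns above differ only in which SBAR daughters they accept. The Python sketch below applies the same regular expressions to a few sample labels, assuming that a /regex/ node description matches anywhere inside a label (search-style matching), which the explicit ^ and $ anchors in /^[iI]t$/ suggest. The label list is chosen for illustration.

```python
import re

# The three regexes used inside the @SBAR daughter test on the slides.
patterns = ["WHNP", "WH(ADV|NP)", "IN|(WH(ADV|NP))"]

# Sample node labels, chosen for illustration.
labels = ["WHNP-1", "WHADVP-2", "IN", "SINV", "-NONE-"]

def matching_labels(regex, labels):
    """Return the labels in which the regex matches somewhere (unanchored)."""
    return [label for label in labels if re.search(regex, label)]

results = {regex: matching_labels(regex, labels) for regex in patterns}
for regex, matched in results.items():
    print(f"/{regex}/ matches {matched}")
```

Under this assumption an unanchored IN also accepts any label that merely contains those letters, such as SINV, so a stricter search might anchor it as ^IN$.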


Homework Discussion

• wsj_1154.mrg-4 It isn't every day that we hear a Violetta who *T*-1 can sing the first act's high-flying music with all the little notes perfectly pitched *-2 and neatly stitched *-2 together .

No promised temporal trace …


Homework Discussion

• 41: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WH(ADV|NP)-[0-9]+/))
• 57: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WH(ADV|NP).*-.*[0-9]+/))

wsj_0267.mrg-30 It was also in law school that Mr. O'Kicki and his first wife had the first of seven daughters *T*-1 .
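One reading of the jump from 41 to 57, consistent with the earlier /WHNP/ and /WH(ADV|NP)/ counts, is that the stricter regex never matches an indexed WHADVP label: after WHADV it expects a hyphen but finds the P. The Python sketch below checks both regexes against a few labels chosen for illustration.

```python
import re

STRICT = r"WH(ADV|NP)-[0-9]+"       # index must directly follow WHADV or WHNP
LOOSE = r"WH(ADV|NP).*-.*[0-9]+"    # allows extra characters before the index

labels = ["WHNP-1", "WHADVP-1", "WHNP"]

def matched(regex):
    """Labels in which the regex matches somewhere (unanchored search)."""
    return [label for label in labels if re.search(regex, label)]

print("strict:", matched(STRICT))
print("loose: ", matched(LOOSE))
```

The unindexed WHNP is rejected by both patterns, since each requires at least one digit after a hyphen.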


Homework Discussion

• 43: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WH(ADV|NP).*-.*([0-9]+)/#2%i << (/NP-SBJ/ < (/-NONE-/ < /\*T.*([0-9]+)/))))
• 39: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WH(ADV|NP).*-.*([0-9]+)/#2%i << (/NP-SBJ/ < (/-NONE-/ < /\*T.*([0-9]+)/#1%i))))

wsj_1655.mrg-4 Still , it was in Argentine editions that his countrymen first read his story of Pascal Duarte , a field worker who *T*-1 stabbed his mother to death and has no regrets as he awaits his end in a prison cell *T*-4
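In the second pattern, #2%i and #1%i bind a capture group from each regex to the same variable i, so the pattern matches only when the index on the wh-phrase and the index on the trace are the same string. The Python sketch below mimics that check with the two regexes from the slide; the helper name and the sample labels are made up for this example.

```python
import re

WH_LABEL = re.compile(r"WH(ADV|NP).*-.*([0-9]+)")  # group 2 captures the index
TRACE = re.compile(r"\*T.*([0-9]+)")               # group 1 captures the index

def coindexed(wh_label, trace):
    """True when both regexes match and capture the same index string."""
    wh = WH_LABEL.search(wh_label)
    tr = TRACE.search(trace)
    return bool(wh and tr and wh.group(2) == tr.group(1))

print(coindexed("WHADVP-4", "*T*-4"))  # same index
print(coindexed("WHADVP-4", "*T*-1"))  # different index

# The greedy .* in front of ([0-9]+) leaves only the final digit to the group.
last_digit = WH_LABEL.search("WHNP-12").group(2)
print(last_digit)
```

Because of that greedy behaviour, indices such as 12 and 2 would compare as equal; replacing the inner .* with something like [^0-9]* is one way to capture the whole number.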


Homework Discussion

• Wh-clefts


Homework Discussion

• Wh-clefts:
  – 28: /SBAR-NOM/ $+ @VP << /.*-PRD/

wsj_0415.mrg-5 Who that winner will be *T*-1 is highly uncertain .


Relevance of Treebanks

• Statistical parsers typically construct syntactic phrase structure
  – they’re trained on Treebank corpora like the Penn Treebank
• Note: some use dependency graphs, not trees


Parsers trained on the Treebank

• Don’t recover fully-annotated trees
  – not trained using nodes with indices or empty (-NONE-) nodes
  – not trained using functional tags, e.g. -SBJ

• Therefore they don’t fully parse
• Example: no SBAR node in … a movie to see

Stanford parser


Parsers trained on the Treebank

• SBAR can be forced by the presence of an overt relative pronoun, but note there is no subject gap:


Parsers trained on the Treebank

• Probabilities are estimated from frequency information of each node given surrounding context (e.g. parent node, or the word that heads the node)

• Still these systems have enormous problems with prepositional phrase (PP) attachment

• Example (borrowed from Igor Malioutov):
  – A boy with a telescope kissed Mary on the lips
  – Mary was kissed by a boy with a telescope on the lips

• PP with a telescope should adjoin to the noun phrase (NP) a boy
• PP on the lips should adjoin to the verb phrase (VP) headed by kiss


Active/passive sentences

• Examples using the Stanford Parser:

Both active and passive sentences are parsed incorrectly


Active/passive sentences

• Examples:

X on the lips modifies Mary
X on the lips modifies telescope


Homework Exercise

• Use tregex to find out how many passive sentences there are in the Treebank WSJ section
• Report your search formula and frequency count
• The passive construction (according to the Bracketing Guidelines)
  – Note: by-phrase containing logical subject (LGS) is optional


Treebank Rules

• Just how many rules are there in the WSJ treebank?

• What’s the most common POS tag?
• What’s the most common syntax rule?


Treebank

[Bar chart: WSJ POS tag frequencies. x-axis (POS): NN IN NNP DT -NONE- JJ NNS , . CD RB VBD VB CC TO VBZ VBN PRP VBG VBP MD POS PRP$. y-axis: Frequency, 0 to 180,000.]

Total # of tags: 1,253,013


Treebank

[Bar chart: WSJ POS tags. x-axis (POS): NN IN NNP DT -NONE- JJ NNS , . CD RB VBD VB CC TO VBZ VBN PRP VBG VBP MD POS PRP$. y-axis: Percentage, 0.0% to 14.0%.]

Category   Frequency   Percentage
N          355039      28.3%
V          154975      12.4%
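The percentages in the table follow from the total tag count given on the previous slide (1,253,013). A small Python check, using only numbers from the slides; the slides do not say which individual tags are grouped under N and V, so the sketch makes no assumption about that:

```python
TOTAL_TAGS = 1_253_013                       # total # of tags (from the slides)
category_counts = {"N": 355_039, "V": 154_975}

def percentage(count, total=TOTAL_TAGS):
    """Share of all tags, in percent, rounded to one decimal place."""
    return round(100 * count / total, 1)

shares = {cat: percentage(n) for cat, n in category_counts.items()}
for cat, share in shares.items():
    print(f"{cat}: {share}%")
```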


Treebank

[Bar chart: Treebank Grammar Rules. x-axis (Rule): S->NP-SBJ+VP, PP->IN+NP, NP-SBJ->-NONE, NP->NP+PP, NP->DT+NN, S->NP-SBJ+VP+., NP-SBJ->PRP, VP->TO+VP, PP-LOC->IN+NP, NP->-NONE, NP->NN, NP->NNS, PP-CLR->IN+NP, NP->DT+JJ+NN, NP->NNP, VP->MD+VP, SBAR->WHNP+S, VP->VB+NP, SBAR->-NONE+S, PP-TMP->IN+NP, ADVP->RB, NP->NNP+NNP, NP->JJ+NNS, NP-SBJ->DT+NN, VP->VBD+SBAR, NP-SBJ->NP+PP, SBAR->IN+S, S->-NONE, VP->VBZ+VP, NP-SBJ->NNP+NNP. y-axis: Frequency, 0 to 60,000.]

Total # of rules: 978,873
# of different rules: 31,338


Treebank

[Bar chart: Treebank Grammar Rules. x-axis (Rule): the same 30 rules as in the frequency chart on the previous slide, from S->NP-SBJ+VP to NP-SBJ->NNP+NNP. y-axis: Percentage, 0.0% to 6.0%.]


Treebank

[Bar chart: Treebank Grammar Rules, with rule labels that carry no functional tags. x-axis (Rules): PP->IN+NP, S->NP+VP, NP->, NP->NP+PP, NP->DT+NN, S->NP+VP+., NP->PRP, ADVP->RB, NP->NN, NP->NNS, VP->TO+VP, NP->NNP, NP->NNP+NNP, NP->DT+JJ+NN, SBAR->IN+S, VP->VB+NP, SBAR->WHNP+S, PP->TO+NP, VP->MD+VP, NP->JJ+NNS, SBAR->+S, VP->VBD+SBAR, NP->NP+SBAR, VP->VBN+NP+PP, NP->DT+NNS, NP->JJ+NN, S->, VP->VBZ+VP, VP->VBD+NP, VP->VBZ+NP. y-axis: Frequency, 0 to 100,000.]

Total # of rules: 978,873
# of different rules: 17,554


Treebank

[Bar chart: Treebank Grammar Rules. x-axis (Rules): the same 30 rules as in the frequency chart on the previous slide, from PP->IN+NP to VP->VBZ+NP. y-axis: Percentage, 0.0% to 12.0%.]
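The two rule charts report the same 978,873 rule tokens but 31,338 versus 17,554 different rules; the labels in the second chart carry no functional tags. The Python sketch below shows, on one toy tree, how stripping functional tags merges rules such as NP-SBJ->DT+NN and NP->DT+NN. It is a minimal illustration, not the procedure used for the slides: it assumes a single bracketed tree without the extra outer parentheses found in .mrg files, and it does not remove -NONE- elements.

```python
import re
from collections import Counter

def parse(tokens):
    """Parse one bracketed tree from a token list into (label, children)."""
    if tokens.pop(0) != "(":
        raise ValueError("expected '('")
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))
        else:
            children.append(tokens.pop(0))   # a word at a leaf
    tokens.pop(0)                            # consume ")"
    return (label, children)

def strip_label(label):
    """Drop functional tags and indices, e.g. NP-SBJ-1 -> NP (keeps -NONE-)."""
    if label.startswith("-"):
        return label
    return re.split(r"[-=]", label)[0]

def rules(tree, strip=False):
    """Yield one 'LHS->RHS1+RHS2' string per phrasal node."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    if not subtrees:                         # preterminal: POS tag over a word
        return
    fix = strip_label if strip else (lambda x: x)
    yield fix(label) + "->" + "+".join(fix(c[0]) for c in subtrees)
    for child in subtrees:
        yield from rules(child, strip)

text = "(S (NP-SBJ (DT the) (NN boy)) (VP (VBD saw) (NP (DT a) (NN dog))))"
tree = parse(re.findall(r"\(|\)|[^\s()]+", text))

with_tags = Counter(rules(tree))
without_tags = Counter(rules(tree, strip=True))
print(len(with_tags), "different rules with functional tags:", dict(with_tags))
print(len(without_tags), "different rules without:", dict(without_tags))
```

The number of rule tokens stays the same in both counts; only the number of distinct rule types shrinks, which matches the pattern in the slide totals.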


Today’s Topic

Let’s segue from Treebank search to stochastic parsers trained on the WSJ Penn Treebank.

Examples:
• Berkeley Parser
  – http://tomato.banatao.berkeley.edu:8080/parser/parser.html
• Stanford Parser
  – http://nlp.stanford.edu:8080/parser/

These are all trained on the Treebank.

We’ll play with Bikel’s implementation of Collins’s Parser …


Using the Treebank

• What is the grammar of the Treebank?
  – We can extract the phrase structure rules used, and
  – count the frequency of rules, and construct a stochastic parser
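The simplest way to turn rule frequencies into a stochastic grammar is the relative-frequency estimate: the probability of a rule is its count divided by the total count of rules with the same left-hand side. The Python sketch below shows that step with made-up counts; it illustrates a plain probabilistic context-free grammar, not the richer models used by the parsers discussed here.

```python
from collections import Counter, defaultdict

# Made-up rule counts for illustration; real counts would come from the Treebank.
rule_counts = Counter({
    ("NP", ("DT", "NN")): 6,
    ("NP", ("NNP",)): 3,
    ("NP", ("NP", "PP")): 1,
    ("PP", ("IN", "NP")): 4,
})

def rule_probabilities(counts):
    """Relative frequency: P(lhs -> rhs | lhs) = count(lhs -> rhs) / count(lhs)."""
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

probs = rule_probabilities(rule_counts)
for (lhs, rhs), p in probs.items():
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

By construction, the probabilities of all rules sharing a left-hand side sum to one.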


Using the Treebank

• Breakthrough in parsing accuracy with lexicalized trees
  – think of expanding the nonterminal names to include head information and the words that are at the leaves of the subtrees.
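To make the idea of expanded nonterminal names concrete, the Python sketch below annotates each node of a small tree with its head word, turning S into S(kissed) and so on. The head table is a toy assumption written for this example; it is not the head-finding rule set used by Collins' parser.

```python
# Toy head table, assumed for illustration only (not Collins' actual head rules):
# maps a parent label to the child label that supplies its head word.
HEAD_CHILD = {"S": "VP", "VP": "VBD", "NP": "NN", "PP": "IN"}

def lexicalize(tree):
    """Annotate every node label with its head word.

    A tree is (label, [children]); a preterminal is (tag, word).
    Returns (annotated_tree, head_word).
    """
    label, rest = tree
    if isinstance(rest, str):                  # preterminal: (tag, word)
        return (f"{label}({rest})", rest), rest
    annotated, head, last_word = [], None, None
    for child in rest:
        sub, word = lexicalize(child)
        annotated.append(sub)
        last_word = word
        if child[0] == HEAD_CHILD.get(label):
            head = word
    if head is None:                           # fallback: head of the last child
        head = last_word
    return (f"{label}({head})", annotated), head

tree = ("S", [("NP", [("DT", "a"), ("NN", "boy")]),
              ("VP", [("VBD", "kissed"), ("NP", [("NNP", "Mary")])])])
lexicalized, head = lexicalize(tree)
print(lexicalized[0])                 # root label with its head word
print([child[0] for child in lexicalized[1]])
```

With head words in the labels, rule statistics can distinguish, for example, a VP headed by kissed from a VP headed by another verb, at the cost of much sparser counts.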


Bikel Collins Parser

• Java re-implementation of Collins’ parser
• Paper
  – Daniel M. Bikel. 2004. Intricacies of Collins’ Parsing Model. Computational Linguistics, 30(4), pp. 479-511.
• Software
  – http://www.cis.upenn.edu/~dbikel/software.html#stat-parser (page no longer exists)


Bikel Collins

• Download and install Dan Bikel’s parser
  – dbp.zip (on course homepage)

• File: install.sh
  – Java code
  – but at this point I think Windows won’t work because of the shell script (.sh)
  – maybe after files are extracted?


Bikel Collins

• Download and install the POS tagger MXPOST

parser doesn’t actually need a separate tagger…


Bikel Collins

• Training the parser with the WSJ PTB
• See guide
  – userguide/guide.pdf

directory: TREEBANK_3/parsed/mrg/wsj
chapters 02-21: create one single .mrg file
events: wsj-02-21.obj.gz
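Creating the single training file amounts to concatenating every .mrg file in the subdirectories 02 through 21. A minimal Python sketch follows, assuming the usual layout in which each chapter is a two-digit subdirectory of wsj; the function name and the output file name in the usage comment are chosen for this example and do not come from the user guide.

```python
from pathlib import Path

def concatenate_sections(wsj_dir, out_file, first=2, last=21):
    """Concatenate wsj/02 .. wsj/21 *.mrg files, in sorted order, into one file.

    Returns the number of files copied.
    """
    wsj_dir = Path(wsj_dir)
    n_files = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for section in range(first, last + 1):
            section_dir = wsj_dir / f"{section:02d}"   # e.g. wsj/02
            for mrg in sorted(section_dir.glob("*.mrg")):
                out.write(mrg.read_text(encoding="utf-8"))
                n_files += 1
    return n_files

# Example call (paths are assumptions; adjust to your installation):
# concatenate_sections("TREEBANK_3/parsed/mrg/wsj", "wsj-02-21.mrg")
```

The resulting file would then serve as the training input from which the events file (wsj-02-21.obj.gz) is derived, following the steps in the user guide.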


Bikel Collins

• Settings:


Bikel Collins

• Parsing

– Command

– Input file format (sentences)


Bikel Collins

• Verify the trainer and parser work on your machine


Bikel Collins

• File: bin/parse is a shell script that sets up program parameters and calls java


Bikel Collins


Bikel Collins

• File: bin/train is another shell script


Bikel Collins

• Relevant WSJ PTB files