
Punctuation: Making a Point

in Unsupervised Dependency Parsing

Valentin I. Spitkovsky

with Daniel Jurafsky (Stanford University)

and Hiyan Alshawi (Google Inc.)

Spitkovsky et al. (Stanford & Google) Punctuation: Making a Point CoNLL (2011-06-23) 1 / 25

Example Raw Text

Example: Raw Word Stream

ALTHOUGH IT PROBABLY HAS REDUCED THE LEVEL OF EXPENDITURES FOR SOME PURCHASERS UTILIZATION MANAGEMENT LIKE MOST OTHER COST CONTAINMENT STRATEGIES DOESN’T APPEAR TO HAVE ALTERED THE LONG-TERM RATE OF INCREASE IN HEALTH-CARE COSTS THE INSTITUTE OF MEDICINE AN AFFILIATE OF THE NATIONAL ACADEMY OF SCIENCES CONCLUDED AFTER A TWO-YEAR STUDY

Example Unformatted Text

Example:

formatting (missing structural cues): e.g., punctuation and capitalization

raw word streams are often difficult even for humans: e.g., transcribed utterances (Kim and Woodland, 2002)

Example Unlexicalized Tokens

Example:

IN PRP RB VBZ VBN DT NN IN NNS IN DT NNS NN NN IN RBS JJ NN NN NNS VBZ RB VB TO VB VBN DT JJ NN IN NN IN JJ NNS DT NNP IN NNP DT NN IN DT NNP NNP IN NNPS VBD IN DT JJ NN

Example Formatted Text

Example:

[SBAR Although it probably has reduced the level of expenditures for some purchasers], [NP utilization management] -- [PP like most other cost containment strategies] -- [VP doesn’t appear to have altered the long-term rate of increase in health-care costs], [NP the Institute of Medicine], [NP an affiliate of the National Academy of Sciences], [VP concluded after a two-year study].

Intuition Strong Cues

Intuition:

punctuation is a strong structural cue: it demarcates separable fragments

we will make simplifying independence assumptions: (unreasonably) strong in training

less crude in inference: (reasonably) weak in final decoding
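
To make "separable fragments" concrete, here is a minimal Python sketch that splits a tokenized sentence into inter-punctuation fragments. The helper names (`is_punct`, `fragments`) and the rule that a token is punctuation when it consists only of ASCII punctuation characters are assumptions for illustration, not taken from the talk.

```python
import string

def is_punct(tok):
    # Assumption: a token is punctuation if all its characters are ASCII punctuation.
    return all(ch in string.punctuation for ch in tok)

def fragments(tokens):
    """Group token indices into maximal runs of words between punctuation marks."""
    out, cur = [], []
    for i, tok in enumerate(tokens):
        if is_punct(tok):
            if cur:              # close the current fragment at a punctuation mark
                out.append(cur)
                cur = []
        else:
            cur.append(i)
    if cur:                      # sentence did not end in punctuation
        out.append(cur)
    return out

tokens = ("Other countries , including West Germany , may have a hard "
          "time justifying continued membership .").split()
frags = fragments(tokens)
for f in frags:
    print(" ".join(tokens[i] for i in f))
```

Hyphenated words such as "long-term" stay inside a fragment under this rule because they contain letters.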

Intuition Strong Assumption

Intuition:

strong constraint: (head ← head) in training

word head , head word word ,

head word word word word word word word .

Other countries , including West Germany ,

may have a hard time justifying continued membership .

Intuition Weak Assumption

Intuition:

weak constraint: (head ← external word) in inference

word word head word word word ,

head word word word word word word word .

IFI also has nonvoting preferred shares ,

which are quoted on the Milan stock exchange .
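
One possible reading of the two constraints can be sketched in Python. This is an interpretation of the slides' notation, not the authors' implementation. Here "weak" means each fragment has exactly one word whose parent lies outside it, and "strong" additionally requires that parent to be the root or another fragment's head. The function names, the shortened sentences and the dependency trees are all assumptions for illustration.

```python
def external_heads(frag, heads):
    """Words of a fragment whose parent lies outside it (0 denotes the root)."""
    inside = set(frag)
    return [w for w in frag if heads[w] not in inside]

def satisfies_weak(frags, heads):
    """Weak: each fragment has exactly one word attached to an external word."""
    return all(len(external_heads(f, heads)) == 1 for f in frags)

def satisfies_strong(frags, heads):
    """Strong: weak, and every fragment head attaches to the root
    or to the head of another fragment."""
    if not satisfies_weak(frags, heads):
        return False
    frag_heads = {external_heads(f, heads)[0] for f in frags}
    return all(heads[h] == 0 or heads[h] in frag_heads for h in frag_heads)

# Hypothetical trees over 1-based word positions (punctuation dropped).
# "Other countries , including West Germany , may have a hard time"
frags_a = [[1, 2], [3, 4, 5], [6, 7, 8, 9, 10]]
heads_a = {1: 2, 2: 6, 3: 2, 4: 5, 5: 3, 6: 0, 7: 6, 8: 10, 9: 10, 10: 7}

# "IFI also has shares , which are quoted"
frags_b = [[1, 2, 3, 4], [5, 6, 7]]
heads_b = {1: 3, 2: 3, 3: 0, 4: 3, 5: 4, 6: 5, 7: 6}

print(satisfies_strong(frags_a, heads_a), satisfies_weak(frags_a, heads_a))
print(satisfies_strong(frags_b, heads_b), satisfies_weak(frags_b, heads_b))
```

In the second tree the relative clause attaches to "shares", which is not the head of its fragment, so only the weak constraint holds.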

Linguistic Analysis Constituents

Linguistic Analysis:

punctuation and syntax are related (Nunberg, 1990; Briscoe, 1994; Jones, 1994; Doran, 1998, inter alia)

49.4% of inter-punctuation fragments are constituents

lowest dominating non-terminals (%):

S      32.5
NP     27.2
VP     13.3
PP     10.1
SBAR    6.7
ADVP    3.3
QP      2.5
SINV    2.0
ADJP    1.0
total  98.5

Linguistic Analysis Strong Dependencies

Linguistic Analysis:

strong (in training), e.g.,

... arrests followed a “ Snake Day ” at Utrecht ...

— already 74.0% agreement with head-percolated trees

Linguistic Analysis Weak Dependencies

Linguistic Analysis:

weak (in inference), e.g.,

Maryland Club also distributes tea , which ...

— now 92.9% agreement with head-percolated trees

Linguistic Analysis Violations

Linguistic Analysis:

generalization: no path from the root may enter a fragment twice (95.0% agreement with head-percolated trees)

simple violations: “seamless” quotations and even lists

Her recent report classifies the stock as a “hold.”

The company said its directors , management and subsidiaries will remain long-term investors and ...
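
The generalized condition can be checked mechanically. The Python sketch below walks every root-to-word path and reports whether any path leaves a fragment and later returns to it. The function name `enters_twice`, the shortened sentences and both dependency trees are assumptions for illustration. The input is assumed to be a well-formed tree, with 0 standing for the root.

```python
def enters_twice(frags, heads):
    """True if some root-to-word path enters the same fragment more than once."""
    frag_of = {w: k for k, f in enumerate(frags) for w in f}
    for w in heads:
        # Collect fragment ids from w up to the root, then reverse to go root-down.
        path, node = [], w
        while node != 0:
            path.append(frag_of[node])
            node = heads[node]
        path.reverse()
        seen, prev = set(), None
        for k in path:
            if k != prev:        # the path crosses a fragment boundary here
                if k in seen:
                    return True  # re-entering a fragment it already left
                seen.add(k)
            prev = k
    return False

# Hypothetical list example: "its directors , management and subsidiaries will remain"
frags_list = [[1, 2], [3, 4, 5, 6, 7]]
heads_list = {1: 2, 2: 6, 3: 2, 4: 2, 5: 2, 6: 0, 7: 6}

# Hypothetical non-violating example: "Other countries , including West Germany , may have"
frags_ok = [[1, 2], [3, 4, 5], [6, 7]]
heads_ok = {1: 2, 2: 6, 3: 2, 4: 5, 5: 3, 6: 0, 7: 6}

print(enters_twice(frags_list, heads_list), enters_twice(frags_ok, heads_ok))
```

In the list example the path runs from "will" (second fragment) to "directors" (first fragment) and back to "management" (second fragment), which is the kind of violation the slide attributes to lists.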

Motivation

Motivation: “Profiting from Markup”

..., whereas McCain is secure on the topic, Obama <a>[VP worries about winning the pro-Israel vote]</a>.

“Capitalizing on Punctuation”: more common (particularly in long sentences); more uniform (better coverage of constructs)

The Problem Input/Output

Problem: Unsupervised Learning of Parsing

NN      NNS      VBD  IN NN        ♦
|       |        |    |  |         |
Factory payrolls fell in September .

Input: Raw Text (Sentences, Tokens and POS-tags)

... By most measures, the nation’s industrial sector is now growing very slowly -- if at all. Factory payrolls fell in September. So did the Federal Reserve ...

Output: Syntactic Structures (and a Probabilistic Grammar)

Methodology Scoring

Scoring: Directed Dependency Accuracy

NN      NNS      VBD  IN NN        ♦
|       |        |    |  |         |
Factory payrolls fell in September .

Directed score: 3/5 = 60% (right/left-branching baselines: 2/5 = 40%).
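
The arithmetic behind these scores can be reproduced with a short Python sketch. The dependency arcs did not survive in this transcript, so the gold heads below are an assumed analysis of the sentence, and the `predicted` parse is purely hypothetical. The baseline definitions (each word attaching to its left or right neighbor) are also assumptions, although under the assumed gold tree both do come out at 2/5.

```python
def directed_accuracy(gold, pred):
    """Fraction of words whose predicted head matches the gold head."""
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)

# Words: 1 Factory, 2 payrolls, 3 fell, 4 in, 5 September (0 = root).
gold = [2, 3, 0, 3, 4]             # assumed gold heads

right_branching = [0, 1, 2, 3, 4]  # first word is root; each word attaches to its left neighbor
left_branching = [2, 3, 4, 5, 0]   # last word is root; each word attaches to its right neighbor
predicted = [2, 3, 0, 5, 3]        # hypothetical parse with two wrong arcs

print(directed_accuracy(gold, predicted))        # 3 of 5 heads correct
print(directed_accuracy(gold, right_branching))  # 2 of 5
print(directed_accuracy(gold, left_branching))   # 2 of 5
```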

Methodology Model

DMV: Dependency Model with Valence

a head-outward model, with word classesand valence/adjacency (Klein and Manning, 2004)

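[Figure: head h with arguments a1 and a2, followed by STOP]

$$
P(t_h) = \prod_{dir \in \{L,R\}} P_{\mathrm{STOP}}\bigl(c_h, dir, \overbrace{\mathbb{1}_{n=0}}^{adj}\bigr) \prod_{i=1}^{n} P(t_{a_i}) \, P_{\mathrm{ATTACH}}(c_h, dir, c_{a_i}) \, \Bigl(1 - P_{\mathrm{STOP}}\bigl(c_h, dir, \overbrace{\mathbb{1}_{i=1}}^{adj}\bigr)\Bigr), \qquad n = |args(h, dir)|
$$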
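As a rough illustration of this factorization, the following Python sketch evaluates P(t_h) recursively on a toy tree. The `Node` class, the dictionary layout of `P_STOP` and `P_ATTACH`, and every probability value are assumptions made for this example only; none of them come from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A dependency subtree: the head's word class plus its arguments."""
    cls: str
    left: list = field(default_factory=list)   # left arguments, nearest the head first
    right: list = field(default_factory=list)  # right arguments, nearest the head first


def subtree_prob(node, p_stop, p_attach):
    """Probability P(t_h) of the subtree rooted at `node`."""
    prob = 1.0
    for direction, args in (("L", node.left), ("R", node.right)):
        n = len(args)
        # Final STOP in this direction; adjacent only if no argument was generated.
        prob *= p_stop[(node.cls, direction, n == 0)]
        for i, arg in enumerate(args, start=1):
            prob *= 1.0 - p_stop[(node.cls, direction, i == 1)]  # decide not to stop
            prob *= p_attach[(node.cls, direction, arg.cls)]      # choose the argument's class
            prob *= subtree_prob(arg, p_stop, p_attach)           # generate the argument's subtree
    return prob


# Hypothetical parameters, keyed by (head class, direction, adjacent?).
P_STOP = {
    ("V", "L", True): 0.4, ("V", "L", False): 0.8,
    ("V", "R", True): 0.5, ("V", "R", False): 0.9,
    ("N", "L", True): 0.7, ("N", "L", False): 0.9,
    ("N", "R", True): 0.8, ("N", "R", False): 0.9,
}
# Hypothetical attachment parameters, keyed by (head class, direction, argument class).
P_ATTACH = {("V", "L", "N"): 1.0, ("V", "R", "N"): 1.0}

# A verb head with one noun argument on each side.
tree = Node("V", left=[Node("N")], right=[Node("N")])
print(subtree_prob(tree, P_STOP, P_ATTACH))
```

The adjacency flag is what lets the model treat the first argument in a direction differently from later ones: the same head class has separate stopping probabilities for "nothing attached yet" and "something already attached".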

Learning: Viterbi EM

well-suited to long sentences, which are more punctuation-rich (Spitkovsky et al., CoNLL 2010)

fast, simple and easily admits constraints (Spitkovsky et al., ACL 2010)
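Viterbi (hard) EM alternates between picking the single best latent structure under the current parameters and re-estimating the parameters from those choices. In parsing, the latent structure is a parse tree; the toy sketch below swaps in a much simpler latent variable (which of two biased coins produced each batch of flips) so that the loop fits in a few lines. The function name, the data, the starting values, and the add-one smoothing are all assumptions for illustration, not the authors' setup.

```python
import math


def viterbi_em(batches, theta, iterations=10):
    """Hard EM for a two-coin mixture.

    batches: list of (heads, tails) counts, one per batch of flips.
    theta:   initial head probabilities, one per coin.
    """
    for _ in range(iterations):
        # Viterbi E-step: commit each batch to its single most likely coin.
        counts = [[0, 0] for _ in theta]
        for heads, tails in batches:
            scores = [heads * math.log(p) + tails * math.log(1.0 - p) for p in theta]
            best = max(range(len(theta)), key=scores.__getitem__)
            counts[best][0] += heads
            counts[best][1] += tails
        # M-step: re-estimate each coin from the batches assigned to it
        # (add-one smoothing keeps every probability strictly inside (0, 1)).
        theta = [(h + 1) / (h + t + 2) for h, t in counts]
    return theta


# Hypothetical data: two head-heavy batches and two tail-heavy batches.
BATCHES = [(9, 1), (8, 2), (2, 8), (1, 9)]
print(viterbi_em(BATCHES, [0.6, 0.4]))
```

Because the E-step yields one concrete structure per example, a constraint can be imposed simply by restricting which structures the search is allowed to return.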

Constraints: Parser Induction

the model, i.e., projective trees (Klein and Manning, 2004)

— Dependency Model with Valence (DMV)

(((List (the fares (for ((flight) (number 891)))))) .)

partial bracketings (Pereira and Schabes, 1992)

– synchronous grammars (Alshawi and Douglas, 2000)
– linear-time parsing (Seginer, 2007)
– skewness of trees (Seginer, 2007)
– Zipfian distribution of words (Seginer, 2007)
– sparse posterior regularization (Ganchev et al., 2009)
– web markup-induced constraints (Spitkovsky et al., 2010)
– semantic cues (Naseem and Barzilay, 2011)
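To make the idea of a bracketing constraint concrete, the Python sketch below reads the bracketed example from this slide into token spans and checks whether a candidate span crosses any bracket. This is a generic "no crossing brackets" test in the spirit of partial bracketings, written under my own assumptions; the function names are hypothetical and it is not the punctuation constraint defined in the talk.

```python
import re


def bracket_spans(text):
    """Return the set of (start, end) token spans, end exclusive, one per bracket pair."""
    spans, stack, position = set(), [], 0
    for token in re.findall(r"[()]|[^\s()]+", text):
        if token == "(":
            stack.append(position)            # remember where this bracket opens
        elif token == ")":
            spans.add((stack.pop(), position))
        else:
            position += 1                      # an ordinary word
    return spans


def crosses(a, b):
    """True if spans a and b overlap without either one containing the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def compatible(span, brackets):
    """True if `span` crosses none of the given brackets."""
    return not any(crosses(span, b) for b in brackets)


# Tokens: List(0) the(1) fares(2) for(3) flight(4) number(5) 891(6) .(7)
BRACKETS = bracket_spans("(((List (the fares (for ((flight) (number 891)))))) .)")

print(compatible((1, 3), BRACKETS))  # "the fares"
print(compatible((2, 5), BRACKETS))  # "fares for flight"
```

A constrained parser would discard (or refuse to build) any candidate whose span fails this check, which shrinks the search space without changing the model itself.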

Experimental Results: Unlexicalized

directed dependency accuracies for baselines, inference, training and an oracle:

WSJ∞ (Section 23, all sentences)
Punctuation as Words            41.7  (-10.3)
Standard Training               52.0
  w/Constrained Inference       54.0  (+2.0)
Constrained Training            55.6  (+3.6)
  w/Constrained Inference       57.4  (+1.8)
Supervised DMV                  69.8
  w/Constrained Inference       73.0  (+3.2)
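For readers unfamiliar with the metric, directed dependency accuracy is the percentage of tokens whose predicted head matches the gold head. The sketch below is a minimal Python version under my own assumptions: heads are 1-based token indices with 0 marking the root, and the function name and toy sentences are hypothetical. Details of the talk's evaluation protocol (for example, how punctuation tokens are scored) are not modeled here.

```python
def directed_accuracy(sentences):
    """Percentage of tokens whose predicted head equals the gold head.

    sentences: iterable of (gold_heads, predicted_heads) pairs, one pair per
    sentence; each element is a list with one head index per token.
    """
    correct = total = 0
    for gold, predicted in sentences:
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted analyses differ in length")
        correct += sum(g == p for g, p in zip(gold, predicted))
        total += len(gold)
    return 100.0 * correct / total


# Two toy sentences: one wrong head out of five tokens.
CORPUS = [
    ([2, 0, 2], [2, 0, 1]),
    ([0, 1], [0, 1]),
]
print(directed_accuracy(CORPUS))
```

The score is pooled over tokens rather than averaged per sentence, so long sentences weigh more heavily.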

Experimental Results: Lexicalized

directed dependency accuracies compared to previous state-of-the-art

WSJ∞
(Spitkovsky et al., ACL 2010)             50.4
Lexicalized (Gillenwater et al., 2010)    53.3
Lexicalized (Blunsom and Cohn, 2010)      55.7
Unlexicalized                             57.4
Lexicalized Constrained Training          58.0
  w/Constrained Inference                 58.4

Experimental Results: “Fully” Unsupervised

constraints sufficiently strong to abandon gold tags

WSJ∞
(this work)       58.4
w/o Gold Tags     58.2

using Clark’s (2000) unsupervised clusters
– constructed by Finkel and Manning (2009) for NER

http://nlp.stanford.edu/software/stanford-postagger-2008-09-28.tar.gz: models/egw.bnc.200

(Come see our poster at EMNLP!)

Experimental Results: Multi-Lingual

further evaluation against CoNLL 2006/7 data sets
– results generalize across languages:

                     Inference Only    Training & Inference
Arabic      2006         +0.1                +1.1
            ’7           +0.9                +2.6
Basque      ’7           +0.8                +0.6
Bulgarian   ’6           +1.1                +1.6
Catalan     ’7           +0.8                +0.9
Czech       ’6           +0.9                +3.0
            ’7           +1.0                +2.7
Danish      ’6           +0.9                +0.2
Dutch       ’6           +1.0                +3.0
English     ’7           +1.3                +2.8
German      ’6           +0.8                +1.6
Greek       ’7           +0.5                +0.7
Hungarian   ’7           +0.4                +1.4
Italian     ’7           +0.1                -0.8
Japanese    ’6           +0.0                +0.1
Portuguese  ’6           +0.7                +0.8
Slovenian   ’6           +2.0                +2.8
Spanish     ’6           +0.8                +0.8
Swedish     ’6           +0.5                +0.8
Turkish     ’6           +0.1                +1.0
            ’7           +0.2                +0.1

Average:                 +0.7                +1.3
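The reported averages can be cross-checked directly from the per-language gains. The short Python sketch below assumes the averages are unweighted means over the 21 data sets listed on the slide; the variable names are mine, and the values are copied from the table.

```python
# (data set, inference-only gain, training-and-inference gain), as listed on the slide.
GAINS = [
    ("Arabic 2006", 0.1, 1.1), ("Arabic 2007", 0.9, 2.6),
    ("Basque 2007", 0.8, 0.6), ("Bulgarian 2006", 1.1, 1.6),
    ("Catalan 2007", 0.8, 0.9), ("Czech 2006", 0.9, 3.0),
    ("Czech 2007", 1.0, 2.7), ("Danish 2006", 0.9, 0.2),
    ("Dutch 2006", 1.0, 3.0), ("English 2007", 1.3, 2.8),
    ("German 2006", 0.8, 1.6), ("Greek 2007", 0.5, 0.7),
    ("Hungarian 2007", 0.4, 1.4), ("Italian 2007", 0.1, -0.8),
    ("Japanese 2006", 0.0, 0.1), ("Portuguese 2006", 0.7, 0.8),
    ("Slovenian 2006", 2.0, 2.8), ("Spanish 2006", 0.8, 0.8),
    ("Swedish 2006", 0.5, 0.8), ("Turkish 2006", 0.1, 1.0),
    ("Turkish 2007", 0.2, 0.1),
]


def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


inference_avg = round(mean(row[1] for row in GAINS), 1)
training_avg = round(mean(row[2] for row in GAINS), 1)
print(inference_avg, training_avg)

# Data sets where constrained training plus inference did not help.
print([name for name, _, gain in GAINS if gain < 0])
```

Only one data set shows a loss with constrained training and inference, and none shows a loss with constrained inference alone.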

Thoughts:

extend existing parsers
– no need to retrain models
– supervised systems?

would prosody aid with induction from speech?
– “as words” breaks n-grams (Kahn et al., 2005)

Summary:

punctuation helps dependency grammar induction
– even better than markup...

a popular approach: powerful models
– priors prevent overfitting

an alternative: overly simple models
– constraints prevent underfitting

Thanks!

Punctuation. It works...

Any questions?
