Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing

Valentin I. Spitkovsky
with Daniel Jurafsky (Stanford University) and Hiyan Alshawi (Google Inc.)

Spitkovsky et al. (Stanford & Google) Profiting from Mark-Up ACL (2010-07-14) 1 / 33

Overview

Constraints: Supervised and Unsupervised

compact summaries of high-level insights into a domain
  - e.g., a sentence should have a verb
  - can significantly reduce the search space
  - easier to list than annotating data (a few key rules)
  - enforce rather than model (avoid non-local features)
  - enable simple, painless NLP, e.g., joint inference

“Integer Linear Programming in NLP” Tutorial (Chang et al., 2010)
http://l2r.cs.uiuc.edu/~danr/Talks/ILP-CCM-Tutorial-NAACL10.pdf

relevant to unsupervised learning (less rope to hang self)
  - inherently underconstrained problems...
  - in general, steer at the “right” regularities in data
  - specifically, useful for grammar (parser) induction
  - linguistic structure underdetermined by raw text

Constraints: Parser and Grammar Induction

the model, e.g., projective trees (Klein and Manning, 2004)
  - Dependency Model with Valence (DMV)
partial bracketings (Pereira and Schabes, 1992)
  (((List (the fares (for ((flight) (number 891)))))) .)
synchronous grammars (Alshawi and Douglas, 2000)
linear-time parsing (Seginer, 2007)
skewness of trees (Seginer, 2007)
Zipfian distribution of words (Seginer, 2007)
sparse posterior regularization (Ganchev et al., 2009)
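The bracketed example sentence on this slide can be read as a set of token spans that a candidate parse should not cross. The following is a minimal sketch, assuming whitespace tokenization and half-open spans. The names `bracket_spans`, `crosses`, and `compatible` are hypothetical and do not come from the talk, and the no-crossing test is the usual notion of compatibility with a partial bracketing rather than a reproduction of any cited system.

```python
def bracket_spans(text):
    """Turn a parenthesized string into (tokens, spans).

    Spans are half-open (start, end) token ranges, one per bracket pair.
    Assumption: tokens are separated by whitespace or parentheses.
    """
    tokens, spans, stack = [], set(), []
    for piece in text.replace("(", " ( ").replace(")", " ) ").split():
        if piece == "(":
            stack.append(len(tokens))          # remember where the bracket opens
        elif piece == ")":
            if not stack:
                raise ValueError("unbalanced ')'")
            spans.add((stack.pop(), len(tokens)))
        else:
            tokens.append(piece)
    if stack:
        raise ValueError("unbalanced '('")
    return tokens, spans


def crosses(a, b):
    """True if spans a and b overlap without one containing the other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def compatible(span, brackets):
    """True if span crosses none of the given brackets."""
    return not any(crosses(span, b) for b in brackets)


tokens, spans = bracket_spans(
    "(((List (the fares (for ((flight) (number 891)))))) .)")
```

With this bracketing, the span covering "the fares" is compatible, while the span covering "fares for flight" crosses the bracket around "for flight number 891" and would be ruled out.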

Constraints: Partial Bracketings

play well with EM (Pereira and Schabes, 1992)
  - improved time complexity per iteration
  - fewer iterations to reach a good grammar
  - better agreement with qualitative judgments
problem: requires supervision (worst case: parse trees)
how to make it work, in the absence of a treebank?
  - more, partially annotated corpora:
      - English POS chunking (Chen and Lee, 1995)
      - Japanese clause splitting (Inui and Kotani, 2001)
  - our approach:
      - would like to scale up to the web anyway
      - use what’s at hand (Verne, 1873; 1972)
        Phileas Fogg
      - HTML structure
solution: mark-up!
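One way to see why bracketing constraints make each EM iteration cheaper is to count how many analyses remain once crossing spans are forbidden. The sketch below is illustrative only: it counts binary constituent bracketings rather than the dependency structures the talk is about, the function name `count_binary_bracketings` is hypothetical, and the brackets passed in the usage lines are the spans of the fully bracketed example sentence from the earlier slide, written as half-open token ranges.

```python
from functools import lru_cache


def count_binary_bracketings(n, brackets=()):
    """Count binary bracketings of n tokens whose spans cross no given bracket.

    brackets: iterable of half-open (start, end) token spans to respect.
    """
    brackets = tuple(brackets)

    def allowed(i, j):
        # A span is allowed if it does not partially overlap any bracket.
        return not any(i < k < j < l or k < i < l < j for k, l in brackets)

    @lru_cache(maxsize=None)
    def count(i, j):
        if j - i == 1:                      # a single token has one analysis
            return 1
        return sum(count(i, k) * count(k, j)
                   for k in range(i + 1, j)
                   if allowed(i, k) and allowed(k, j))

    return count(0, n)


example = [(0, 8), (0, 7), (1, 7), (3, 7), (4, 7), (4, 5), (5, 7)]
unconstrained = count_binary_bracketings(8)
constrained = count_binary_bracketings(8, example)
```

For the eight-token example, the unconstrained count is 429 (a Catalan number), while only 2 binary bracketings avoid crossing the given brackets. Real mark-up supplies far fewer brackets than a full parse, so the reduction in practice would be smaller.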

Web Mark-Up: Diamonds in the Rough

suggestive example:
  ..., whereas McCain is secure on the topic, Obama <a>[VP worries about winning the pro-Israel vote]</a>.
intuition: diamonds in the rough
natural language pre-processing (NLPP?):
  - stripping out HTML is an ugly chore...
  - instead of rushing to discard it, try polishing!
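Instead of stripping tags, a pre-processing step could record where anchor text begins and ends in the token stream. The sketch below uses Python's standard `html.parser` module and naive whitespace tokenization. The class `AnchorSpans` and the helper `anchor_spans` are hypothetical names, and this is not the pipeline used in the talk, only an illustration of keeping `<a>` boundaries as candidate bracketings.

```python
from html.parser import HTMLParser


class AnchorSpans(HTMLParser):
    """Collect whitespace tokens and the half-open token spans inside <a>...</a>."""

    def __init__(self):
        super().__init__()
        self.tokens = []     # all text tokens, mark-up removed
        self.spans = []      # (start, end) token indices of anchor text
        self._open = []      # stack of start indices for open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open.append(len(self.tokens))

    def handle_endtag(self, tag):
        if tag == "a" and self._open:
            self.spans.append((self._open.pop(), len(self.tokens)))

    def handle_data(self, data):
        self.tokens.extend(data.split())


def anchor_spans(html):
    """Return (tokens, spans) for an HTML fragment."""
    parser = AnchorSpans()
    parser.feed(html)
    parser.close()           # flush any buffered trailing text
    return parser.tokens, parser.spans


tokens, spans = anchor_spans(
    'whereas McCain is secure on the topic, Obama '
    '<a href="#">worries about winning the pro-Israel vote</a>.')
```

For this fragment the single anchor span covers the tokens "worries about winning the pro-Israel vote", which is the verb phrase marked in the slide's example.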

Overview

Outline:

structure of this talk (as guided by the time constraints):

1 linguistic analysis of a single blog

— is there a connection between syntax and mark-up?— yes... (but what is it? and is it useful?)

2 proposed parsing constraints, refined from mark-up

3 experimental results for unsupervised dependency parsing— parsed the web— but you don’t have to...— best results with just the blog— ... web news also state-of-the-art

Spitkovsky et al. (Stanford & Google) Profiting from Mark-Up ACL (2010-07-14) 6 / 33

Page 57: Profiting from Mark-Up - Stanford NLP Groupnlp.stanford.edu/pubs/markup-slides.pdf · Spitkovsky et al. (Stanford & Google) Profiting from Mark-Up ACL (2010-07-14) 4 / 33. Overview

Overview

Outline:

structure of this talk (as guided by the time constraints):

1 linguistic analysis of a single blog

— is there a connection between syntax and mark-up?— yes... (but what is it? and is it useful?)

2 proposed parsing constraints, refined from mark-up

3 experimental results for unsupervised dependency parsing
— parsed the web
— but you don’t have to...
— best results with just the blog
— ... web news also state-of-the-art

minor yet recurring theme: less is more

dropped details:

— model: Dependency Model with Valence (DMV) [POS tags] (Klein and Manning, 2004)
— learning engine: Viterbi EM (not Inside-Outside) [CoNLL] (Spitkovsky et al., 2010)
— methodology: experimental design (hundreds of runs) [ACL] (Spitkovsky et al., 2010)

Data:

variety of data-set sizes and genres:
— from biggest to smallest, from messiest to cleanest

1 English web http://google.com/en
— nearly 100B POS tokens
— TnT-tagged (Brants, 2000)

2 web news http://news.google.com/
— about 30B tokens

3 political opinion blog http://danielpipes.org/
— a little over 1M tokens
— manually cleaned up (for analysis)
⋆ Charniak-parsed (Charniak and Johnson, 2005)
⋆ Stanford-tagged (Toutanova et al., 2003)

4 WSJ — just over 1M tokens (Marcus et al., 1993)

5 Brown — under 400K tokens (Francis and Kucera, 1979)

Linguistic Analysis

Syntax of Mark-Up: POS Sequences <a, b, i, u>

POS sequence            %
NNP NNP              16.1
NNP                   8.3
NNP NNP NNP           5.4
NN                    5.4
JJ NN                 2.6
DT NNP NNP            1.8
NNS                   1.8
JJ                    1.5
VBD                   1.3
DT NNP NNP NNP        1.2
JJ NNS                1.1
NNP NN                1.0
NN NN                 1.0
VBN                   0.8
NNP NNP NNP NNP       0.8
total                50.0

Syntax of Mark-Up: Dominating Non-Terminals

non-terminal     %
NP            74.5
VP            12.9
S              6.8
PP             1.6
ADJP           0.9
FRAG           0.8
ADVP           0.5
SBAR           0.5
PRN            0.2
NX             0.2
total         99.0

Syntax of Mark-Up: Common Constituents

..., but [S [NP the <a>Toronto Star] [VP reports [NP this] [PP in the softest possible way]</a>, [S stating ...]]]

S → NP VP
  → DT NNP NNP VBZ NP PP S

Syntax of Mark-Up: Constituent Productions

production                    %
NP → NNP NNP                9.6
NP → NNP                    4.6
NP → NP PP                  3.4
NP → NNP NNP NNP            2.4
NP → DT NNP NNP             2.1
NP → NN                     1.8
NP → DT NNP NNP NNP         1.7
NP → DT NN                  1.7
NP → DT NNP NNP             1.6
S → NP VP                   1.4
NP → DT NNP NNP NNP         1.2
NP → DT JJ NN               1.1
NP → NNS                    1.0
NP → JJ NN                  0.8
NP → NP NP                  0.8
total                      35.3

Syntax of Mark-Up: Common Dependencies

..., but [S [NP the <a>Toronto Star] [VP reports [NP this] [PP in the softest possible way]</a>, [S stating ...]]]

DT NNP NNP VBZ DT IN DT JJS JJ NN

DT NNP VBZ

“the <a>Star reports</a>”

Syntax of Mark-Up: Head-Outward Spawns

spawn               %
NNP              24.4
NN                8.1
DT NNP            6.1
DT NN             5.9
NNS               4.5
NNPS              1.4
VBG               1.3
NNP NNP NN        1.2
VBD               1.0
IN                1.0
VBN               1.0
DT JJ NN          0.9
VBZ               0.9
POS NNP           0.9
JJ                0.8
total            59.4

Syntax of Mark-Up: Exception

... [NP a 1994 <i>New Yorker</i> article] ...

consequence of bare NPs
— ... and “head percolation” rules

Syntax of Mark-Up: Summary

not just single words (lots of long noun phrases)

some verbs, adjectives, etc. (i.e., not just nouns)

apparent agreement with constituents

and also with dependencies

— but is there enough mark-up?

11% of all sentences in the blog are annotated

9% have multi-token bracketings

Proposed Constraints

Proposed Constraints: Constituents?

48.0% agreement with Charniak’s trees, e.g.,

... in [NP <a>[NP an analysis]</a> [PP of perhaps the most astonishing PC item I have yet stumbled upon]].

these are rough diamonds...

many disagreements due to treebank idiosyncrasies:
— bare NPs (internal structure)
— N-bars (missing determiners)

... but we’ll polish them anyway!
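
One way to make "agreement" concrete is to ask whether a marked-up span coincides exactly with some constituent of the parse tree. The sketch below illustrates that reading only; it is an assumption, since the slides do not spell out the matching rule. The nested-list tree encoding and the names `constituent_spans` and `bracket_agrees` are hypothetical, and the internal structure of the PP is flattened for brevity.

```python
from typing import Set, Tuple, Union

Tree = Union[str, list]  # a leaf is a token; an internal node is [label, child, ...]
Span = Tuple[int, int]   # half-open token span [start, end)


def constituent_spans(tree: Tree, start: int = 0) -> Tuple[Set[Span], int]:
    """Collect the spans of all internal nodes; also return the end position."""
    if isinstance(tree, str):
        return set(), start + 1
    spans: Set[Span] = set()
    pos = start
    for child in tree[1:]:  # tree[0] is the node label
        child_spans, pos = constituent_spans(child, pos)
        spans |= child_spans
    spans.add((start, pos))
    return spans, pos


def bracket_agrees(tree: Tree, bracket: Span) -> bool:
    """Assumed rule: a bracket agrees if it matches a constituent span exactly."""
    spans, _ = constituent_spans(tree)
    return bracket in spans


# Hand-built trees for two examples from the slides (PP internals omitted).
analysis = ["NP",
            ["NP", "an", "analysis"],
            ["PP", "of", "perhaps", "the", "most", "astonishing", "PC",
             "item", "I", "have", "yet", "stumbled", "upon"]]
new_yorker = ["NP", "a", "1994", "New", "Yorker", "article"]

print(bracket_agrees(analysis, (0, 2)))    # <a>an analysis</a>: True
print(bracket_agrees(new_yorker, (2, 4)))  # <i>New Yorker</i> in a flat NP: False
```

The second case shows how a bare (flat) NP makes a perfectly sensible bracket count as a disagreement.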

Proposed Constraints: Dependencies!

a more stylistically-forgiving framework

start with the strictest possible constraint

then slowly relax it

every example demonstrating a softer constraint doubles as a counter-example against all previous

Proposed Constraints: Strict

seal mark-up into attachments, e.g.,

As author of <i> The Satanic Verses </i>, I ...

— just 35.6% agreement with head-percolated trees

Proposed Constraints: Loose

allow the bracket's head word to take external dependents, e.g.,

... the <i> Toronto Star </i> reports ...

— already 87.5% agreement with head-percolated trees

Proposed Constraints: Sprawl

allow all bracketed words to take external dependents, e.g.,

... the <a> Toronto Star reports ... </a> ...

— now 95.1% agreement with head-percolated trees

Proposed Constraints: Tear

allow mark-up to fracture, provided the external heads of the pieces lie on the same side, e.g.,

... concession ... has raised eyebrows among those waiting [PP for <a> Fox News ][PP in Canada ]</a>.

— finally, 98.9% agreement with head-percolated trees

Proposed Constraints: Summary

remaining 1.1% mostly due to parser errors...

... found one (very rare) true negative disagreement

a suite of highly (88%, 95%, 99%) accurate constraints,

... of varying degrees of informativeness

first two can easily guide Viterbi training!
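One way to picture a hard constraint steering Viterbi training: at each re-estimation step, keep the best-scoring tree that respects the mark-up rather than the best tree overall. This is a minimal sketch under strong assumptions, not the authors' implementation. Candidate trees are taken as already enumerated with model scores, whereas a real parser would enforce the constraint inside its chart. The names `pick_viterbi_tree` and `one_external_arc`, the fallback behaviour, and the toy scores are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Tree = Sequence[int]  # heads[i] is the index of word i's head; -1 marks the root


def one_external_arc(heads: Tree, lo: int, hi: int) -> bool:
    """True if exactly one word in the bracket lo..hi-1 has its head outside it."""
    return sum(1 for i in range(lo, hi) if not (lo <= heads[i] < hi)) == 1


def pick_viterbi_tree(candidates: List[Tuple[float, Tree]],
                      satisfies: Callable[[Tree], bool]) -> Tree:
    """Return the best-scoring candidate that passes `satisfies`.

    Falling back to the unconstrained best when nothing passes is an
    assumption of this sketch, not something the slides specify."""
    allowed = [c for c in candidates if satisfies(c[1])]
    pool = allowed or candidates
    return max(pool, key=lambda c: c[0])[1]


# Toy candidates for "the Toronto Star reports", bracket = words 1..2.
# Scores are made-up log-probabilities.
candidates = [
    (-2.0, [3, 3, 3, -1]),  # flat tree: Toronto and Star both attach outside
    (-2.5, [2, 2, 3, -1]),  # Star heads the bracket
]
best = pick_viterbi_tree(candidates, lambda t: one_external_arc(t, 1, 3))
print(best)  # [2, 2, 3, -1]
```

The lower-scoring tree wins because the flat tree sends two arcs out of the bracket; the selected tree is what the next re-estimation step would count.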

Experimental Results

Experimental Results: Dependency Accuracy (%)

Incarnation                  WSJ10   WSJ∞   Brown100
(Cohen and Smith, 2009)       62.0   42.2
(Spitkovsky et al., 2010)     57.1   45.0       43.6
(Headden et al., 2009)        68.8
BLOG                          69.3   50.4       53.3
                              +0.5   +5.4       +9.7

state-of-the-art results

linguistic constraints help with the task!
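The gains in the last row of the table are consistent with reading each one as the BLOG score minus the best earlier score in the same column. A quick arithmetic check under that reading (the reading itself is an assumption; the numbers are copied from the table, with `None` for cells the slide leaves empty):

```python
# Dependency accuracies (%) from the table; None marks an empty cell.
prior = {
    "Cohen and Smith, 2009":   (62.0, 42.2, None),
    "Spitkovsky et al., 2010": (57.1, 45.0, 43.6),
    "Headden et al., 2009":    (68.8, None, None),
}
blog = (69.3, 50.4, 53.3)
columns = ("WSJ10", "WSJ∞", "Brown100")


def gains(prior, new):
    """Per-column difference between `new` and the best prior result."""
    out = []
    for col in range(len(new)):
        best = max(row[col] for row in prior.values() if row[col] is not None)
        out.append(round(new[col] - best, 1))
    return out


for name, delta in zip(columns, gains(prior, blog)):
    print(f"{name}: +{delta}")
```

On the slide's numbers this gives +0.5, +5.4 and +9.7, matching the table.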

Experimental Results: Dependency Accuracy (%)

        WSJ10   WSJ∞   Brown100
BLOG     69.3   50.4       53.3
NEWS     67.3   50.1       51.6
WEB      64.1   46.3       46.9

no need to manually clean data!

nevertheless, less is more...

loose constraint consistently delivers best results

requires domain adaptation (re-training on WSJ)

perhaps bigger gains if lexicalized?

Experimental Results: Why Didn’t the Web Help?

language identification, sentence-breaking

boiler-plate, POS-tagging:

    POS Sequence           WEB Count     Sample web sentence, chosen uniformly at random
1   DT NNS VBN             82,858,487    All rights reserved.
2   NNP NNP NNP            65,889,181    Yuasa et al.
3   NN IN TO VB RB         31,007,783    Sign in to YouTube now!
4   NN IN IN PRP$ JJ NN    31,007,471    Sign in with your Google Account!

ambiguous noun phrases: “click here” and “print post”
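A table like the one above can be produced by counting whole-sentence tag sequences, so that boiler-plate shows up as a few very frequent patterns. The following is a minimal sketch, assuming sentences arrive already tagged as lists of (word, tag) pairs; `top_pos_sequences` and the three-sentence toy corpus are illustrative and not part of the authors' pipeline.

```python
from collections import Counter
from typing import Iterable, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # [(word, POS tag), ...]


def top_pos_sequences(tagged: Iterable[TaggedSentence], k: int = 4):
    """Return the k most frequent whole-sentence POS-tag sequences."""
    counts = Counter(" ".join(tag for _, tag in sent) for sent in tagged)
    return counts.most_common(k)


# Toy corpus reusing the (mis-)taggings shown on the slide.
corpus = [
    [("All", "DT"), ("rights", "NNS"), ("reserved", "VBN")],
    [("Sign", "NN"), ("in", "IN"), ("to", "TO"), ("YouTube", "VB"), ("now", "RB")],
    [("All", "DT"), ("rights", "NNS"), ("reserved", "VBN")],
]
for seq, n in top_pos_sequences(corpus):
    print(n, seq)
```

Sequences that dominate the counts are candidates for filtering before training, though any threshold would need to be tuned on the data.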

Conclusion

Summary

strong connection between mark-up and syntax

state-of-the-art unsupervised dependency parsing

other parsing applications:

— supervised parsing (via self-training)

— constituent parsing (via discriminative features)

— balanced punctuation? (e.g., quotes and parens)

Potential

another motivating example:

[NP [NP Libyan ruler] <a>[NP Mu‘ammar al-Qaddafi]</a>]

— internal structure of a compound (Vadas and Curran, 2007)

— lower-level tokenization signal

http://nlp.stanford.edu:8080/parser/

(NP (ADJP (NP (JJ Libyan) (NN ruler)) (JJ Mu))
    (`` `) (NN ammar) (NNS al-Qaddafi))

Potential

other structured tasks in NLP:

— NE-tagging

— NP-chunking

— CJK-segmentation

— sentence-breaking

... and so forth!

Open Questions:

does this generalize to other genres?

does this generalize to other languages?

what would be the impact of lexicalization?

are there broader NLP implications?

What We Make Available:

all of our cleaned-up annotations of the blog

a complete analysis of every annotated sentence

and the best (blog) model

http://cs.stanford.edu/~valentin/

Thanks!

Questions?

Proposed Constraints: Exception

remaining 1.1% mostly due to parser errors...

a (very rare) true negative disagreement:

The French broadcasting authority, <a>CSA, banned... Al-Manar</a> satellite television from ...
