
Page 1

Statistical Grammar Induction, LSA 2007
Lecture 2: Syntax I
Dan Klein – UC Berkeley

Recap
Last time: probabilistic models for acoustics and segmentation; alternating re-estimation for learning (not the only way); bad model assumptions lead to bad learned structure.
Phonemes: phonotactic context was not enough to learn natural classes; modeling sequential structure and acoustics gave more interesting latent structure.
Segmentation: assuming a uniform prior on words caused trivial solutions; a very natural and minimal preference for short words fixed the problem; modeling bigram structure fixed issues of collocational overlexicalization.

Bigram Segmentations
From [Goldwater 2006]

Unigram Segmentations vs. Bigram Segmentations

Some Cognitive Evidence
Distributional / transitional cues influence segmentation of audio streams by children [Saffran et al., 1996]. Their evidence is very compatible with these kinds of models, but is analyzed in a more procedural way. They also analyze other possible statistical cues. Roger Levy will be discussing their results and the relation to human child language segmentation in his class on psycholinguistics.

More Phonology Learning
Phonology and Inductive Bias [Gildea and Jurafsky, 1995]: general-purpose learning of phonemic-to-phonetic transcription fails (or at least one algorithm did); adding bias allows learning of common rules; the biases are very general: faithfulness, community, context.
Historical Reconstruction [Bouchard-Côté et al. 2007]: a stochastic edit model captures sound changes and models individual lexical items through a phylogeny. Begin with guesses at missing forms, learn the edit model, re-estimate, repeat…
Let me know if you’d like coverage of these topics…

Learning Lexicons
How might we learn to organize lexicons? Syntactic classes (parts of speech -- today); semantic classes / form-meaning mappings; many variants (words to images, words to semantic forms, …). We’ll return to this later in the course.

Syntactic class learning: a very challenging task; obvious approaches don’t work well, but distributional methods are very effective.

Page 2

HMMs for POS Tagging
Probabilistic induction methodology: state a reasonable model including both observed and unobserved variables, then do inference over the unobserved ones (e.g. EM).
Hidden Markov models (HMMs)
[Figure: an HMM as a chain of tag states T, each emitting a word W]

EM for HMMs
Also called the Baum-Welch procedure. Requires finding the posterior expected counts of each tag at each position (and pair of positions), using the forward-backward algorithm. Can still think of hard EM (find the best tag sequence, fix it, re-estimate, repeat).
If you simply let EM loose unconstrained on a few million words of data, it will learn a garbage HMM…
Should we be concerned?
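As a concrete illustration of the hard-EM variant mentioned above (not code from the lecture), here is a minimal Python sketch: tag each sentence with Viterbi under the current parameters, then re-estimate the transition and emission tables by relative frequency and repeat. The function names, dictionary layouts, and the 1e-10 floor for unseen events are all assumptions made for the sketch.

```python
import math
from collections import defaultdict

def viterbi(words, tags, trans, emit):
    """Best tag sequence under the current parameters (log space)."""
    def t_p(p, t): return math.log(trans.get(p, {}).get(t, 1e-10))
    def e_p(t, w): return math.log(emit.get(t, {}).get(w, 1e-10))
    V = [{t: e_p(t, words[0]) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({}); back.append({})
        for t in tags:
            best = max(tags, key=lambda p: V[i - 1][p] + t_p(p, t))
            back[i][t] = best
            V[i][t] = V[i - 1][best] + t_p(best, t) + e_p(t, words[i])
    seq = [max(tags, key=lambda t: V[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))

def hard_em(corpus, tags, trans, emit, iters=10):
    """Hard EM: Viterbi-tag every sentence, re-estimate P(t | t-1) and
    P(w | t) by relative frequency, and repeat."""
    for _ in range(iters):
        tc = defaultdict(lambda: defaultdict(float))
        ec = defaultdict(lambda: defaultdict(float))
        for words in corpus:
            seq = viterbi(words, tags, trans, emit)
            for w, t in zip(words, seq):
                ec[t][w] += 1
            for prev, t in zip(seq, seq[1:]):
                tc[prev][t] += 1
        trans = {p: {t: c / sum(ts.values()) for t, c in ts.items()}
                 for p, ts in tc.items()}
        emit = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
                for t, ws in ec.items()}
    return trans, emit
```

Soft EM (Baum-Welch) replaces the Viterbi step with forward-backward expected counts; the unconstrained behavior described above shows up in either variant.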

Merialdo: Setup
Some (discouraging) experiments [Merialdo 94].
Setup: we know the set of allowable POS tags for each word. Fix k training examples to their true labels, and learn P(w | t) and P(t | t-1, t-2) on these examples. Then, on n examples, re-estimate with EM.
Note: we know the allowed tags but not their frequencies.
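A minimal sketch of the supervised initialization step described here, assuming each labeled sentence is a list of (word, tag) pairs; the names are hypothetical and smoothing is omitted.

```python
from collections import defaultdict

def init_from_labeled(labeled):
    """Relative-frequency estimates of P(w | t) and P(t | t-2, t-1) from k
    hand-tagged sentences, plus the per-word tag dictionary that constrains
    the later EM re-estimation."""
    emit = defaultdict(lambda: defaultdict(float))
    trans = defaultdict(lambda: defaultdict(float))
    tag_dict = defaultdict(set)
    for sent in labeled:
        tags = ['<s>', '<s>'] + [t for _, t in sent]
        for w, t in sent:
            emit[t][w] += 1
            tag_dict[w].add(t)
        for t2, t1, t in zip(tags, tags[1:], tags[2:]):
            trans[(t2, t1)][t] += 1
    emit = {t: {w: c / sum(ws.values()) for w, c in ws.items()}
            for t, ws in emit.items()}
    trans = {h: {t: c / sum(ts.values()) for t, c in ts.items()}
             for h, ts in trans.items()}
    return emit, trans, tag_dict
```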

Merialdo: Results

Sequence Models?
[Figure: the sentence “the president said that the downturn was over” with a latent class c1 … c8 above each word]

Distributional Clustering
[Figure: words such as “governor”, “president”, “said”, “reported”, “the”, “a” grouped by shared contexts like “the __ of”, “the __ said”, “the __ appointed”, “sources __”, “president __ that”]
the president said that the downturn was over
[Finch and Chater 92, Schütze 93, many others]

Page 3

Distributional Clustering
Three main variants on the same idea:
Pairwise similarities and heuristic clustering, e.g. [Finch and Chater 92]; produces dendrograms.
Vector space methods, e.g. [Schütze 93]; models of ambiguity.
Probabilistic methods; various formulations, e.g. [Lee and Pereira 99].

Nearest Neighbors

Dendrograms

Context Distribution Clustering
Basic distributional clustering: characterize each word by its signature = its distribution over contexts (adjacent words), and group together words with similar signatures.
Problems: most words have sparse signatures even with a lot of data, and “adjacent words” is too superficial: consider “a” vs. “an”.
Solution [Clark 00]: use signatures over adjacent clusters (a circular definition!). Start with K+1 clusters (the K top words, plus “other”), move similar words from “other” into the K clusters, and recompute the signatures as the clusters change.
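A very simplified sketch in the spirit of this procedure (not Clark’s actual algorithm): seed K clusters with the K most frequent words, describe every word by a left/right context signature over cluster IDs, and repeatedly assign the remaining words to the closest seed cluster as the signatures change. The function names and the cosine-similarity choice are assumptions.

```python
from collections import Counter, defaultdict
import math

def cdc(sentences, K=10, iters=3):
    """Toy context-distribution clustering: non-seed words start in an
    'other' cluster (index K) and move to the seed cluster whose pooled
    context signature looks most like their own."""
    freq = Counter(w for s in sentences for w in s)
    seeds = [w for w, _ in freq.most_common(K)]
    cluster = {w: (seeds.index(w) if w in seeds else K) for w in freq}

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    for _ in range(iters):
        # signature of a word = counts of (side, cluster of adjacent word)
        sig = defaultdict(Counter)
        for s in sentences:
            padded = [None] + list(s) + [None]
            for i, w in enumerate(s, 1):
                for side, adj in (('L', padded[i - 1]), ('R', padded[i + 1])):
                    if adj is not None:
                        sig[w][(side, cluster[adj])] += 1
        # signature of a cluster = pooled signatures of its current members
        csig = defaultdict(Counter)
        for w in freq:
            csig[cluster[w]].update(sig[w])
        for w in freq:
            if w not in seeds:
                cluster[w] = max(range(K), key=lambda c: cosine(sig[w], csig[c]))
    return cluster
```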

Results: CDC

From [Clark 00]

Ambiguous Words

What’s Going On?
For phoneme structure, HMMs worked well; we needed more than local context.
For part-of-speech, HMMs work poorly; it seems to help to have only local context.
One explanation: phonetic coarticulation is a very linear phenomenon, syntax is not.
Another issue: there are lots of correlations an HMM over words can model – like what?

Page 4

Other Work
LOTS more work on learning POS tags. Some recent work: morphology-driven models [Clark 03], contrastive estimation [Smith and Eisner 05], Bayesian inference better than EM? [Johnson 07].
I’ll skip morphology learning for now (guest lecture on the topic later!)

Grammar Induction
Syntactic grammar induction: take a corpus and produce a grammar. Hard setting: a corpus of sentences only. Easy setting: a corpus of trees.
We’ll talk about the easy setting first! Even when you have trees, getting a really good grammar isn’t trivial… Supervised parsing, e.g. [Collins 99, Charniak 97, many others], but we’ll focus on the case where the refinements are induced automatically.

Probabilistic Context-Free Grammars

A context-free grammar is a tuple <N, T, S, R>:
N: the set of non-terminals. Phrasal categories: S, NP, VP, ADJP, etc. Parts of speech (pre-terminals): NN, JJ, DT, VB.
T: the set of terminals (the words).
S: the start symbol. Often written as ROOT or TOP; not usually the sentence non-terminal S.
R: the set of rules, of the form X → Y1 Y2 … Yk, with X, Yi ∈ N. Examples: S → NP VP, VP → VP CC VP. Also called rewrites, productions, or local trees.

A PCFG adds a top-down production probability per rule, P(Y1 Y2 … Yk | X).
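To make the definition concrete, a minimal sketch of a PCFG as a data structure, with toy probabilities invented for illustration, and the probability of a tree as the product of its rule probabilities:

```python
# Toy PCFG: each left-hand side maps to a distribution over right-hand sides,
# P(Y1 ... Yk | X).  Probabilities are invented for illustration only.
pcfg = {
    'ROOT': {('S',): 1.0},
    'S':    {('NP', 'VP'): 1.0},
    'NP':   {('DT', 'NN'): 0.7, ('PRP',): 0.3},
    'VP':   {('VBD', 'NP'): 0.6, ('VBD',): 0.4},
    'PRP':  {('he',): 1.0},
    'VBD':  {('was',): 1.0},
}

def tree_prob(tree):
    """P(tree) = product of the probabilities of its local trees.
    Trees are (label, children...) tuples with words as string leaves."""
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):            # lexical rule: tag -> word
        return pcfg[label][(children[0],)]
    rhs = tuple(c[0] for c in children)
    p = pcfg[label][rhs]
    for c in children:
        p *= tree_prob(c)
    return p

t = ('ROOT', ('S', ('NP', ('PRP', 'he')), ('VP', ('VBD', 'was'))))
print(tree_prob(t))    # 1.0 * 1.0 * 0.3 * 1.0 * 0.4 * 1.0 = 0.12
```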

Example: PCFGs

From Michael Collins

Example: PCFGs

Problems with PCFGs?
If we do no annotation, these trees differ only in one rule: VP → VP PP vs. NP → NP PP. The parse will go one way or the other, regardless of the words. Something in the model has to be sensitive to finer category structure if we’re going to disambiguate. Of course, you could put all the load on the semantics…

Page 5

Problems with PCFGs?

What’s different between basic PCFG scores here? What (lexical) correlations need to be scored?

PCFGs
PCFGs define a language just like CFGs (in principle). They assign probabilities to individual trees, which allows the selection of a best parse for a sentence, and they assign probabilities to sentences.

Language Models
We often want to place distributions over sentences. Think of these models as soft measures of fluency. Distinguish between the idea of a distribution over sentences and the particular ones we end up discussing.
Classic solution: n-gram models (we saw variants today). N-gram models are (weighted) regular languages. Natural language is not regular (of course!) … though you’d be surprised at what 5+ gram models trained on enough data can do.
This is a crude (and often useful in system building) model, but there are also language models with more linguistically plausible structure, e.g. PCFGs.
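For concreteness, here is a minimal sketch of where samples like those on the next slide come from: estimate a bigram model by relative frequency and sample one word at a time until an end-of-sentence token is drawn. The toy corpus and names are hypothetical.

```python
import random
from collections import defaultdict

def train_bigram(corpus):
    """Relative-frequency bigram model P(w | prev) with sentence boundaries."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        toks = ['<s>'] + sent + ['</s>']
        for prev, w in zip(toks, toks[1:]):
            counts[prev][w] += 1
    return {p: {w: c / sum(ws.values()) for w, c in ws.items()}
            for p, ws in counts.items()}

def sample(model, max_len=20):
    """Draw a sentence by sampling each next word from P(w | prev)."""
    sent, prev = [], '<s>'
    while len(sent) < max_len:
        words, probs = zip(*model[prev].items())
        w = random.choices(words, weights=probs)[0]
        if w == '</s>':
            break
        sent.append(w)
        prev = w
    return sent

corpus = [['the', 'downturn', 'was', 'over'],
          ['the', 'president', 'said', 'that', 'the', 'downturn', 'was', 'over']]
print(sample(train_bigram(corpus)))
```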

Language Model Samples
Unigram:
[fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, quarter]
[that, or, limited, the]
[]
[after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, ……]
Bigram:
[outside, new, car, parking, lot, of, the, agreement, reached]
[although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, share, data, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
[this, would, be, a, record, november]
PCFG:
[this, quarter, ‘s, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices]
[it, could, be, announced, sometime]
[mr., toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks]

Treebank Sentences

Treebank Parsing in 20 sec
Assume we have a treebank with coarse parses. We can take a grammar right off the trees (doesn’t work well); better results come from enriching the grammar.

ROOT → S (1)
S → NP VP . (1)
NP → PRP (1)
VP → VBD ADJP (1)
…
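A sketch of “taking the grammar right off the trees”: count every observed local tree and normalize per left-hand side. The tree encoding and names are assumptions; lexical (tag → word) rules are skipped for brevity.

```python
from collections import defaultdict

def read_off_grammar(treebank):
    """Relative-frequency PCFG from a corpus of trees encoded as
    (label, children...) tuples with words as string leaves."""
    counts = defaultdict(lambda: defaultdict(int))

    def visit(node):
        label, children = node[0], node[1:]
        if isinstance(children[0], str):
            return                                   # skip tag -> word rules
        counts[label][tuple(c[0] for c in children)] += 1
        for c in children:
            visit(c)

    for tree in treebank:
        visit(tree)
    return {lhs: {rhs: c / sum(rs.values()) for rhs, c in rs.items()}
            for lhs, rs in counts.items()}

toy = [('ROOT', ('S', ('NP', ('PRP', 'He')),
                 ('VP', ('VBD', 'was'), ('ADJP', ('JJ', 'right'))),
                 ('.', '.')))]
print(read_off_grammar(toy))
# {'ROOT': {('S',): 1.0}, 'S': {('NP', 'VP', '.'): 1.0}, 'NP': {('PRP',): 1.0}, ...}
```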

Page 6

[Figure with the symbols NP, PP, CONJ, DET, ADJ, NOUN, PLURAL NOUN]

Treebank Grammar Scale
Treebank grammars can be enormous! As a set of FSTs, the raw grammar has ~10K states (why?). Better parsers usually make the grammars larger, not smaller.

N-Ary Rules, Grammar States
We often observe grammar rules like VP → VBD NP PP PP, which are not binary. We can keep these rules or assume a more general process:
[Figure: the flat rule VP → VBD NP PP PP rebuilt with intermediate grammar states such as [VP → VBD NP …] and [VP → VBD NP PP …]]
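One way to spell out such a process is to binarize every n-ary rule with intermediate symbols that record the children collected so far. The sketch below uses a left-branching scheme purely for illustration; the exact scheme behind the slide’s intermediate states may differ.

```python
def binarize(lhs, rhs):
    """Left-branching binarization of an n-ary rule with intermediate
    symbols that record the children covered so far.  For example,
    binarize('VP', ['VBD', 'NP', 'PP', 'PP']) yields:
      ('[VP -> VBD NP]',    ('VBD', 'NP'))
      ('[VP -> VBD NP PP]', ('[VP -> VBD NP]', 'PP'))
      ('VP',                ('[VP -> VBD NP PP]', 'PP'))
    """
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules = []
    left = '[%s -> %s]' % (lhs, ' '.join(rhs[:2]))
    rules.append((left, (rhs[0], rhs[1])))
    for i in range(2, len(rhs) - 1):
        new = '[%s -> %s]' % (lhs, ' '.join(rhs[:i + 1]))
        rules.append((new, (left, rhs[i])))
        left = new
    rules.append((lhs, (left, rhs[-1])))
    return rules

for rule in binarize('VP', ['VBD', 'NP', 'PP', 'PP']):
    print(rule)
```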

PCFGs and Independence
Symbols in a PCFG define independence assumptions: at any node, the material inside that node is independent of the material outside that node, given the label of that node. Any information that statistically connects behavior inside and outside a node must flow through that node.
[Figure: an NP node inside an S, with the rules S → NP VP and NP → DT NN]

Non-Independence I
Independence assumptions are often too strong. Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects). Also: the subject and object expansions are correlated!

Expansion   All NPs   NPs under S   NPs under VP
NP PP       11%       9%            23%
DT NN       9%        9%            7%
PRP         6%        21%           4%
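These percentages are just conditional relative frequencies; a sketch of how one might compute them from a treebank (the tree encoding and names are assumptions):

```python
from collections import defaultdict

def np_expansions_by_parent(treebank):
    """Distribution over NP rewrites, conditioned on the NP's parent label.
    Trees are (label, children...) tuples with words as string leaves."""
    counts = defaultdict(lambda: defaultdict(int))

    def visit(node, parent):
        label, children = node[0], node[1:]
        if isinstance(children[0], str):
            return
        if label == 'NP':
            counts[parent][' '.join(c[0] for c in children)] += 1
        for c in children:
            visit(c, label)

    for tree in treebank:
        visit(tree, 'ROOT')
    return {p: {rhs: c / sum(rs.values()) for rhs, c in rs.items()}
            for p, rs in counts.items()}
```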

Breaking Up the Symbols
We can relax independence assumptions by encoding dependencies into the PCFG symbols: parent annotation [Johnson 98], marking possessive NPs, and so on. What are the most useful “features” to encode? (A small sketch of parent annotation follows.)
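A minimal sketch of parent annotation as a tree transform, assuming trees encoded as (label, children...) tuples; annotating only phrasal nodes here is a simplifying choice for the sketch, not a claim about exactly what [Johnson 98] annotated.

```python
def parent_annotate(tree, parent='ROOT'):
    """Mark each phrasal nonterminal with its parent's label (an NP under S
    becomes NP^S); leave preterminals and words unchanged."""
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):
        return tree
    return ('%s^%s' % (label, parent),) + tuple(
        parent_annotate(c, label) for c in children)

t = ('S', ('NP', ('PRP', 'He')),
         ('VP', ('VBD', 'was'), ('ADJP', ('JJ', 'right'))))
print(parent_annotate(t))
# ('S^ROOT', ('NP^S', ('PRP', 'He')),
#            ('VP^S', ('VBD', 'was'), ('ADJP^VP', ('JJ', 'right'))))
```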

The Game of Designing a Grammar
Annotation refines base treebank symbols to improve the statistical fit of the grammar: parent annotation [Johnson ’98].

Page 7

The Game of Designing a Grammar
Annotation refines base treebank symbols to improve the statistical fit of the grammar: parent annotation [Johnson ’98], head lexicalization [Collins ’99, Charniak ’00].

The Game of Designing a Grammar
Annotation refines base treebank symbols to improve the statistical fit of the grammar: parent annotation [Johnson ’98], head lexicalization [Collins ’99, Charniak ’00], automatic clustering?

Manual Annotation
Manually split categories: NP: subject vs. object; DT: determiners vs. demonstratives; IN: sentential vs. prepositional.
Advantages: fairly compact grammar; linguistic motivations.
Disadvantages: performance leveled out; manually annotated.

Model                     F1
Naïve Treebank Grammar    72.6
Klein & Manning ’03       86.3

Automatic Annotation Induction
Advantages: automatically learned; label all nodes with latent variables, with the same number k of subcategories for all categories.
Disadvantages: the grammar gets too large; most categories are oversplit while others are undersplit.

Model                   F1
Klein & Manning ’03     86.3
Matsuzaki et al. ’05    86.7

Learning Latent Annotations
EM algorithm:
[Figure: the parse tree of “He was right .” with latent subcategory variables X1 … X7 at its nodes, and forward/backward-style passes over the tree]
Brackets are known. Base categories are known. Only induce subcategories.
Just like Forward-Backward for HMMs.
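The E-step here is an inside-outside computation over each fixed tree. As a flavor, a sketch of just the inside (upward) pass over k latent subcategories, assuming binarized trees and parameter arrays laid out as described in the comments (all names and layouts are assumptions):

```python
import numpy as np

def inside(tree, k, rule_prob, emit_prob):
    """Inside scores over k subcategories for a tree with known brackets and
    base categories.  rule_prob[(A, B, C)] is a k x k x k array giving
    P(B_j C_l | A_i); emit_prob[(tag, word)] is a length-k vector P(word | tag_i).
    Trees are (label, children...) tuples with words as string leaves."""
    label, children = tree[0], tree[1:]
    if isinstance(children[0], str):                  # preterminal: tag -> word
        return emit_prob[(label, children[0])]
    left, right = children                            # assumes binarized trees
    b = inside(left, k, rule_prob, emit_prob)         # shape (k,)
    c = inside(right, k, rule_prob, emit_prob)        # shape (k,)
    P = rule_prob[(label, left[0], right[0])]         # shape (k, k, k)
    return np.einsum('ijl,j,l->i', P, b, c)           # sum out child subcategories
```

An analogous outside (downward) pass plus normalization gives the posterior over subcategories at every node, exactly as forward-backward does for positions in an HMM.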

Overview
[Plot: parsing accuracy (F1, roughly 65–90) against the total number of grammar symbols, one curve per split level k = 2, 4, 8, 16, annotated with the limit of computational resources]

Page 8

Refinement of the DT tag
[Figure: DT split into subcategories DT-1, DT-2, DT-3, DT-4]

Hierarchical Refinement

Refinement of the , tag
Splitting all categories the same amount is wasteful:

The DT tag revisited
Oversplit?

Adaptive Splitting
Want to split complex categories more. Idea: split everything, then roll back the splits which were least useful.

Page 9


Adaptive Splitting
Evaluate the loss in likelihood from removing each split:

loss = (data likelihood with the split reversed) / (data likelihood with the split)

There is no loss in accuracy when 50% of the splits are reversed.
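Schematically, working with log-likelihoods, the merge step might look like the sketch below: measure how much log-likelihood would be lost by reversing each split, and merge back the half with the smallest loss. The likelihood numbers would come from the trained grammar and are treated as given here; all names are hypothetical.

```python
def splits_to_merge(split_likelihoods, fraction=0.5):
    """Rank splits by how much log-likelihood is lost when they are undone,
    and return the least useful `fraction` of them.  `split_likelihoods`
    maps a split id to (loglik_with_split, loglik_with_split_reversed)."""
    loss = {s: with_split - reversed_
            for s, (with_split, reversed_) in split_likelihoods.items()}
    ranked = sorted(loss, key=loss.get)          # smallest loss first
    return ranked[:int(len(ranked) * fraction)]

# splits_to_merge({'DT-1/DT-2': (-1000.0, -1004.2),
#                  'X-1/X-2':   (-1000.0, -1000.1)})  ->  ['X-1/X-2']
```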

Adaptive Splitting Results
[Plot: F1 (roughly 74–90) against the total number of grammar symbols (100–1700), comparing flat training, hierarchical training, and hierarchical training with 50% merging]

Model               F1
Previous            88.4
With 50% Merging    89.5

Number of Phrasal Subcategories
[Bar chart: number of learned subcategories (0–40) for each phrasal category, from NP, VP, and PP down to ROOT and LST]

Number of Phrasal Subcategories
[Same chart, highlighting NP, VP, and PP as the most heavily split categories]

Number of Phrasal Subcategories
[Same chart, highlighting X and NAC among the least split categories]

Page 10

Number of Lexical Subcategories
[Bar chart: number of learned subcategories (0–70) for each part-of-speech tag]

Number of Lexical Subcategories
[Same chart, highlighting IN, DT, RB, and the VBx tags]

Number of Lexical Subcategories
[Same chart, highlighting NN, NNS, NNP, and JJ]

Final Results (Accuracy)

Lang   Model                                  F1 (≤ 40 words)   F1 (all)
ENG    Charniak & Johnson ’05 (generative)    90.1              89.6
ENG    This Work                              90.6              90.1
GER    Dubey ’05                              76.3              –
GER    This Work                              80.8              80.1
CHN    Chiang et al. ’02                      80.0              76.6
CHN    This Work                              86.3              83.4

Coming Up
Learning trees from yields alone: classic chunk / merge approaches, obvious approaches (that don’t work), and more recent methods that do work.
Formal issues: learnability results and what they mean; more on nativism vs. empiricism.