Statistical Grammar Induction, LSA 2007
Lecture 2: Syntax I
Dan Klein, UC Berkeley
Recap: Last time
- Probabilistic models for acoustics and segmentation
- Alternating re-estimation for learning (not the only way)
- Bad model assumptions lead to bad learned structure

Phonemes
- Phonotactic context not enough to learn natural classes
- Modeling sequential structure and acoustics gave more interesting latent structure

Segmentation
- Assuming a uniform prior on words caused trivial solutions
- A very natural and minimal preference for short words fixed the problem
- Modeling bigram structure fixed issues of collocational overlexicalization
Bigram Segmentations
From [Goldwater 2006]
Unigram Segmentations / Bigram Segmentations
Some Cognitive Evidence
- Distributional / transitional cues influence segmentation of audio streams by children [Saffran et al., 1996]
- Their evidence is very compatible with these kinds of models, but is analyzed in a more procedural way
- They also analyze other possible statistical cues
- Roger Levy will be discussing their results and the relation to human child language segmentation in his class on psycholinguistics
More Phonology Learning
Phonology and Inductive Bias [Gildea and Jurafsky, 1995]
- General-purpose learning of phonemic-to-phonetic transcription fails (or at least one algorithm did)
- Adding bias allows learning of common rules
- The biases are very general: faithfulness, community, context

Historical Reconstruction [Bouchard-Côté et al. 2007]
- Stochastic edit model captures sound changes
- Models individual lexical items through a phylogeny
- Begin with guesses at missing forms, learn the edit model, re-estimate, repeat…

Let me know if you'd like coverage of these topics…
Learning Lexicons
How might we learn to organize lexicons?
- Syntactic classes (parts of speech -- today)
- Semantic classes / form-meaning mappings
- Many variants (words to images, words to semantic forms, …)
- We'll return to this later in the course

Syntactic class learning
- Very challenging task
- Obvious approaches don't work well
- Distributional methods very effective
HMMs for POS Tagging
Probabilistic induction methodology:
- State a reasonable model including both observed and unobserved variables
- Do inference over the unobserved ones (e.g. EM)

Hidden Markov models (HMMs)
[Diagram: a chain of hidden tag states T, each emitting an observed word W]
EM for HMMs
- Also called the Baum-Welch procedure
- Requires finding the posterior expected counts of each tag at each position (and pair of positions), using the forward-backward algorithm
- Can still think of hard EM (find the best tag sequence, fix it, re-estimate, repeat)
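To make the Baum-Welch procedure concrete, here is a minimal sketch of one EM iteration for an HMM tagger. The toy tagset, vocabulary, and random initialization are illustrative assumptions, and the forward-backward pass is unscaled (fine for short sentences; real implementations use scaling or log space).

```python
# Minimal sketch of one EM (Baum-Welch) iteration for an HMM tagger.
import numpy as np

tags = ["DT", "NN", "VB"]          # hypothetical tag inventory
vocab = ["the", "dog", "barks"]    # hypothetical vocabulary
T, V = len(tags), len(vocab)

rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(T))             # P(t_1)
trans = rng.dirichlet(np.ones(T), size=T)  # P(t_i | t_{i-1}), rows sum to 1
emit = rng.dirichlet(np.ones(V), size=T)   # P(w_i | t_i), rows sum to 1

def forward_backward(obs):
    """Posterior tag marginals and tag-pair marginals for one sentence."""
    n = len(obs)
    alpha = np.zeros((n, T)); beta = np.zeros((n, T))
    alpha[0] = pi * emit[:, obs[0]]
    for i in range(1, n):
        alpha[i] = (alpha[i - 1] @ trans) * emit[:, obs[i]]
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):
        beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1])
    Z = alpha[-1].sum()                       # sentence likelihood
    gamma = alpha * beta / Z                  # P(t_i = t | sentence)
    xi = np.zeros((n - 1, T, T))              # P(t_i = s, t_{i+1} = t | sentence)
    for i in range(n - 1):
        xi[i] = alpha[i][:, None] * trans * (emit[:, obs[i + 1]] * beta[i + 1])[None, :] / Z
    return gamma, xi

# E-step over a (toy) corpus, then M-step by normalizing expected counts.
corpus = [[0, 1, 2]]                          # "the dog barks"
e_pi, e_trans, e_emit = np.zeros(T), np.zeros((T, T)), np.zeros((T, V))
for sent in corpus:
    gamma, xi = forward_backward(sent)
    e_pi += gamma[0]
    e_trans += xi.sum(axis=0)
    for i, w in enumerate(sent):
        e_emit[:, w] += gamma[i]
pi = e_pi / e_pi.sum()
trans = e_trans / e_trans.sum(axis=1, keepdims=True)
emit = e_emit / e_emit.sum(axis=1, keepdims=True)
```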
If you simply let EM loose unconstrained on a few million words of data, it will learn a garbage HMM…
Should we be concerned?
Merialdo: Setup
Some (discouraging) experiments [Merialdo 94]
Setup:
- Know the set of allowable POS tags for each word
- Fix k training examples to their true labels
  - Learn P(w|t) on these examples
  - Learn P(t | t-1, t-2) on these examples
- On n examples, re-estimate with EM
Note: we know the allowed tags but not their frequencies
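A small sketch of this setup: estimate emission and transition counts from the k labelled sentences, and restrict EM to the dictionary-allowed tags for each word. The toy data and the unsmoothed counts are illustrative assumptions.

```python
# Sketch of the Merialdo-style setup: supervised counts from k labelled
# sentences plus a tag dictionary that constrains EM on the rest.
from collections import Counter, defaultdict

labelled = [  # k supervised sentences as (word, tag) pairs
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
]
tag_dict = {"the": {"DT"}, "dog": {"NN", "VB"}, "barks": {"VBZ", "NNS"}}

emit = defaultdict(Counter)     # counts for P(w | t)
trans = defaultdict(Counter)    # counts for P(t | t-1, t-2)
for sent in labelled:
    tags = ["<s>", "<s>"] + [t for _, t in sent]
    for w, t in sent:
        emit[t][w] += 1
    for t2, t1, t in zip(tags, tags[1:], tags[2:]):
        trans[(t2, t1)][t] += 1

def allowed(word):
    """During EM, only the dictionary-allowed tags are considered."""
    return tag_dict.get(word, set(emit))   # unknown words: any seen tag

print(allowed("dog"))   # {'NN', 'VB'}
```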
Merialdo: Results
Sequence Models?
the president said that the downturn was over
[Diagram: each word in the sentence is labelled with a word class c1 … c8]
Distributional Clustering
the president said that the downturn was over
[Figure: words characterized by the contexts they occur in (e.g. "the __ of", "the __ said", "sources __ that") and grouped with distributionally similar words (e.g. president ~ governor, said ~ reported, the ~ a)]
[Finch and Chater 92, Schütze 93, many others]
Distributional Clustering
Three main variants on the same idea:
- Pairwise similarities and heuristic clustering, e.g. [Finch and Chater 92]; produces dendrograms
- Vector space methods, e.g. [Schütze 93]; models of ambiguity
- Probabilistic methods; various formulations, e.g. [Lee and Pereira 99]
Nearest Neighbors
Dendrograms

Context Distribution Clustering
Basic distributional clustering:
- Characterize each word by its signature = distribution over contexts (adjacent words)
- Group together words with similar signatures

Problems:
- Most words have sparse signatures even with a lot of data
- "Adjacent words" is too superficial: consider "a" vs. "an"

Solution [Clark 00]:
- Signatures over adjacent clusters (circular definition!)
- Start with K+1 clusters (the K top words, plus "other")
- Move similar words from "other" into the K clusters
- Recompute signatures as the clusters change
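As a rough illustration of the basic distributional idea (not Clark's CDC algorithm itself), the sketch below characterizes each word by its distribution over adjacent words and clusters the resulting signatures. The toy corpus, the one-word window, and the use of scikit-learn's KMeans are assumptions made for the example.

```python
# Simplified sketch of distributional clustering: words with similar
# context signatures end up in the same cluster.
from collections import Counter, defaultdict
import numpy as np
from sklearn.cluster import KMeans

corpus = [
    "the president said that the downturn was over".split(),
    "the governor said that the economy was fine".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Signature = counts of left and right neighbours.
sigs = defaultdict(Counter)
for sent in corpus:
    for i, w in enumerate(sent):
        if i > 0:
            sigs[w][("L", sent[i - 1])] += 1
        if i + 1 < len(sent):
            sigs[w][("R", sent[i + 1])] += 1

contexts = sorted({c for s in sigs.values() for c in s})
cidx = {c: j for j, c in enumerate(contexts)}
X = np.zeros((len(vocab), len(contexts)))
for w, s in sigs.items():
    for c, n in s.items():
        X[idx[w], cidx[c]] = n
X = X / X.sum(axis=1, keepdims=True)     # normalize each signature

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
for k in range(4):
    print(k, [w for w in vocab if labels[idx[w]] == k])
```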
Results: CDC
From [Clark 00]
Ambiguous Words
What’s Going On?
- For phoneme structure, HMMs worked well; we needed more than local context
- For part-of-speech induction, HMMs work poorly; it seems to help to have only local context
- One explanation: phonetic coarticulation is a very linear phenomenon, syntax is not
- Another issue: there are lots of correlations an HMM over words can model -- like what?
Other Work
- LOTS more work on learning POS tags
- Some recent work:
  - Morphology-driven models [Clark 03]
  - Contrastive estimation [Smith and Eisner 05]
  - Bayesian inference better than EM? [Johnson 07]
- I'll skip morphology learning for now (guest lecture on the topic later!)
Grammar Induction
Syntactic grammar induction: take a corpus and produce a grammar
- Hard setting: a corpus of sentences only
- Easy setting: a corpus of trees
We'll talk about the easy setting first! Even when you have trees, getting a really good grammar isn't trivial…
Supervised parsing, e.g. [Collins 99, Charniak 97, many others], but we'll focus on the case where the refinements are induced automatically
Probabilistic Context-Free Grammars
A context-free grammar is a tuple <N, T, S, R>
- N: the set of non-terminals
  - Phrasal categories: S, NP, VP, ADJP, etc.
  - Parts of speech (pre-terminals): NN, JJ, DT, VB
- T: the set of terminals (the words)
- S: the start symbol
  - Often written as ROOT or TOP
  - Not usually the sentence non-terminal S
- R: the set of rules
  - Of the form X → Y1 Y2 … Yk, with X, Yi ∈ N
  - Examples: S → NP VP, VP → VP CC VP
  - Also called rewrites, productions, or local trees

A PCFG adds: a top-down production probability per rule, P(Y1 Y2 … Yk | X)
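A minimal sketch of how a PCFG can be represented and how a tree's probability is just the product of its rule probabilities. The grammar fragment and example tree are illustrative, not drawn from any treebank.

```python
# Sketch: a PCFG as a rule table; tree probability = product of rule probabilities.
from math import prod

pcfg = {
    ("S",  ("NP", "VP")):  1.0,
    ("NP", ("DT", "NN")):  0.6,
    ("NP", ("PRP",)):      0.4,
    ("VP", ("VBD", "NP")): 1.0,
    ("DT", ("the",)):      1.0,
    ("NN", ("dog",)):      0.5,
    ("NN", ("cat",)):      0.5,
    ("PRP", ("she",)):     1.0,
    ("VBD", ("saw",)):     1.0,
}

# A tree is (label, children...); leaves are plain strings.
tree = ("S",
        ("NP", ("PRP", "she")),
        ("VP", ("VBD", "saw"),
               ("NP", ("DT", "the"), ("NN", "dog"))))

def rules(t):
    """Yield every (parent, children-labels) production used in the tree."""
    if isinstance(t, str):
        return
    label, kids = t[0], t[1:]
    yield (label, tuple(k if isinstance(k, str) else k[0] for k in kids))
    for k in kids:
        yield from rules(k)

def tree_prob(t):
    return prod(pcfg[r] for r in rules(t))

print(tree_prob(tree))   # 1.0 * 0.4 * 1.0 * 1.0 * 1.0 * 0.6 * 1.0 * 0.5 = 0.12
```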
Example: PCFGs
From Michael Collins
Example: PCFGs

Problems with PCFGs?
- If we do no annotation, these trees differ only in one rule: VP → VP PP vs. NP → NP PP
- The parse will go one way or the other, regardless of the words
- Something in the model has to be sensitive to finer category structure if we're going to disambiguate
- Of course, you could put all the load on the semantics…
Problems with PCFGs?
What’s different between basic PCFG scores here? What (lexical) correlations need to be scored?
PCFGs
PCFGs:
- Define a language, just like CFGs (in principle)
- Assign probabilities to individual trees
- Allow the selection of a best parse for a sentence
- Assign probabilities to sentences
Language Models
- We often want to place distributions over sentences
- Think of these models as soft measures of fluency
- Distinguish between the idea of a distribution over sentences and the particular ones we end up discussing
- Classic solution: n-gram models (we saw variants today)
- N-gram models are (weighted) regular languages
- Natural language is not regular (of course!) … though you'd be surprised at what 5+ gram models trained on enough data can do
- This is a crude (and often useful in system building) model, but there are also language models with more linguistically plausible structure, e.g. PCFGs
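As a small illustration of what placing a distribution over sentences means operationally, here is a sketch that estimates unigram and bigram models from a toy corpus and samples from them, producing output of the same flavor as the samples listed below. The corpus and the boundary tokens are assumptions for the example.

```python
# Sketch: sampling word sequences from unigram and bigram models.
import random
from collections import Counter, defaultdict

corpus = [
    "<s> the president said that the downturn was over </s>".split(),
    "<s> this would be a record november </s>".split(),
]

unigram = Counter(w for sent in corpus for w in sent if w not in ("<s>", "</s>"))
bigram = defaultdict(Counter)
for sent in corpus:
    for prev, w in zip(sent, sent[1:]):
        bigram[prev][w] += 1

def draw(counter):
    words, counts = zip(*counter.items())
    return random.choices(words, weights=counts, k=1)[0]

def sample_unigram(n=10):
    return [draw(unigram) for _ in range(n)]

def sample_bigram(max_len=20):
    out, prev = [], "<s>"
    while len(out) < max_len:
        w = draw(bigram[prev])
        if w == "</s>":
            break
        out.append(w)
        prev = w
    return out

print(sample_unigram())
print(sample_bigram())
```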
Language Model Samples
Unigram:
- [fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, quarter]
- [that, or, limited, the]
- []
- [after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst, too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, …]

Bigram:
- [outside, new, car, parking, lot, of, the, agreement, reached]
- [although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty, seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of, american, brands, vying, for, mr., womack, currently, share, data, incorporated, believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
- [this, would, be, a, record, november]

PCFG:
- [this, quarter, 's, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices]
- [it, could, be, announced, sometime]
- [mr., toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks]
Treebank Sentences

Treebank Parsing in 20 sec
- Assume we have a treebank with coarse parses
- Can take a grammar right off the trees (doesn't work well):
  ROOT → S          1
  S → NP VP .       1
  NP → PRP          1
  VP → VBD ADJP     1
  …
- Better results by enriching the grammar
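A sketch of "taking a grammar right off the trees": count every production in the treebank and normalize per left-hand side. The bracketed example tree (for "He was right.") is an illustrative stand-in for real treebank data.

```python
# Sketch: reading a PCFG off treebank trees by counting productions.
from collections import Counter, defaultdict

# Trees as nested tuples: (label, children...); leaves are strings.
treebank = [
    ("ROOT", ("S", ("NP", ("PRP", "he")),
                   ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))),
                   (".", "."))),
]

rule_counts = defaultdict(Counter)

def count_rules(t):
    if isinstance(t, str):          # a word; no rule here
        return
    lhs, kids = t[0], t[1:]
    rhs = tuple(k if isinstance(k, str) else k[0] for k in kids)
    rule_counts[lhs][rhs] += 1
    for k in kids:
        count_rules(k)

for tree in treebank:
    count_rules(tree)

# Maximum-likelihood rule probabilities: count(X -> alpha) / count(X).
pcfg = {(lhs, rhs): n / sum(rhss.values())
        for lhs, rhss in rule_counts.items()
        for rhs, n in rhss.items()}

for (lhs, rhs), p in sorted(pcfg.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.2f}")
```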
[Figure: example grammar fragments over the symbols DET, ADJ, NOUN, PLURAL NOUN, NP, CONJ, PP]
Treebank Grammar Scale
- Treebank grammars can be enormous!
- As a set of FSTs, the raw grammar has ~10K states (why?)
- Better parsers usually make the grammars larger, not smaller
N-Ary Rules, Grammar States
- We often observe grammar rules like VP → VBD NP PP PP, which are not binary
- We can keep these rules, or assume a more general process that binarizes them through intermediate grammar states such as [VP → VBD NP …] and [VP → VBD NP PP …]
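A sketch of what such a binarization might look like in code; the exact state-naming scheme (and ending with a unary rule) is an illustrative choice, not the only way to do it.

```python
# Sketch: binarizing an n-ary rule into left-to-right "grammar states".
def binarize(lhs, rhs):
    """Turn X -> Y1 ... Yk (k > 2) into a chain of smaller rules."""
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules = []
    state = f"[{lhs} -> {rhs[0]} ...]"
    rules.append((lhs, (rhs[0], state)))
    for i in range(1, len(rhs) - 1):
        nxt = f"[{lhs} -> {' '.join(rhs[:i + 1])} ...]"
        rules.append((state, (rhs[i], nxt)))
        state = nxt
    rules.append((state, (rhs[-1],)))     # final unary completion
    return rules

for lhs, rhs in binarize("VP", ["VBD", "NP", "PP", "PP"]):
    print(lhs, "->", " ".join(rhs))
# VP -> VBD [VP -> VBD ...]
# [VP -> VBD ...] -> NP [VP -> VBD NP ...]
# [VP -> VBD NP ...] -> PP [VP -> VBD NP PP ...]
# [VP -> VBD NP PP ...] -> PP
```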
PCFGs and Independence
- Symbols in a PCFG define independence assumptions:
  - At any node, the material inside that node is independent of the material outside that node, given the label of that node.
  - Any information that statistically connects behavior inside and outside a node must flow through that node.
[Tree fragment: S → NP VP, NP → DT NN; the NP node separates its inside material from its outside material]
Non-Independence I
- Independence assumptions are often too strong.
- Example: the expansion of an NP is highly dependent on the parent of the NP (i.e., subjects vs. objects).
- Also: the subject and object expansions are correlated!

Expansion      All NPs   NPs under S   NPs under VP
NP → PP          11%          9%           23%
NP → DT NN        9%          9%            7%
NP → PRP          6%         21%            4%
Breaking Up the Symbols
- We can relax independence assumptions by encoding dependencies into the PCFG symbols:
  - Parent annotation [Johnson 98]
  - Marking possessive NPs
- What are the most useful "features" to encode?
The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve the statistical fit of the grammar:
  - Parent annotation [Johnson ’98]
  - Head lexicalization [Collins ’99, Charniak ’00]
  - Automatic clustering?
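A sketch of parent annotation in the spirit of [Johnson 98]: each non-terminal is relabelled with its parent's category, so an NP under S and an NP under VP become distinct symbols. The tuple tree encoding, the "^" separator, and the choice to leave pre-terminals unsplit are assumptions of the sketch.

```python
# Sketch: parent-annotating a tree so the grammar read off it can
# distinguish, e.g., subject NPs (NP^S) from object NPs (NP^VP).
def parent_annotate(tree, parent="ROOT"):
    """Return a copy of the tree with each non-terminal marked X^parent."""
    if isinstance(tree, str):            # a word: leave it alone
        return tree
    label, kids = tree[0], tree[1:]
    # Pre-terminals (POS tags over a single word) are left unsplit here.
    if len(kids) == 1 and isinstance(kids[0], str):
        return (label, kids[0])
    new_label = f"{label}^{parent}"
    return (new_label,) + tuple(parent_annotate(k, parent=label) for k in kids)

tree = ("S", ("NP", ("PRP", "he")),
             ("VP", ("VBD", "was"), ("ADJP", ("JJ", "right"))))
print(parent_annotate(tree))
# ('S^ROOT', ('NP^S', ('PRP', 'he')),
#  ('VP^S', ('VBD', 'was'), ('ADJP^VP', ('JJ', 'right'))))
```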
Manual Annotation
- Manually split categories:
  - NP: subject vs. object
  - DT: determiners vs. demonstratives
  - IN: sentential vs. prepositional
- Advantages: fairly compact grammar; linguistic motivations
- Disadvantages: performance leveled out; manually annotated

Model                     F1
Naïve Treebank Grammar    72.6
Klein & Manning ’03       86.3
Automatic Annotation Induction
- Advantages: automatically learned
  - Label all nodes with latent variables
  - Same number k of subcategories for all categories
- Disadvantages: grammar gets too large; most categories are oversplit while others are undersplit

Model                   F1
Klein & Manning ’03     86.3
Matsuzaki et al. ’05    86.7
Learning Latent Annotations
EM algorithm:
- Brackets are known
- Base categories are known
- Only induce subcategories
- Just like Forward-Backward for HMMs
[Figure: a latent-annotated tree with subcategory variables X1 … X7 over the sentence "He was right.", with forward and backward passes indicated]
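A sketch of the initialization step for this kind of learning: each symbol is split into k subsymbols, and each rule's probability is divided evenly over the subsymbol combinations and slightly perturbed to break symmetry before EM. The E-step itself (expected rule counts via an inside-outside pass over each fixed bracketing) is omitted; this only shows the split.

```python
# Sketch: splitting every symbol into k subsymbols to initialize latent-annotation EM.
import itertools, random

def split_grammar(pcfg, k=2, noise=0.01, seed=0):
    """pcfg: {(lhs, (rhs_symbols...)): prob} over unsplit treebank symbols."""
    rng = random.Random(seed)
    split = {}
    for (lhs, rhs), p in pcfg.items():
        for i in range(k):
            sub_lhs = f"{lhs}-{i}"
            # every way of assigning subsymbols to the children
            # (in a real system, words on the right-hand side would not be split)
            for assignment in itertools.product(range(k), repeat=len(rhs)):
                new_rhs = tuple(f"{sym}-{j}" for sym, j in zip(rhs, assignment))
                # uniform share of the mass, plus a small symmetry-breaking perturbation
                q = p / (k ** len(rhs)) * (1.0 + rng.uniform(-noise, noise))
                split[(sub_lhs, new_rhs)] = q
    # renormalize so the rules for each sub-symbol sum to one
    totals = {}
    for (sub_lhs, _), q in split.items():
        totals[sub_lhs] = totals.get(sub_lhs, 0.0) + q
    return {(l, r): q / totals[l] for (l, r), q in split.items()}

base = {("S", ("NP", "VP")): 1.0,
        ("NP", ("DT", "NN")): 1.0,
        ("VP", ("VBD", "NP")): 1.0}
for rule, p in sorted(split_grammar(base).items()):
    print(rule, round(p, 4))
```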
Overview
[Chart: parsing accuracy (F1, 60-90) vs. total number of grammar symbols (50-1650) for k = 1, 2, 4, 8, 16 subcategories per category; accuracy improves with more splitting, up to the limit of computational resources]
Refinement of the DT tag
[Figure: the DT tag refined into subcategories DT-1 … DT-4]

Hierarchical Refinement

Refinement of the , tag
- Splitting all categories the same amount is wasteful

The DT tag revisited
- Oversplit?
Adaptive Splitting
- Want to split complex categories more
- Idea: split everything, then roll back the splits which were least useful
Adaptive Splitting
- Evaluate the loss in likelihood from removing each split:
  loss = (data likelihood with the split reversed) / (data likelihood with the split)
- No loss in accuracy when 50% of the splits are reversed.
Adaptive Splitting Results
[Chart: parsing accuracy (F1, 74-90) vs. total number of grammar symbols (100-1700), comparing flat training, hierarchical training, and hierarchical training with 50% merging]

Model               F1
Previous            88.4
With 50% Merging    89.5
Number of Phrasal Subcategories
[Bar chart: number of learned subcategories (0-40) for each phrasal category (NP, VP, PP, ADVP, S, ADJP, SBAR, QP, WHNP, PRN, NX, SINV, PRT, WHPP, SQ, CONJP, FRAG, NAC, UCP, WHADVP, INTJ, SBARQ, RRC, WHADJP, X, ROOT, LST); NP, VP, and PP receive the most subcategories, while rare categories such as NAC and X receive very few]
Number of Lexical Subcategories
[Bar chart: number of learned subcategories (0-70) for each part-of-speech tag, in decreasing order beginning with NNP, JJ, NNS, NN, VBN, RB, VBG, VB, VBD, CD, IN, VBZ, VBP, DT, …; open-class tags (NNP, JJ, NNS, NN) and frequent closed-class tags (IN, DT, RB, the VBx tags) are heavily split, while punctuation and rare tags receive few subcategories]
Final Results (Accuracy)

       Model                                  ≤ 40 words F1   all F1
ENG    Charniak & Johnson ’05 (generative)        90.1          89.6
ENG    This Work                                  90.6          90.1
GER    Dubey ’05                                  76.3            -
GER    This Work                                  80.8          80.1
CHN    Chiang et al. ’02                          80.0          76.6
CHN    This Work                                  86.3          83.4
Coming Up
- Learning trees from yields alone
  - Classic chunk / merge approaches
  - Obvious approaches (that don't work)
  - More recent methods that do work
- Formal issues
  - Learnability results and what they mean
  - More on nativism vs. empiricism