memms/cmms and crfs
![Page 1: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/1.jpg)
MEMMs/CMMs and CRFs
William W. Cohen
Sep 22, 2010
![Page 2: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/2.jpg)
ANNOUNCEMENTS…
![Page 3: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/3.jpg)
Wiki Pages - HowTo
• http://malt.ml.cmu.edu/mw/index.php/Social_Media_Analysis_10-802_in_Spring_2010#Other_Resources
• Example: http://malt.ml.cmu.edu/mw/index.php/Turney,_ACL_2002
– Key points
• Naming the pages – examples:
– [[Cohen ICML 1995]]
– [[Lin and Cohen ICML 2010]]
– [[Minkov et al IJCAI 2005]]
• Structured links:
– [[AddressesProblem::named entity recognition]]
– [[UsesMethod::absolute discounting]]
– [[RelatedPaper::Pang et al ACL 2002]]
– [[UsesDataset::Citeseer]]
– [[Category::Paper]]
– [[Category::Problem]]
– [[Category::Method]]
– [[Category::Dataset]]
– Rule of 2: Don’t create a page unless you expect 2 inlinks
• A method from a paper that’s not used anywhere else should be described in-line
– No inverse links – but you can emulate these with queries
![Page 4: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/4.jpg)
Wiki Pages – HowTo, cont’d
• To turn in:
– Add them to the wiki
– Add links to them on your user page
– Send me an email with links to each page you want to get graded on
– [I may send back bug reports until people get the hang of this…]
• WhenTo: Three pages by 9/30 at midnight
– Actually 10/1 at dawn is fine.
• Suggestion:
– Think of your project and build pages for the dataset, the problem, and the (baseline) method you plan to use.
![Page 5: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/5.jpg)
Projects
• Some sample projects
– Apply existing method to a new problem
• http://staff.science.uva.nl/~otsur/papers/sarcasmAmazonICWSM10.pdf
– Apply new method to an existing dataset
– Build something that might help you in your research
• E.g., Extract names of people (pundits, politicians, …) from political blogs
• Classify folksonomy tags as person names, place names, …
• On Wed 9/29 - “turn in”:
– One page, covering some subset of:
• What you plan to do with what data
• Why you think it’s interesting
• Any relevant superpowers you might have
• How you plan to evaluate
• What techniques you plan to use
• What question you want to answer
• Who you might work with
– These will be posted on the class web site
• On Friday 10/8:
– Similar abstract from each team
• Team is (preferably) 2-3 people, but I’m flexible
• Main new information: who’s on what team
![Page 6: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/6.jpg)
Conditional Markov Models
![Page 7: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/7.jpg)
What is a symbol?
Ideally we would like to use many, arbitrary, overlapping features of words.
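[Figure: chain of states S_{t-1}, S_t, S_{t+1} with observations O_{t-1}, O_t, O_{t+1}. Each observation is a bundle of features: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, … One example observation: is “Wisniewski”, ends in “-ski”, part of noun phrase.]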
Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …
![Page 8: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/8.jpg)
Stupid HMM tricks
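[Figure: HMM whose start state moves to a “red” state with probability Pr(red) or to a “green” state with probability Pr(green); each of these states loops back to itself, with Pr(green|green) = 1 and Pr(red|red) = 1.]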
Pr(y|x) = Pr(x|y) * Pr(y) / Pr(x)
argmax{y} Pr(y|x) = argmax{y} Pr(x|y) * Pr(y)
= argmax{y} Pr(y) * Pr(x1|y)*Pr(x2|y)*...*Pr(xm|y)
Pr(“I voted for Ralph Nader”|ggggg) =
Pr(g)*Pr(I|g)*Pr(voted|g)*Pr(for|g)*Pr(Ralph|g)*Pr(Nader|g)
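The trick above can be sketched in a few lines: once the HMM enters a class state it never leaves, so scoring the all-green path is the same computation as Naive Bayes. The priors and per-class word probabilities below are hypothetical toy values, not numbers from the lecture.

```python
import math

# Hypothetical toy parameters (not from the slides): two classes that act as
# absorbing HMM states, with class priors and per-class word probabilities.
prior = {"g": 0.5, "r": 0.5}
word_prob = {
    "g": {"I": 0.2, "voted": 0.2, "for": 0.2, "Ralph": 0.2, "Nader": 0.2},
    "r": {"I": 0.3, "voted": 0.3, "for": 0.3, "Ralph": 0.05, "Nader": 0.05},
}

def joint_log_prob(words, y):
    """log [Pr(y) * prod_j Pr(x_j | y)]: the HMM path that stays in state y."""
    return math.log(prior[y]) + sum(math.log(word_prob[y][w]) for w in words)

def classify(words):
    """argmax_y Pr(y | x), which equals argmax_y Pr(x | y) * Pr(y)."""
    return max(prior, key=lambda y: joint_log_prob(words, y))

sentence = "I voted for Ralph Nader".split()
best = classify(sentence)
```

Working in log space avoids underflow on long documents; Pr(x) is never computed because it does not change the argmax.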
![Page 9: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/9.jpg)
From NB to Maxent
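$$\alpha_0 = \Pr(y)/Z, \qquad \alpha_i = \Pr(w_k \mid y)$$
$$f_i(\text{doc}) = i\text{-th } (j,k) \text{ combination}, \qquad f_{j,k}(\text{doc}) = [\text{word } k \text{ appears at position } j \text{ of doc}\ ?\ 1 : 0]$$
$$\Pr(y \mid x) = \frac{1}{Z}\,\Pr(y)\prod_j \Pr(w_j \mid y) = \alpha_0 \prod_i \alpha_i^{f_i(x)}, \quad \text{where } w_j \text{ is word } j \text{ in } x$$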
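A small numeric check may help make the rewrite concrete: the Naive Bayes posterior and the exponentiated weighted sum of indicator features give the same distribution when each weight is the log of the corresponding NB probability. All parameters below are hypothetical toy values, and the code sums the per-position indicators for a word into a single count.

```python
import math

# Hypothetical toy Naive Bayes parameters (illustrative only).
prior = {"pos": 0.6, "neg": 0.4}
word_prob = {
    "pos": {"good": 0.7, "bad": 0.3},
    "neg": {"good": 0.2, "bad": 0.8},
}

def nb_posterior(doc):
    """Pr(y | x) = Pr(y) * prod_j Pr(w_j | y) / Z."""
    scores = {y: prior[y] * math.prod(word_prob[y][w] for w in doc) for y in prior}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def loglinear_posterior(doc):
    """Same model written as exp(lambda_0 + sum_i lambda_i * f_i(x)) / Z."""
    scores = {}
    for y in prior:
        total = math.log(prior[y])  # plays the role of log alpha_0 (before normalizing)
        for w in word_prob[y]:
            # Sum of the position indicators f_{j,k} over j: how often w occurs.
            count = sum(1 for token in doc if token == w)
            total += math.log(word_prob[y][w]) * count
        scores[y] = math.exp(total)
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

doc = ["good", "good", "bad"]
p_nb = nb_posterior(doc)
p_ll = loglinear_posterior(doc)
```

The two functions differ only in bookkeeping, which is the point: Naive Bayes is one particular setting of the weights in a log-linear model.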
![Page 10: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/10.jpg)
From NB to Maxent
$$\Pr(y \mid x) = \frac{1}{Z}\,\Pr(y)\prod_j \Pr(w_j \mid y) = \alpha_0 \prod_i \alpha_i^{f_i(x)}, \quad \text{where } w_j \text{ is word } j \text{ in } x$$
![Page 11: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/11.jpg)
What is a symbol?
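[Figure: chain of states S_{t-1}, S_t, S_{t+1} with observations O_{t-1}, O_t, O_{t+1}. Each observation is a bundle of features: identity of word, ends in “-ski”, is capitalized, is part of a noun phrase, is in a list of city names, is under node X in WordNet, is in bold font, is indented, is in hyperlink anchor, … One example observation: is “Wisniewski”, ends in “-ski”, part of noun phrase.]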
Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history
$$\Pr(s_t \mid x_t, s_{t-1}, s_{t-2}, \ldots)$$
![Page 12: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/12.jpg)
Ratnaparkhi’s MXPOST
• Sequential learning problem: predict POS tags of words.
• Uses MaxEnt model described above.
• Rich feature set.
• To smooth, discard features occurring < 10 times.
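The count cutoff in the last bullet can be sketched as below. The feature templates are hypothetical stand-ins chosen for illustration; they are not Ratnaparkhi's actual feature set.

```python
from collections import Counter

def extract_features(words, i, prev_tag):
    """Hypothetical feature templates for position i (illustrative only)."""
    w = words[i]
    return [
        f"word={w.lower()}",
        f"suffix3={w[-3:].lower()}",
        f"is_cap={w[0].isupper()}",
        f"prev_tag={prev_tag}",
    ]

def frequent_features(tagged_sentences, min_count=10):
    """Keep only features seen at least min_count times in the training data."""
    counts = Counter()
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        tags = [t for _, t in sentence]
        for i in range(len(words)):
            prev_tag = tags[i - 1] if i > 0 else "<s>"
            counts.update(extract_features(words, i, prev_tag))
    return {feat for feat, c in counts.items() if c >= min_count}

# Toy corpus: one sentence repeated ten times plus a single rare word.
corpus = [[("Wisniewski", "NNP"), ("spoke", "VBD")]] * 10 + [[("Kowalski", "NNP")]]
kept = frequent_features(corpus)
```

Here the shared suffix feature survives because both surnames contribute to it, while the one-off word identity feature for the rare name is discarded.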
![Page 13: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/13.jpg)
MXPOST
![Page 14: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/14.jpg)
Inference for MENE
[Figure: lattice with one column of candidate tags B, I, O for each token of the sentence below]
When will prof Cohen post the notes …
![Page 15: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/15.jpg)
Inference for MXPOST
[Figure: lattice with one column of candidate tags B, I, O for each token of the sentence below]
When will prof Cohen post the notes …
$$\Pr(y \mid x) = \prod_i \Pr(y_i \mid x, y_1, \ldots, y_{i-1}) \approx \prod_i \Pr(y_i \mid x, y_{i-k}, \ldots, y_{i-1}) \approx \prod_i \Pr(y_i \mid x, y_{i-1})$$
(Approx view): find best path, weights are now on arcs from state to state.
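A minimal Viterbi sketch for this best-path view follows. The function `local_prob` is a hypothetical stand-in for a trained MaxEnt model of Pr(y_i | x_i, y_{i-1}); its numbers are invented for illustration, and the start tag "O" is an assumption.

```python
import math

TAGS = ["B", "I", "O"]

def local_prob(word, prev_tag):
    """Hypothetical stand-in for a trained model Pr(y_i | x_i, y_{i-1})."""
    if word == "prof":
        return {"B": 0.7, "I": 0.1, "O": 0.2}
    if word == "Cohen":
        if prev_tag in ("B", "I"):
            return {"B": 0.1, "I": 0.8, "O": 0.1}
        return {"B": 0.7, "I": 0.1, "O": 0.2}
    return {"B": 0.05, "I": 0.05, "O": 0.9}

def viterbi(words, start="O"):
    """Best tag path under prod_i Pr(y_i | x_i, y_{i-1}), computed in log space."""
    # best[y] = (log-probability of the best path ending in y, that path)
    best = {start: (0.0, [])}
    for word in words:
        new_best = {}
        for prev, (score, path) in best.items():
            probs = local_prob(word, prev)
            for y in TAGS:
                cand = score + math.log(probs[y])
                if y not in new_best or cand > new_best[y][0]:
                    new_best[y] = (cand, path + [y])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

words = "When will prof Cohen post the notes".split()
tags = viterbi(words)
```

Each arc weight depends on the previous tag, so the best path is found by dynamic programming over the lattice columns rather than by picking the best tag per word independently.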
![Page 16: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/16.jpg)
Inference for MXPOST
[Figure: lattice with one column of candidate tags B, I, O for each token of the sentence below]
When will prof Cohen post the notes …
More accurately: find total flow to each node, weights are now on arcs from state to state.
$$\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y') \, \Pr(Y_t = y \mid x, Y_{t-1} = y')$$
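The total-flow recursion can be sketched as follows, again with a hypothetical `local_prob` standing in for the trained model (here it only looks at the current word) and an assumed start tag of "O".

```python
TAGS = ["B", "I", "O"]

def local_prob(word, prev_tag):
    """Hypothetical stand-in for a trained model Pr(Y_t = y | x, Y_{t-1} = y')."""
    if word == "prof":
        return {"B": 0.7, "I": 0.1, "O": 0.2}
    if word == "Cohen":
        if prev_tag in ("B", "I"):
            return {"B": 0.1, "I": 0.8, "O": 0.1}
        return {"B": 0.7, "I": 0.1, "O": 0.2}
    return {"B": 0.05, "I": 0.05, "O": 0.9}

def forward(words, start="O"):
    """Flow into each node: sum over previous tags of flow times arc weight."""
    alpha = {y: (1.0 if y == start else 0.0) for y in TAGS}
    history = []
    for word in words:
        new_alpha = {y: 0.0 for y in TAGS}
        for prev, mass in alpha.items():
            probs = local_prob(word, prev)
            for y in TAGS:
                new_alpha[y] += mass * probs[y]
        alpha = new_alpha
        history.append(alpha)
    return history

words = "When will prof Cohen post the notes".split()
history = forward(words)
```

Because every node's outgoing arcs sum to one, the flow in each column of `history` also sums to one, so each column can be read as a distribution over the tag at that position.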
![Page 17: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/17.jpg)
Inference for MXPOST
[Figure: lattice with one column of candidate tags B, I, O for each token of the sentence below]
When will prof Cohen post the notes …
$$\Pr(y \mid x) = \prod_i \Pr(y_i \mid x, y_1, \ldots, y_{i-1}) \approx \prod_i \Pr(y_i \mid x, y_{i-k}, \ldots, y_{i-1}) \approx \prod_i \Pr(y_i \mid x, y_{i-2}, y_{i-1})$$
Find best path? tree? Weights are on hyperedges
![Page 18: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/18.jpg)
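Inference for MxPOST
[Figure: beam-search tree over tag sequences for the sentence below, with nodes I, O at the first level and iI, iO, oI, oO at the second]
When will prof Cohen post the notes …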
Beam search is alternative to Viterbi:
at each stage, find all children, score them, and discard all but the top n states
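The procedure just described can be sketched as below, using the two tags I and O from the figure. The local model is again a hypothetical stand-in with invented numbers, and the start tag "O" is an assumption.

```python
import math

TAGS = ["I", "O"]

def local_prob(word, prev_tag):
    """Hypothetical stand-in for a trained model Pr(y_i | x_i, y_{i-1})."""
    if word == "prof":
        return {"I": 0.7, "O": 0.3}
    if word == "Cohen":
        return {"I": 0.9, "O": 0.1} if prev_tag == "I" else {"I": 0.6, "O": 0.4}
    return {"I": 0.1, "O": 0.9}

def beam_search(words, beam_width=2, start="O"):
    """Keep only the beam_width best partial tag sequences at each stage."""
    beam = [(0.0, [])]  # (log-probability, tag path so far)
    for word in words:
        children = []
        for score, path in beam:
            prev = path[-1] if path else start
            probs = local_prob(word, prev)
            for y in TAGS:
                children.append((score + math.log(probs[y]), path + [y]))
        # Score all children, then discard all but the top beam_width.
        children.sort(key=lambda c: c[0], reverse=True)
        beam = children[:beam_width]
    return beam[0][1]

words = "When will prof Cohen post the notes".split()
tags = beam_search(words)
```

Unlike Viterbi, this search is not guaranteed to return the best path, since a prefix pruned early could have led to the best complete sequence; in exchange it works unchanged when the model conditions on a long tag history.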
![Page 19: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/19.jpg)
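Inference for MxPOST
[Figure: beam-search tree over tag sequences for the sentence below, with nodes I, O at the first level and iI, iO, oI, oO at the second]
When will prof Cohen post the notes …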
Beam search is alternative to Viterbi:
at each stage, find all children, score them, and discard all but the top n states
[Figure, continued: third level of the tree with nodes oiI, oiO, ioI, ioO, ooI, ooO]
![Page 20: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/20.jpg)
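Inference for MxPOST
[Figure: beam-search tree over tag sequences for the sentence below, with nodes I, O at the first level and iI, iO, oI, oO at the second]
When will prof Cohen post the notes …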
Beam search is alternative to Viterbi:
at each stage, find all children, score them, and discard all but the top n states
[Figure, continued: third level of the tree with nodes oiI, oiO, ioI, ioO, ooI, ooO; fourth level with nodes oiiI, oiiO, iooI, iooO, oooI, oooO]
![Page 21: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/21.jpg)
MXPost results
• State of art accuracy (for 1996)
• Same approach used successfully for several other sequential classification steps of a stochastic parser (also state of art).
• Same (or similar) approaches used for NER by Borthwick, Malouf, Manning, and others.
![Page 22: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/22.jpg)
Freitag, McCallum, Pereira
![Page 23: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/23.jpg)
MEMMs
• Basic difference from ME tagging:
– ME tagging: previous state is feature of MaxEnt classifier
– MEMM: build a separate MaxEnt classifier for each state.
• Can build any HMM architecture you want; e.g., parallel nested HMMs, etc.
• Data is fragmented: examples where previous tag is “proper noun” give no information about learning tags when previous tag is “noun”
– Mostly a difference in viewpoint
– MEMM does allow possibility of “hidden” states and Baum-Welch-like training
– Viterbi is the most natural inference scheme
![Page 24: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/24.jpg)
MEMM task: FAQ parsing
![Page 25: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/25.jpg)
MEMM features
![Page 26: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/26.jpg)
MEMMs
![Page 27: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/27.jpg)
Conditional Random Fields
![Page 28: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/28.jpg)
Implications of the MEMM model
• Does this do what we want?
• Q: does Y[i-1] depend on X[i+1]?
– “a node is conditionally independent of its non-descendants given its parents”
• Q: what is Y[0] for the sentence “Qbbzzt of America Inc announced layoffs today in …”
![Page 29: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/29.jpg)
Inference for MXPOST
[Figure: lattice with one column of candidate tags B, I, O for each token of the sentence below]
When will prof Cohen post the notes …
$$\Pr(y \mid x) = \prod_i \Pr(y_i \mid x, y_1, \ldots, y_{i-1}) \approx \prod_i \Pr(y_i \mid x, y_{i-k}, \ldots, y_{i-1}) \approx \prod_i \Pr(y_i \mid x, y_{i-1})$$
(Approx view): find best path, weights are now on arcs from state to state.
![Page 30: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/30.jpg)
Inference for MXPOST
[Figure: lattice with one column of candidate tags B, I, O for each token of the sentence below]
When will prof Cohen post the notes …
More accurately: find total flow to each node, weights are now on arcs from state to state.
$$\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y') \, \Pr(Y_t = y \mid x, Y_{t-1} = y')$$
Flow out of a node is always fixed:
$$\forall y': \quad \sum_{y} \Pr(Y_t = y \mid x, Y_{t-1} = y') = 1$$
![Page 31: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/31.jpg)
Label Bias Problem
• Consider this MEMM, and enough training data to perfectly model it:
Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3= 0.5 * 1 * 1
Pr(0453|rib) = Pr(4|0,r)/Z1’ * Pr(5|4,i)/Z2’ * Pr(3|5,b)/Z3’= 0.5 * 1 *1
Pr(0123|rib)=1
Pr(0453|rob)=1
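The effect of local normalization in this example can be sketched in code. The state graph below is an assumption reconstructed from the path names on the slide (0-1-2-3 for "rob", 0-4-5-3 for "rib"), with balanced training data so that state 0 splits evenly on "r".

```python
# Assumed topology (the slide's figure is not reproduced in this transcript):
# 0 -> 1 -> 2 -> 3 spells "rob" and 0 -> 4 -> 5 -> 3 spells "rib".
SUCCESSORS = {0: [1, 4], 1: [2], 2: [3], 4: [5], 5: [3]}

def local_prob(state, nxt, obs):
    """Pr(next | state, obs), normalized over the successors of `state`.

    Assumes balanced training data, so state 0 splits its mass evenly on 'r'.
    A state with a single successor must give it probability 1 for any obs,
    which is why `obs` cannot influence the result here.
    """
    succ = SUCCESSORS[state]
    return 1.0 / len(succ) if nxt in succ else 0.0

def path_prob(path, word):
    """Product of the locally normalized transition probabilities along `path`."""
    p = 1.0
    for state, nxt, obs in zip(path, path[1:], word):
        p *= local_prob(state, nxt, obs)
    return p

rob_on_rob = path_prob([0, 1, 2, 3], "rob")
rob_on_rib = path_prob([0, 1, 2, 3], "rib")
```

Under these assumptions `rob_on_rob` and `rob_on_rib` are equal: after the first step the observations "o" and "i" cannot change the score, because states 1 and 4 each have only one outgoing transition and local normalization forces it to carry all of the mass.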
![Page 32: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/32.jpg)
How important is label bias?
• Could be avoided in this case by changing structure:
• Our models are always wrong – is this “wrongness” a problem?
• See Klein & Manning’s paper for more on this….
![Page 33: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/33.jpg)
Another view of label bias [Sha & Pereira]
So what’s the alternative?
![Page 34: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/34.jpg)
Inference for MXPOST
[Figure: lattice with one column of candidate tags B, I, O for each token of the sentence below]
When will prof Cohen post the notes …
More accurately: find total flow to each node, weights are now on arcs from state to state.
$$\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y') \, \Pr(Y_t = y \mid x, Y_{t-1} = y')$$
Flow out of a node is always fixed:
$$\forall y': \quad \sum_{y} \Pr(Y_t = y \mid x, Y_{t-1} = y') = 1$$
![Page 35: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/35.jpg)
Another max-flow scheme
[Figure: lattice with one column of candidate tags B, I, O for each token of the sentence below]
When will prof Cohen post the notes …
More accurately: find total flow to each node, weights are now on arcs from state to state.
$$\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y') \, \Pr(Y_t = y \mid x, Y_{t-1} = y')$$
Flow out of a node is always fixed:
$$\forall y': \quad \sum_{y} \Pr(Y_t = y \mid x, Y_{t-1} = y') = 1$$
![Page 36: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/36.jpg)
Another max-flow scheme: MRFs
[Figure: lattice with one column of candidate tags B, I, O for each token of the sentence below]
When will prof Cohen post the notes …
Goal is to learn how to weight edges in the graph:
• weight(yi,yi+1) = 2*[(yi=B or I) and isCap(xi)] + 1*[yi=B and isFirstName(xi)] - 5*[yi+1≠B and isLower(xi) and isUpper(xi+1)]
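The edge-weight formula can be written directly as a function. The predicates and the tiny first-name list are hypothetical; in particular, treating isUpper as "starts with an uppercase letter" and isLower as "entirely lowercase" is an assumption, and the weights 2, 1 and -5 are the hand-set values from the slide rather than learned ones.

```python
FIRST_NAMES = {"William"}  # hypothetical gazetteer, for illustration only

def is_cap(word):
    """True if the word starts with an uppercase letter."""
    return word[:1].isupper()

def is_first_name(word):
    """True if the word is in the (hypothetical) first-name list."""
    return word in FIRST_NAMES

def edge_weight(y_i, y_next, x_i, x_next):
    """Weight on the lattice edge from tag y_i (at word x_i) to tag y_next."""
    return (
        2 * int(y_i in ("B", "I") and is_cap(x_i))
        + 1 * int(y_i == "B" and is_first_name(x_i))
        - 5 * int(y_next != "B" and x_i.islower() and is_cap(x_next))
    )

w_name = edge_weight("B", "I", "William", "Cohen")
w_miss = edge_weight("O", "O", "prof", "Cohen")
```

Each bracketed test is an indicator feature and each coefficient is its weight, so learning a model of this kind amounts to choosing the coefficients from data.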
![Page 37: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/37.jpg)
Another max-flow scheme: MRFs
[Figure: lattice with one column of candidate tags B, I, O for each token of the sentence below]
When will prof Cohen post the notes …
Find total flow to each node, weights are now on edges from state to state. Goal is to learn how to weight edges in the graph, given features from the examples.
![Page 38: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/38.jpg)
CRFs vs MEMMs
• MEMMs:
– Sequence classification f: x → y is reduced to many cases of ordinary classification, f: xi → yi
– …combined with Viterbi or beam search
• CRFs:
– Sequence classification f: x → y is done by:
• Converting x, Y to an MRF
• Using “flow” computations on the MRF to compute the best y|x
[Figure: the MEMM drawn as a chain over x1 … x6 and y1 … y6 with local conditionals on the arcs, such as Pr(Y|x2,y1), Pr(Y|x2,y1’), Pr(Y|x4,y3), Pr(Y|x5,y5); the CRF drawn as an MRF over the same nodes with potentials φ(Y1,Y2), φ(Y2,Y3), ….]
![Page 39: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/39.jpg)
The math: Review of maxent
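$$\Pr(x) = \alpha_0 \prod_i \alpha_i^{f_i(x)} \propto \exp\Big(\sum_i \lambda_i f_i(x)\Big)$$
$$\Pr(x, y) \propto \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$
$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)}$$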
![Page 40: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/40.jpg)
Review of maxent/MEMM/CMMs
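$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\sum_i \lambda_i f_i(x, y')\big)} = \frac{\exp\big(\sum_i \lambda_i f_i(x, y)\big)}{Z(x)}$$
for MEMM:
$$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid y_{j-1}, x_j) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)}$$
We know how to compute this.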
![Page 41: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/41.jpg)
Details on CMMs
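$$\Pr(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j \Pr(y_j \mid y_{j-1}, x_j) = \prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)}$$
$$\prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)} = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{\prod_j Z(x_j)}, \quad \text{where } F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})$$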
![Page 42: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/42.jpg)
From CMMs to CRFs
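$$\prod_j \frac{\exp\big(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x_j)} = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{\prod_j Z(x_j)}, \quad \text{where } F_i(x, y) = \sum_j f_i(x_j, y_j, y_{j-1})$$
Recall why we’re unhappy: we don’t want local normalization
New model:
$$\Pr(y \mid x) = \frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)}$$
How to compute this?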
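To make the new model concrete, here is a brute-force sketch that scores a whole tag sequence and divides by a single global Z(x). The feature weights inside `local_score` are hypothetical, the start tag is an assumption, and enumerating every tag sequence is only feasible for tiny inputs.

```python
import math
from itertools import product

TAGS = ["name", "nonName"]

def local_score(x_j, y_j, y_prev):
    """Hypothetical sum_i lambda_i * f_i(x_j, y_j, y_{j-1}) at one position."""
    score = 0.0
    if y_j == "name":
        score += 2.0 if x_j[:1].isupper() else -1.0  # capitalization evidence
        if y_prev == "name":
            score += 0.5                             # names tend to continue
    return score

def global_score(x, y, start="nonName"):
    """sum_i lambda_i F_i(x, y), where F_i sums the local feature f_i over j."""
    prev, total = start, 0.0
    for x_j, y_j in zip(x, y):
        total += local_score(x_j, y_j, prev)
        prev = y_j
    return total

def crf_prob(x, y):
    """Pr(y | x) = exp(global score) / Z(x), with one normalizer per sentence."""
    z = sum(math.exp(global_score(x, cand)) for cand in product(TAGS, repeat=len(x)))
    return math.exp(global_score(x, y)) / z

x = ["prof", "Cohen", "posts"]
best = max(product(TAGS, repeat=len(x)), key=lambda y: crf_prob(x, y))
```

The only change from the CMM is the denominator: scores are summed over the whole sequence before normalizing, so a position with strong evidence can outweigh earlier choices instead of being forced to pass on a fixed amount of mass.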
![Page 43: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/43.jpg)
What’s the new model look like?
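$$\frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \frac{\exp\big(\sum_i \sum_j \lambda_i f_i(x_j, y_j, y_{j-1})\big)}{Z(x)}$$
[Figure: graphical model over observations x1, x2, x3 and labels y1, y2, y3]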
What’s independent? If fi is HMM-like and depends on only xj,yj or yj,yj-1
![Page 44: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/44.jpg)
What’s the new model look like?
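$$\frac{\exp\big(\sum_i \lambda_i F_i(x, y)\big)}{Z(x)} = \frac{\exp\big(\sum_i \sum_j \lambda_i f_i(x, y_j, y_{j-1})\big)}{Z(x)}$$
[Figure: graphical model with a single observation node x and labels y1, y2, y3]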
What’s independent now??
![Page 45: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/45.jpg)
CRF learning – from Sha & Pereira
![Page 46: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/46.jpg)
CRF learning – from Sha & Pereira
![Page 47: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/47.jpg)
CRF learning – from Sha & Pereira
Something like forward-backward
Idea:
• Define matrix of y,y’ “affinities” at stage i
• Mi[y,y’] = “unnormalized probability” of transition from y to y’ at stage i
• Mi * Mi+1 = “unnormalized probability” of any path through stages i and i+1
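The matrix idea can be checked numerically: multiplying the stage matrices sums the unnormalized weights of all paths, which gives the normalizer Z(x) without enumerating paths. The scoring function below is a hypothetical stand-in for the learned weights, and the start tag is an assumption.

```python
import math
from itertools import product

TAGS = ["name", "nonName"]

def local_score(x_j, y_prev, y):
    """Hypothetical log-affinity of moving from y_prev to y while reading x_j."""
    score = 0.0
    if y == "name":
        score += 2.0 if x_j[:1].isupper() else -1.0
        if y_prev == "name":
            score += 0.5
    return score

def transition_matrix(x_j):
    """M[y][y'] = unnormalized probability of the transition y -> y' at this stage."""
    return [[math.exp(local_score(x_j, y, y2)) for y2 in TAGS] for y in TAGS]

def partition_function(x, start="nonName"):
    """Z(x) as a start vector multiplied through every stage matrix."""
    row = [1.0 if y == start else 0.0 for y in TAGS]
    n = len(TAGS)
    for x_j in x:
        m = transition_matrix(x_j)
        row = [sum(row[a] * m[a][b] for a in range(n)) for b in range(n)]
    return sum(row)

def brute_force_z(x, start="nonName"):
    """Same quantity by explicit enumeration of every tag path (tiny inputs only)."""
    total = 0.0
    for path in product(TAGS, repeat=len(x)):
        prev, score = start, 0.0
        for x_j, y in zip(x, path):
            score += local_score(x_j, prev, y)
            prev = y
        total += math.exp(score)
    return total

x = ["prof", "Cohen", "posts", "notes"]
z_fast = partition_function(x)
z_slow = brute_force_z(x)
```

The matrix version costs time linear in the sentence length, while the enumeration grows exponentially; the two agree up to floating-point error.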
![Page 48: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/48.jpg)
[Figure: observation node x above two copies of the label chain y1, y2, y3]
![Page 49: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/49.jpg)
Forward backward ideas
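[Figure: three columns of states name / nonName; the arcs between the first two columns carry weights a, b, c, d and the arcs between the last two columns carry weights e, f, g, h.]
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} e & f \\ g & h \end{pmatrix} = \begin{pmatrix} ae + bg & af + bh \\ \cdots & \cdots \end{pmatrix}$$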
![Page 50: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/50.jpg)
CRF learning – from Sha & Pereira
![Page 51: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/51.jpg)
Sha & Pereira results
CRF beats MEMM (McNemar’s test); MEMM probably beats voted perceptron
![Page 52: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/52.jpg)
Sha & Pereira results
in minutes, 375k examples
![Page 53: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/53.jpg)
Klein & Manning: Conditional Structure vs Estimation
![Page 54: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/54.jpg)
Task 1: WSD (Word Sense Disambiguation)
Bush’s election-year ad campaign will begin this summer, with... (sense1)
Bush whacking is tiring but rewarding—who wants to spend all their time on marked trails? (sense2)
Class is sense1/sense2, features are context words.
![Page 55: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/55.jpg)
Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model:
Use conditional rule to predict sense s from context-word observations o. Standard NB training maximizes “joint likelihood” under independence assumption
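The two objectives contrasted in this part of the lecture can be computed side by side for a fixed Naive Bayes model. The parameters and the two training examples below are hypothetical toy values; the sketch only evaluates the objectives and does not optimize them.

```python
import math

# Hypothetical NB parameters for two senses (illustrative values only).
prior = {"sense1": 0.5, "sense2": 0.5}
word_prob = {
    "sense1": {"election": 0.6, "trails": 0.1, "campaign": 0.3},
    "sense2": {"election": 0.1, "trails": 0.7, "campaign": 0.2},
}
data = [("sense1", ["election", "campaign"]), ("sense2", ["trails"])]

def log_joint(s, o):
    """log Pr(s, o) = log Pr(s) + sum_j log Pr(o_j | s)."""
    return math.log(prior[s]) + sum(math.log(word_prob[s][w]) for w in o)

def joint_log_likelihood(examples):
    """JL objective: sum over examples of log Pr(s, o)."""
    return sum(log_joint(s, o) for s, o in examples)

def conditional_log_likelihood(examples):
    """CL objective: sum over examples of log Pr(s | o), same functional form."""
    total = 0.0
    for s, o in examples:
        log_z = math.log(sum(math.exp(log_joint(s2, o)) for s2 in prior))
        total += log_joint(s, o) - log_z
    return total

jl = joint_log_likelihood(data)
cl = conditional_log_likelihood(data)
```

Both functions share the same parameters; they differ only in whether the probability of the observations themselves is charged to the model, which is the distinction between standard NB training and conditional training.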
![Page 56: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/56.jpg)
Task 1: WSD (Word Sense Disambiguation)
Model 2: Keep same functional form, but maximize conditional likelihood (sound familiar?)
or maybe SenseEval score:
or maybe even:
![Page 57: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/57.jpg)
In other words…
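$$\Pr(Y \mid X) \text{ depends on features } \{f_i\} = \{f_{j,x,y}\}, \text{ where } f_{j,x,y}(X, Y) = \begin{cases} 1 & \text{if } X_j = x \text{ and } Y = y \\ 0 & \text{else} \end{cases}$$
$$\Pr(Y \mid X) \propto \exp\Big(\sum_i \lambda_i f_i(X, Y)\Big)$$
Naïve Bayes: $\lambda_{j,x,y} = \log \Pr(X_j = x \mid Y = y)$
MaxEnt: $\lambda_i$ chosen to maximize $\prod_t \Pr(Y_t \mid X_t)$
Different “optimization goals”…
… or, dropping a constraint about f’s and λ’s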
![Page 58: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/58.jpg)
Task 1: WSD (Word Sense Disambiguation)
• Optimize JL with std NB learning
• Optimize SCL, CL with conjugate gradient
– Also over “non-deficient models” (?) using Lagrange penalties to enforce “soft” version of deficiency constraint
– I think this makes sure non-conditional version is a valid probability
• “Punt” on optimizing accuracy
• Penalty for extreme predictions in SCL
![Page 59: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/59.jpg)
![Page 60: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/60.jpg)
Conclusion: maxent beats NB?
All generalizations are wrong?
![Page 61: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/61.jpg)
Task 2: POS Tagging
• Sequential problem
• Replace NB with HMM model.
• Standard algorithms maximize joint likelihood
• Claim: keeping the same model but maximizing conditional likelihood leads to a CRF
– Is this true?
• Alternative is conditional structure (CMM)
![Page 62: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/62.jpg)
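$$\Pr(Y \mid X) \text{ depends on features } \{f_i\} = \{f^s_{j,x,y}\} \cup \{f^t_{j,y,y'}\}$$
$$\text{where } f^s_{j,x,y}(X, Y) = \begin{cases} 1 & \text{if } X_j = x \text{ and } Y_j = y \\ 0 & \text{else} \end{cases} \quad \text{and} \quad f^t_{j,y,y'}(X, Y) = \begin{cases} 1 & \text{if } Y_j = y \text{ and } Y_{j+1} = y' \\ 0 & \text{else} \end{cases}$$
$$\Pr(Y \mid X) \propto \prod_{j=1}^{|X|} \exp\Big(\sum_i \lambda_i f_i(X, Y)\Big)$$
HMM: $\lambda^s_{j,x,y} = \log \Pr(X_j = x \mid Y_j = y)$ and $\lambda^t_{j,y,y'} = \log \Pr(Y_{j+1} = y' \mid Y_j = y)$
CRF: $\lambda_i$ chosen to maximize $\prod_t \Pr(Y_t \mid X_t)$
that is,
$$\Pr(Y \mid X) \propto \prod_{j=1}^{|X|} \exp\Big(\sum_{i = j,*,*} \lambda_{j,*,*} \, f_{j,*,*}(X, Y)\Big)$$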
![Page 63: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/63.jpg)
Using conditional structure vs maximizing conditional likelihood
CMM factors Pr(s,o) into Pr(s|o)Pr(o).
For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e., the JL estimate equals the CL estimate for Pr(s|o)
![Page 64: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/64.jpg)
Task 2: POS Tagging
Experiments with a simple feature set:
For fixed model, CL is preferred to JL (CRF beats HMM)
For fixed objective, HMM is preferred to MEMM/CMM
![Page 65: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/65.jpg)
Error analysis for POS tagging
• Label bias is not the issue:
– state-state dependencies are weak compared to observation-state dependencies
– too much emphasis on observation, not enough on previous states (“observation bias”)
– put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...
![Page 66: MEMMs/CMMs and CRFs](https://reader036.vdocuments.net/reader036/viewer/2022062500/5681598d550346895dc6d538/html5/thumbnails/66.jpg)
Error analysis for POS tagging