Text Models Continued
HMM and PCFGs
Recap
• So far we have discussed two different models for text:
– Bag of Words (BOW), where we introduced TF-IDF
  • Location of words is not captured
– Collocation (n-grams)
  • Location of words is captured, along with relations between words in a “sliding window” of size n
  • Problem: small n is not realistic enough; for large n there are no statistics
Today
• Two models that allow us to capture language syntax, with probabilities for words in different parts of speech
• The first is based on a Markov Chain (a Probabilistic Finite State Machine), the second on a Probabilistic Context-Free Grammar
Example Application: Part-of-Speech Tagging
• Goal: for every word in a sentence, infer its part of speech
– Verb, noun, adjective…
• Perhaps we are interested (more) in documents where a given word W is often the sentence subject?
• Part-of-speech tagging is useful for:
– Ranking
– Machine translation
– Word-sense disambiguation
– …
Part-of-Speech Tagging
• Tag this word. This word is a tag.
• He dogs like a flea
• The can is in the fridge
• The sailor dogs me every day
A Learning Problem
• Training set: a tagged corpus
– Most famous is the Brown Corpus, with about 1M words
• The goal is to learn a model from the training set, and then perform tagging of untagged text
• Performance is tested on a test set
Different Learning Approaches
• Supervised: Training corpus is tagged by humans
• Unsupervised: Training corpus isn’t tagged
• Partly supervised: Training corpus isn’t tagged, but we have a dictionary giving possible tags for each word
Training set and test set
• In a learning problem, the common practice is to split the annotated examples into a training set and a test set
• The training set is used to “train” the model, e.g. to learn its parameters
• The test set is used to test how well the model is doing
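A minimal sketch of such a split (the 90/10 ratio is an arbitrary choice), using NLTK's copy of the Brown corpus mentioned above:

```python
# Illustrative train/test split of a tagged corpus.
import random
from nltk.corpus import brown

sents = list(brown.tagged_sents())
random.seed(0)                      # make the split reproducible
random.shuffle(sents)
cut = int(0.9 * len(sents))
train_set, test_set = sents[:cut], sents[cut:]
```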
Simple Algorithm
• Assign to each word its most popular tag in the training set (a code sketch follows below)
• Bag-of-words approach
• Problem: Ignores context
• “Dogs” and “tag” will then always be tagged as nouns…
• “Can” will always be tagged as a verb
• Still, this achieves around 80% correctness on real-life test sets
• Can we do better?
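A minimal sketch of this baseline (function names are illustrative):

```python
# Baseline tagger: each word gets its most frequent tag from the training set.
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_common_tag, default="NN"):
    # Unknown words fall back to a default tag (noun is a common choice).
    return [(w, most_common_tag.get(w, default)) for w in words]
```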
Hidden Markov Model (HMM)
• Model: sentences are generated by a probabilistic process
• In particular, a Markov Chain whose states correspond to Parts-of-Speech
• Transitions are probabilistic
• In each state a word is output
– The output word is again chosen probabilistically, based on the state
HMM
• An HMM is:
– A set of N states
– A set of M symbols (words)
– An N×N matrix of transition probabilities Ptrans
– A vector of size N of initial state probabilities Pstart
– An N×M matrix of emission probabilities Pout (a sketch as arrays follows below)
• “Hidden” because we see only the outputs, not the sequence of states traversed
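A sketch of the five components as arrays (numpy; the uniform values are placeholders, not learned probabilities):

```python
import numpy as np

N, M = 3, 5                          # e.g. 3 tags, a 5-word vocabulary
Pstart = np.full(N, 1.0 / N)         # initial state probabilities (size N)
Ptrans = np.full((N, N), 1.0 / N)    # transition probabilities (N x N)
Pout   = np.full((N, M), 1.0 / M)    # emission probabilities (N x M)

assert np.allclose(Ptrans.sum(axis=1), 1.0)   # each row is a distribution
assert np.allclose(Pout.sum(axis=1), 1.0)
```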
Example
3 Fundamental Problems
1) Given an observation sequence, find the most likely hidden state sequence: this is tagging
2) Compute the probability of a given observation sequence (= sentence)
3) Given a training set, find the model that would make the observations most likely
Tagging
• Find the most likely sequence of states that led to an observed output sequence
– We know what the output (= sentence) is, and want to trace back its generation, i.e. the sequence of states taken in the hidden model
– Each state is a tag
• How many possible sequences are there?
Viterbi Algorithm
• Dynamic Programming
• Vt,k is the probability of the most probable state sequence:
– generating the first t + 1 observations (X0,…,Xt)
– and terminating at state k

V0,k = Pstart(k) · Pout(k, X0)
Vt,k = Pout(k, Xt) · maxk’ {Vt−1,k’ · Ptrans(k’, k)}
Finding the path
• Note that we are interested in the most likely path, not only in its probability
• So we need to keep track at each point of the argmax– Combine them to form a sequence
• What about top-k?
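Combining the recursion with the backpointer bookkeeping just described, a minimal sketch (assuming the Pstart/Ptrans/Pout arrays from the earlier sketch, and a sentence given as a list of symbol indices):

```python
import numpy as np

def viterbi(obs, Pstart, Ptrans, Pout):
    N, T = Ptrans.shape[0], len(obs)
    V = np.zeros((T, N))              # V[t, k]: best prob ending in state k at t
    back = np.zeros((T, N), dtype=int)

    V[0] = Pstart * Pout[:, obs[0]]
    for t in range(1, T):
        for k in range(N):
            scores = V[t - 1] * Ptrans[:, k]       # V[t-1, k'] * Ptrans(k', k)
            back[t, k] = np.argmax(scores)
            V[t, k] = Pout[k, obs[t]] * scores[back[t, k]]

    # Follow the backpointers from the best final state.
    path = [int(np.argmax(V[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path)), V[T - 1].max()
```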
Complexity
• O(T*|S|^2)
• Where T is the sequence (=sentence) length, |S| is the number of states (= number of possible tags)
Forward Algorithm
• αt(k) is the probability of seeing the sequence X0…Xt and terminating at state k
Computing the probabilities
α0(k) = Pstart(k) · Pout(k, X0)
αt(k) = Pout(k, Xt) · Σk’ {αt−1(k’) · Ptrans(k’, k)}
P(X0,…,Xn) = Σk αn(k)
Learning the HMM probabilities
• Expectation-Maximization Algorithm (Baum-Welch)
1. Choose initial probabilities
2. Compute Eij, the expected number of transitions from i to j while generating the training sequences, for each i, j (see next)
3. Set the probability of transition from i to j to be Eij / (Σk Eik)
4. Similarly for the emission probabilities
5. Repeat 2-4 using the new model, until convergence
Forward-backward
• Forward probabilities: αt(k) is the probability of seeing the sequence X0…Xt and terminating at state k
• Backward probabilities: βt(k) is the probability of seeing the sequence Xt+1…Xn given that the Markov process is at state k at time t
Computing the probabilities

Forward algorithm:
α0(k) = Pstart(k) · Pout(k, X0)
αt(k) = Pout(k, Xt) · Σk’ {αt−1(k’) · Ptrans(k’, k)}
P(X0,…,Xn) = Σk αn(k)

Backward algorithm:
βt(k) = P(Xt+1…Xn | state at time t is k)
βt(k) = Σk’ {Ptrans(k, k’) · Pout(k’, Xt+1) · βt+1(k’)}
βn(k) = 1 for all k
P(X0,…,Xn) = Σk β0(k) · Pstart(k) · Pout(k, X0)
Also: P(X0,…,Xn) = Σk αt(k) · βt(k), for any t
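Both passes translate directly into code; a sketch in the same array notation (names as in the earlier sketches):

```python
import numpy as np

def forward(obs, Pstart, Ptrans, Pout):
    N, T = Ptrans.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = Pstart * Pout[:, obs[0]]
    for t in range(1, T):
        # alpha_t(k) = Pout(k, X_t) * sum_k' alpha_{t-1}(k') Ptrans(k', k)
        alpha[t] = Pout[:, obs[t]] * (alpha[t - 1] @ Ptrans)
    return alpha                      # P(X0..Xn) = alpha[-1].sum()

def backward(obs, Ptrans, Pout):
    N, T = Ptrans.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                 # base case for the last position
    for t in range(T - 2, -1, -1):
        # beta_t(k) = sum_k' Ptrans(k, k') Pout(k', X_{t+1}) beta_{t+1}(k')
        beta[t] = Ptrans @ (Pout[:, obs[t + 1]] * beta[t + 1])
    return beta
```

The identity P(X0,…,Xn) = Σk αt(k) · βt(k) gives a handy sanity check: (alpha * beta).sum(axis=1) should come out the same for every t.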
Estimating Parameters
• The probability Ψij(t) of traversing the edge from i to j at time t, given the observation sequence X:

Ψij(t) = αt(i) · Ptrans(i, j) · Pout(j, Xt+1) · βt+1(j) / P(X0,…,Xn)

Expected number of transitions:

Eij = Σt Ψij(t)

And Eij / (Σk Eik) is the newly used transition probability on the edge from i to j!
Similarly for output probabilities.
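Putting the pieces together, a sketch of one transition re-estimation step (single training sequence; reuses the forward/backward sketches above, and assumes every state is visited):

```python
import numpy as np

def update_transitions(obs, Ptrans, Pout, alpha, beta):
    T, N = alpha.shape
    total = alpha[-1].sum()           # P(X0..Xn)
    E = np.zeros((N, N))              # E[i, j]: expected i -> j transitions
    for t in range(T - 1):
        # psi[i, j] = alpha_t(i) Ptrans(i,j) Pout(j, X_{t+1}) beta_{t+1}(j) / P
        psi = (alpha[t][:, None] * Ptrans *
               (Pout[:, obs[t + 1]] * beta[t + 1])[None, :]) / total
        E += psi
    return E / E.sum(axis=1, keepdims=True)   # new Ptrans: Eij / sum_k Eik
```

Iterating this update (together with the analogous emission update) until convergence is exactly the Baum-Welch loop described above.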
Accuracy
• Tested experimentally
• Reaches 96-98% for the Brown corpus
– Trained on one half and tested on the other half
• Compare with the 80% of the trivial algorithm
• The hard cases are few, but they are very hard…
NLTK
• http://www.nltk.org/
• Natural Language Toolkit
• Open-source Python modules for NLP tasks
– Including stemming, POS tagging and much more
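A short usage sketch (the tagger models must be downloaded first, and the exact tags shown are illustrative; NLTK also ships an HMM tagger in nltk.tag.hmm):

```python
import nltk
# One-time setup, e.g.:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

words = nltk.word_tokenize("The sailor dogs me every day")
print(nltk.pos_tag(words))
# e.g. [('The', 'DT'), ('sailor', 'NN'), ('dogs', 'VBZ'), ('me', 'PRP'), ...]
```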
PCFG
• A PCFG is a tuple (N, Σ, N1, R, P):
– N is a set of non-terminals {Ni}
– Σ is a set of terminals {w^k}
– N1 is the start symbol
– R is a set of rules
– P is the set of probabilities on rules
• We assume the PCFG is in Chomsky Normal Form
• Parsing algorithms:
– Earley (top-down)
– CYK (bottom-up)
– …
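As a toy illustration (the grammar and its probabilities are invented), NLTK can represent a PCFG directly, and its ViterbiParser computes the most probable parse:

```python
import nltk

grammar = nltk.PCFG.fromstring("""
    S  -> NP VP    [1.0]
    NP -> DT NN    [0.7]
    NP -> 'he'     [0.3]
    VP -> VB NP    [1.0]
    DT -> 'the'    [1.0]
    NN -> 'can'    [0.6]
    NN -> 'fridge' [0.4]
    VB -> 'sees'   [1.0]
""")
parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("he sees the can".split()):
    print(tree)   # the most probable parse, annotated with its probability
```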
PFSA vs. PCFG
• A PFSA can be seen as a special case of PCFG:
– State ↔ non-terminal
– Output symbol ↔ terminal
– Arc ↔ context-free rule
– Path ↔ parse tree (only right-branching binary trees)

[Figure: the PFSA path S1 →a S2 →b S3 and the equivalent right-branching parse tree, with rules S1 → a S2, S2 → b S3, S3 → ε]
PFSA and HMM
Add a “Start” state and a transition from “Start” to every state in the HMM.
Add a “Finish” state and a transition from every state in the HMM to “Finish”.

[Figure: an HMM with the added Start and Finish states]
The connection between two algorithms
• An HMM can (almost) be converted to a PFSA.
• A PFSA is a special case of PCFG.
• Inside-outside is an algorithm for PCFGs.
⇒ The inside-outside algorithm will work for HMMs.
• Forward-backward is an algorithm for HMMs.
In fact, the inside-outside algorithm is the same as forward-backward when the PCFG is a PFSA.
Forward and backward probabilities
[Figure: an HMM trellis X1 … Xt … Xn Xn+1 with outputs O1 … Ot−1 Ot … On. The forward probability αt(i) covers the outputs up to time t with Xt = i; the backward probability βt(i) covers the remaining outputs given Xt = i.]
Backward/forward prob vs. Inside/outside prob
[Figure: for a PFSA viewed as a PCFG, the forward probability αt(i) corresponds to the outside probability αj(p,q), and the backward probability βt(i) corresponds to the inside probability βj(p,q).]

PFSA: Forward / Backward
PCFG: Outside / Inside

Notation
[Figure: N1 generates the sentence w1…wm; the outside probability αj(p,q) covers everything outside wp…wq, and the inside probability βj(p,q) covers Nj generating wp…wq.]
Inside and outside probabilities
Definitions
• Inside probability: the total probability of generating the words wp…wq from the non-terminal Nj:

βj(p,q) = P(wpq | Nj_pq)

• Outside probability: the total probability of beginning with the start symbol N1 and generating Nj_pq and all the words outside wp…wq:

αj(p,q) = P(w1(p−1), Nj_pq, w(q+1)m)

• When p > q: αj(p,q) = βj(p,q) = 0

(Nj_pq denotes the non-terminal Nj spanning the words wp…wq.)
Calculating inside probability (CYK algorithm)
• Base case: βj(k,k) = P(Nj → wk)
• Recursive case:

βj(p,q) = Σr,s Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q)

[Figure: Nj dominates wp…wq, split at d into Nr over wp…wd and Ns over wd+1…wq]
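A sketch of this bottom-up pass in Python (0-indexed spans; the rule-table representation, with binary_rules[(j, r, s)] = P(Nj → Nr Ns) and lexical[(j, word)] = P(Nj → word), is an illustrative choice):

```python
from collections import defaultdict

def inside(words, binary_rules, lexical, nonterminals):
    """Inside pass for a CNF PCFG; beta[(j, p, q)] = beta_j(p, q)."""
    m = len(words)
    beta = defaultdict(float)
    # Base case: beta_j(k, k) = P(N_j -> w_k)
    for k in range(m):
        for j in nonterminals:
            beta[(j, k, k)] = lexical.get((j, words[k]), 0.0)
    # Recursive case, shorter spans first.
    for span in range(1, m):
        for p in range(m - span):
            q = p + span
            for (j, r, s), prob in binary_rules.items():
                total = sum(beta[(r, p, d)] * beta[(s, d + 1, q)]
                            for d in range(p, q))
                beta[(j, p, q)] += prob * total
    return beta   # sentence probability: beta[(start, 0, m - 1)]
```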
Calculating outside probability (case 1)
αj(p,q) = Σf,g Σe=q+1..m αf(p,e) · P(Nf → Nj Ng) · βg(q+1,e)

[Figure: N1 generates w1…wm; Nf spans wp…we and rewrites as Nj Ng, with Nj over wp…wq and Ng over wq+1…we]
Calculating outside probability (case 2)
αj(p,q) = Σf,g Σe=1..p−1 αf(e,q) · P(Nf → Ng Nj) · βg(e,p−1)

[Figure: N1 generates w1…wm; Nf spans we…wq and rewrites as Ng Nj, with Ng over we…wp−1 and Nj over wp…wq]
Outside probability
Combining the two cases:

αj(p,q) = Σf,g Σe=q+1..m αf(p,e) · P(Nf → Nj Ng) · βg(q+1,e)
        + Σf,g Σe=1..p−1 αf(e,q) · P(Nf → Ng Nj) · βg(e,p−1)

• Base case: α1(1,m) = 1, and αj(1,m) = 0 for j ≠ 1
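A matching sketch of the top-down outside pass, reusing the inside chart; it pushes each parent's outside mass down to its two children (same illustrative representation as above):

```python
from collections import defaultdict

def outside(m, binary_rules, beta, start):
    """Outside pass; alpha[(j, p, q)] = alpha_j(p, q), using the inside chart beta."""
    alpha = defaultdict(float)
    alpha[(start, 0, m - 1)] = 1.0        # base case: alpha_1(1, m) = 1
    # Process parent spans from longest to shortest, pushing probability down.
    for span in range(m - 1, 0, -1):
        for p in range(m - span):
            q = p + span
            for (f, j, g), prob in binary_rules.items():
                a = alpha[(f, p, q)]
                if a == 0.0:
                    continue
                for d in range(p, q):
                    # Left child N_j spans (p, d); right child N_g spans (d+1, q).
                    alpha[(j, p, d)] += a * prob * beta[(g, d + 1, q)]
                    alpha[(g, d + 1, q)] += a * prob * beta[(j, p, d)]
    return alpha
```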
Probability of a sentence
P(w1m) = β1(1,m)
P(w1m) = Σj αj(k,k) · P(Nj → wk), for any k
P(w1m, Nj_pq) = αj(p,q) · βj(p,q)
Recap so far
• Inside probability: bottom-up
• Outside probability: top-down using the same chart.
• Probability of a sentence can be calculated in many ways.
Expected counts and update formulae
The probability that a binary rule is used:

P(Nj → Nr Ns, used for span (p,q) | w1m)
  = αj(p,q) · Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q) / P(w1m)

Summing over all spans:

P(Nj → Nr Ns | w1m)
  = Σp=1..m Σq=p+1..m αj(p,q) · Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q) / P(w1m)    (1)
The probability that Nj is used:

P(w1m, Nj_pq) = αj(p,q) · βj(p,q)

P(Nj used for span (p,q) | w1m) = αj(p,q) · βj(p,q) / P(w1m)

P(Nj | w1m) = Σp=1..m Σq=p..m αj(p,q) · βj(p,q) / P(w1m)    (2)
Dividing (1) by (2) gives the re-estimated rule probability:

P(Nj → Nr Ns | Nj used, w1m) = (1) / (2)
  = Σp Σq αj(p,q) · Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q)
    / Σp Σq αj(p,q) · βj(p,q)
The probability that a unary rule is used:

P(Nj → w^k, Nj used | w1m) = Σh: wh = w^k αj(h,h) · βj(h,h) / P(w1m)    (3)

Dividing (3) by (2):

P(Nj → w^k | Nj used, w1m) = (3) / (2)
  = Σh: wh = w^k αj(h,h) · βj(h,h) / Σp Σq αj(p,q) · βj(p,q)
Multiple training sentences
For each training sentence Wi, define:

hi(j) = P(Nj | Wi) = Σp Σq αj(p,q) · βj(p,q) / P(Wi)

fi(p,q,j,r,s) = P(Nj → Nr Ns, used for span (p,q) | Wi)
  = αj(p,q) · Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q) / P(Wi)

The re-estimated rule probability pools the expected counts over all sentences:

P(Nj → Nr Ns) = Σi Σp Σq fi(p,q,j,r,s) / Σi hi(j)
Inner loop of the Inside-outside algorithm
Given an input sequence and the current parameters:

1. Calculate the inside probabilities:
• Base case: βj(k,k) = P(Nj → wk)
• Recursive case:
βj(p,q) = Σr,s Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q)

2. Calculate the outside probabilities:
• Base case: α1(1,m) = 1, and αj(1,m) = 0 for j ≠ 1
• Recursive case:
αj(p,q) = Σf,g Σe=q+1..m αf(p,e) · P(Nf → Nj Ng) · βg(q+1,e)
        + Σf,g Σe=1..p−1 αf(e,q) · P(Nf → Ng Nj) · βg(e,p−1)
Inside-outside algorithm (cont)
3. Collect the counts, using (1) and (3):

Cnt(Nj → Nr Ns) = P(Nj → Nr Ns | w1m)
  = Σp Σq αj(p,q) · Σd=p..q−1 P(Nj → Nr Ns) · βr(p,d) · βs(d+1,q) / P(w1m)

Cnt(Nj → w^k) = P(Nj → w^k, Nj used | w1m)
  = Σh: wh = w^k αj(h,h) · βj(h,h) / P(w1m)

4. Normalize and update the parameters:
P(Nj → Nr Ns) = Cnt(Nj → Nr Ns) / Σr’,s’ Cnt(Nj → Nr’ Ns’)

P(Nj → w^k) = Cnt(Nj → w^k) / Σk’ Cnt(Nj → w^k’)
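The normalization in step 4 is just a per-left-hand-side renormalization of the expected counts; a minimal sketch (the rule table mapping (lhs, rhs) to an expected count is an illustrative representation):

```python
from collections import defaultdict

def normalize(counts):
    """Turn expected rule counts into probabilities, per left-hand side."""
    totals = defaultdict(float)
    for (lhs, rhs), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}
```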
Relation to EM
• PCFG is a PM (Product of Multinomials) model
• The inside-outside algorithm is a special case of the EM algorithm for PM models
• X (observed data): each data point is a sentence w1m
• Y (hidden data): the parse tree Tr
• Θ (parameters): the rule probabilities P(Nj → Nr Ns) and P(Nj → w^k)
Relation to EM (cont)
The expected counts of the E-step are exactly the quantities computed by inside-outside:

count(Nj → Nr Ns) = ΣY P(Y | X, Θ) · count(X, Y, Nj → Nr Ns)
  = ΣTr P(Tr | w1m) · count(Tr, w1m, Nj → Nr Ns)
  = Σp Σq P(Nj → Nr Ns, used for span (p,q) | w1m, Θ)

count(Nj → w^k) = ΣY P(Y | X, Θ) · count(X, Y, Nj → w^k)
  = ΣTr P(Tr | w1m) · count(Tr, w1m, Nj → w^k)
  = Σh P(Nj → w^k, Nj used at position h | w1m, Θ)
Summary
[Figure: the HMM trellis (left) and a PCFG subtree (right) side by side]

HMM: aij = P(Xt+1 = j | Xt = i), bijk = P(Ot = w^k | Xt = i, Xt+1 = j)
PCFG: outside probability αj(p,q), inside probability βj(p,q), rule probabilities P(Nj → Nr Ns) and P(Nj → w^k)
Summary (cont)
• Topology is known:– (states, arcs, output symbols) in HMM– (non-terminals, rules, terminals) in PCFG
• Probabilities of arcs/rules are unknown.
• Estimating probs using EM (introducing hidden data Y)
Converting HMM to PCFG
• Given an HMM = (S, Σ, π, A, B), create a PCFG = (S1, Σ1, S0, R, P) as follows:
– S1 = {Ni} ∪ {Dij | i,j ∈ [1,N]} ∪ {Start, D0}
– Σ1 = Σ ∪ {BOS, EOS}
– S0 = Start
– R = {Start → D0 Ni} ∪ {Ni → Dij Nj | i,j ∈ [1,N]} ∪ {D0 → BOS} ∪ {Dij → w^k} ∪ {Ni → EOS}
– P:
P(Start → D0 Ni) = πi
P(Ni → Dij Nj) = aij
P(Dij → w^k) = bijk
P(D0 → BOS) = 1
P(Ni → EOS) = 1
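A sketch of this construction in code (pi, A, B are the HMM parameters π, aij, bijk; the dictionary rule table is an illustrative representation):

```python
def hmm_to_pcfg(pi, A, B, vocab):
    """Build the PCFG rule table from HMM parameters, mirroring the slide."""
    N = len(pi)
    rules = {}                                   # (lhs, rhs-tuple) -> probability
    rules[("D0", ("BOS",))] = 1.0                # P(D0 -> BOS) = 1
    for i in range(N):
        rules[("Start", ("D0", f"N{i}"))] = pi[i]            # P(Start -> D0 Ni) = pi_i
        rules[(f"N{i}", ("EOS",))] = 1.0                     # P(Ni -> EOS) = 1
        for j in range(N):
            rules[(f"N{i}", (f"D{i}{j}", f"N{j}"))] = A[i][j]    # a_ij
            for k, w in enumerate(vocab):
                rules[(f"D{i}{j}", (w,))] = B[i][j][k]           # b_ijk
    return rules
```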
Path ↔ Parse tree

[Figure: the HMM path X1 X2 … XT XT+1 emitting o1 o2 … oT corresponds to the right-branching parse tree Start → D0 X1, D0 → BOS, X1 → D12 X2, D12 → o1, …, XT → DT,T+1 XT+1, DT,T+1 → oT, XT+1 → EOS]
Outside probability
[Figure: outside probabilities in the converted PCFG. The outside probability of Ni is non-zero only for spans ending at T, where αi(t,T) corresponds to the forward probability αt(i) (renaming (j,i), (p,t)). The outside probability of Dji spans a single position (q = p) and is built from the forward probability αt−1(j) and the arc probability aji.]
Inside probability
[Figure: inside probabilities in the converted PCFG. The inside probability of Ni over a span (t,T) corresponds to the backward probability βt(i), and the inside probability of Dij over a single position t is the arc emission probability b_ij,ot.]
Estimating P(Ni → Dij Nj, used at time t | o1T)

Start from formula (1) and rename (j,i), (s,j), (p,t), (m,T). In the converted grammar only the split d = t and the span end q = T survive, and βDij(t,t) = b_ij,ot, so:

P(Ni → Dij Nj, used at time t | o1T) = αt(i) · aij · b_ij,ot · βt+1(j) / P(o1T)

This is exactly the edge probability Ψij(t) of the forward-backward algorithm.
Estimating P(Ni → Dij Nj | o1T)

With the same renaming (j,i), (s,j), (p,t), (m,T), the summed form of (1) becomes:

P(Ni → Dij Nj | o1T) = Σt P(Ni → Dij Nj, used at time t | o1T) = Σt Ψij(t)
Estimating P(Ni | O1T)

Renaming (j,i), (s,j), (p,t), (m,T) in formula (2):

P(Ni | O1T) = Σt αt(i) · βt(i) / P(O1T)
Calculating P(Ni → Dij Nj | Ni, O1T)

P(Nj → Nr Ns | Nj, w1m) = P(Nj → Nr Ns | w1m) / P(Nj | w1m)

Renaming (j,i), (s,j), (w,o), (m,T):

P(Ni → Dij Nj | Ni, O1T) = Σt Ψij(t) / (Σt αt(i) · βt(i) / P(O1T)) = âij

This is exactly the Baum-Welch re-estimate of the transition probability aij.
Calculating P(Dij → w^k | Dij, O1T)

Start from the unary-rule re-estimate (3)/(2):

P(Nj → w^k | Nj, w1m) = Σh: wh = w^k αj(h,h) · βj(h,h) / Σp Σq αj(p,q) · βj(p,q)

Renaming (j, i_j), (s,j), (p,t), (h,t), (m,T), (w,O), (N,D), with αDij(t,t) = αt(i) · aij · βt+1(j) and βDij(t,t) = b_ij,Ot:

P(Dij → w^k | Dij, O1T)
  = Σt: Ot = w^k αt(i) · aij · b_ij,Ot · βt+1(j) / Σt αt(i) · aij · b_ij,Ot · βt+1(j)
  = b̂ijk

So the inside-outside re-estimates coincide exactly with the forward-backward (Baum-Welch) re-estimates of aij and bijk.