1
HMM (I)
LING 570
Fei Xia
Week 7: 11/5-11/7/07
2
HMM
• Definition and properties of HMM
  – Two types of HMM
• Three basic questions in HMM
3
Definition of HMM
4
Hidden Markov Models

• There are n states s1, …, sn in an HMM, and the states are connected.
• The output symbols are produced by the states or edges in the HMM.
• An observation O = (o1, …, oT) is a sequence of output symbols.
• Given an observation, we want to recover the hidden state sequence.
• An example: POS tagging
  – States are POS tags
  – Output symbols are words
  – Given an observation (i.e., a sentence), we want to discover the tag sequence.
5
Same observation, different state sequences
N     V      P     DT   N
time  flies  like  an   arrow

N     N      V     DT   N
time  flies  like  an   arrow
6
Two types of HMMs
• State-emission HMM (Moore machine):
  – The output symbol is produced by the states:
    • by the from-state
    • by the to-state
• Arc-emission HMM (Mealy machine):
  – The output symbol is produced by the edges, i.e., by the (from-state, to-state) pairs.
7
PFA recap
8
Formal definition of PFA
A PFA is a tuple (Q, Σ, δ, I, F, P):
• Q: a finite set of N states
• Σ: a finite set of input symbols
• δ ⊆ Q × (Σ ∪ {ε}) × Q: the transition relation between states
• I: Q → R+ (initial-state probabilities)
• F: Q → R+ (final-state probabilities)
• P: δ → R+ (transition probabilities)
9
Constraints on the functions:

\sum_{q \in Q} I(q) = 1

\forall q \in Q: \; F(q) + \sum_{a \in \Sigma \cup \{\epsilon\},\, q' \in Q} P(q, a, q') = 1

Probability of a string:

P(w_{1,n}, q_{1,n+1}) = I(q_1) \cdot F(q_{n+1}) \cdot \prod_{i=1}^{n} P(q_i, w_i, q_{i+1})

P(w_{1,n}) = \sum_{q_{1,n+1}} P(w_{1,n}, q_{1,n+1})
10
An example of PFA
[Figure: a two-state PFA. An arc labeled a:1.0 goes from q0 to q1, and q1 has a self-loop labeled b:0.8.]

I(q0) = 1.0    I(q1) = 0.0
F(q0) = 0      F(q1) = 0.2

P(ab^n) = I(q0) * P(q0, ab^n, q1) * F(q1)
        = 1.0 * 1.0 * 0.8^n * 0.2

\sum_{x} P(x) = \sum_{n \ge 0} P(ab^n) = \sum_{n \ge 0} 0.2 * 0.8^n = \frac{0.2}{1 - 0.8} = 1
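This example can be checked numerically; the following is a minimal sketch (Python, my own helper name p_ab_n, not part of the slides) that evaluates P(ab^n) and confirms that the truncated sum of string probabilities approaches 1.

# Minimal sketch: string probabilities in the example PFA above.
def p_ab_n(n: int) -> float:
    """P(a b^n) = I(q0) * 1.0 * 0.8^n * F(q1)."""
    return 1.0 * 1.0 * (0.8 ** n) * 0.2

# The total mass is a geometric series: sum_n 0.2 * 0.8^n = 0.2 / (1 - 0.8) = 1.
total = sum(p_ab_n(n) for n in range(200))     # truncated sum, converges to 1
print(p_ab_n(0), p_ab_n(3), round(total, 6))   # 0.2  0.1024  1.0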
11
Arc-emission HMM
12
Definition of arc-emission HMM
• An HMM is a tuple (S, Σ, π, A, B):
– A set of states S = {s1, s2, …, sN}.
– A set of output symbols Σ = {w1, …, wM}.
– Initial state probabilities π = {π_i}.
– Transition prob: A = {a_ij}.
– Emission prob: B = {b_ijk}.
13
Constraints in an arc-emission HMM
\sum_{i=1}^{N} \pi_i = 1

\forall i: \sum_{j=1}^{N} a_{ij} = 1

\forall i, j: \sum_{k=1}^{M} b_{ijk} = 1

For any integer n and any HMM:

\sum_{O: |O| = n} P(O \mid HMM) = 1
14
An example: HMM structure
[Figure: N states s1, s2, …, sN connected by arcs; each arc emits output symbols (e.g., w1, …, w5).]

Same kinds of parameters, but the emission probabilities depend on both states: P(w_k | s_i, s_j).

# of parameters: O(N^2 M + N^2).
15
A path in an arc-emission HMM
[Figure: a path X1 → X2 → … → Xn → Xn+1, where the arc from X_i to X_{i+1} emits o_i.]

State sequence: X_{1,n+1}
Output sequence: O_{1,n}

P(O_{1,n}, X_{1,n+1}) = \pi_{X_1} \prod_{i=1}^{n} P(X_{i+1} \mid X_i)\, P(o_i \mid X_i, X_{i+1})

P(O_{1,n}) = \sum_{X_{1,n+1}} P(O_{1,n}, X_{1,n+1})
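As a concrete illustration of this product, here is a minimal sketch (my own function and parameter layout, not from the slides) that multiplies out π, the transition probabilities, and the arc-emission probabilities along one given path; pi, A, and B are hypothetical dictionaries of trained parameters.

# Minimal sketch: P(O_{1,n}, X_{1,n+1}) for one concrete path in an arc-emission HMM.
# pi[s], A[s][s2], and B[(s, s2)][w] are hypothetical parameter tables.
def joint_prob(states, symbols, pi, A, B):
    assert len(states) == len(symbols) + 1        # n+1 states emit n symbols
    prob = pi[states[0]]
    for i, w in enumerate(symbols):
        s_from, s_to = states[i], states[i + 1]
        prob *= A[s_from][s_to] * B[(s_from, s_to)][w]
    return prob

Summing joint_prob over all state sequences of length n+1 gives P(O_{1,n}), which is what the forward algorithm later computes efficiently.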
16
PFA vs. Arc-emission HMM

A PFA is a tuple (Q, Σ, δ, I, F, P):
• Q: a finite set of N states
• Σ: a finite set of input symbols
• δ ⊆ Q × (Σ ∪ {ε}) × Q: the transition relation between states
• I: Q → R+ (initial-state probabilities)
• F: Q → R+ (final-state probabilities)
• P: δ → R+ (transition probabilities)

An HMM is a tuple (S, Σ, π, A, B):
– A set of states S = {s1, s2, …, sN}
– A set of output symbols Σ = {w1, …, wM}
– Initial state probabilities π = {π_i}
– Transition prob: A = {a_ij}
– Emission prob: B = {b_ijk}
17
State-emission HMM
18
Definition of state-emission HMM
• An HMM is a tuple (S, Σ, π, A, B):
– A set of states S = {s1, s2, …, sN}.
– A set of output symbols Σ = {w1, …, wM}.
– Initial state probabilities π = {π_i}.
– Transition prob: A = {a_ij}.
– Emission prob: B = {b_jk}.
• We use s_i and w_k to refer to what is in the HMM structure.
• We use X_i and O_i to refer to what is in a particular HMM path and its output.
19
Constraints in a state-emission HMM
\sum_{i=1}^{N} \pi_i = 1

\forall i: \sum_{j=1}^{N} a_{ij} = 1

\forall j: \sum_{k=1}^{M} b_{jk} = 1

For any integer n and any HMM:

\sum_{O: |O| = n} P(O \mid HMM) = 1
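These constraints are what slide 43's "check constraints and print out a warning" refers to. A minimal sketch of such a check (Python, my own function name, assuming π, A, and B are stored as dicts keyed by state and symbol strings):

# Minimal sketch: warn if the state-emission HMM constraints above are violated.
# pi is {state: prob}; A and B are nested dicts keyed by state (and symbol).
EPS = 1e-6

def check_hmm(pi, A, B):
    if abs(sum(pi.values()) - 1.0) > EPS:
        print("warning: initial state probabilities do not sum to 1")
    for i, row in A.items():
        if abs(sum(row.values()) - 1.0) > EPS:
            print(f"warning: transition probs out of state {i} do not sum to 1")
    for j, row in B.items():
        if abs(sum(row.values()) - 1.0) > EPS:
            print(f"warning: emission probs of state {j} do not sum to 1")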
20
An example: the HMM structure
[Figure: N states s1, s2, …, sN connected by arcs; each state emits output symbols (e.g., w1, w2, w3, w5).]

Two kinds of parameters:
• Transition probability: P(s_j | s_i)
• Emission probability: P(w_k | s_i)

# of parameters: O(NM + N^2)
21
Output symbols are generated by the from-states
• State sequence: X_{1,n}
• Output sequence: O_{1,n}

[Figure: a path X1 → X2 → … → Xn, where state X_i emits o_i.]

P(O_{1,n}, X_{1,n}) = \pi_{X_1} \left( \prod_{i=1}^{n-1} P(X_{i+1} \mid X_i) \right) \left( \prod_{i=1}^{n} P(o_i \mid X_i) \right)

P(O_{1,n}) = \sum_{X_{1,n}} P(O_{1,n}, X_{1,n})
22
Output symbols are generated by the to-states
• State sequence: X_{1,n+1}
• Output sequence: O_{1,n}

[Figure: a path X1 → X2 → … → Xn+1, where state X_{i+1} emits o_i.]

P(O_{1,n}, X_{1,n+1}) = \pi_{X_1} \prod_{i=1}^{n} P(X_{i+1} \mid X_i)\, P(o_i \mid X_{i+1})

P(O_{1,n}) = \sum_{X_{1,n+1}} P(O_{1,n}, X_{1,n+1})
23
A path in a state-emission HMM
Output symbols are produced by the from-states:
[Figure: a path X1 → X2 → … → Xn, where state X_i emits o_i.]

Output symbols are produced by the to-states:
[Figure: a path X1 → X2 → … → Xn+1, where state X_{i+1} emits o_i.]
24
Arc-emission vs. state-emission
[Figure: a state-emission path (states emit the output symbols) next to an arc-emission path (arcs emit the output symbols).]
25
Properties of HMM
• Markov assumption (Limited horizon):

  P(X_{t+1} \mid X_1, \ldots, X_t) = P(X_{t+1} \mid X_t)

• Stationary distribution (Time invariance): the probabilities do not change over time:

  P(X_{t+1} \mid X_t) = P(X_{t+m+1} \mid X_{t+m})

• The states are hidden because we know the structure of the machine (i.e., S and Σ), but we don’t know which state sequences generate a particular output.
26
Are the two types of HMMs equivalent?
• For each state-emission HMM1, there is an arc-emission HMM2, such that for any sequence O, P(O|HMM1)=P(O|HMM2).
• The reverse is also true.
• How to prove that?
27
Applications of HMM
• N-gram POS tagging
  – Bigram tagger: o_i is a word, and s_i is a POS tag.
• Other tagging problems:
  – Word segmentation
  – Chunking
  – NE tagging
  – Punctuation prediction
  – …
• Other applications: ASR, ….
28
Three HMM questions
29
Three fundamental questions for HMMs
• Training an HMM: given a set of observation sequences, learn the model’s distribution, i.e., the transition and emission probabilities
• HMM as a parser: Finding the best state sequence for a given observation
• HMM as an LM: compute the probability of a given observation
30
Training an HMM: estimating the probabilities
• Supervised learning:
  – The state sequences in the training data are known
  – ML estimation (see the sketch below)
• Unsupervised learning:
  – The state sequences in the training data are unknown
  – Forward-backward algorithm
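For the supervised case, ML estimation reduces to relative-frequency counting over the tagged training data. A minimal sketch, assuming the training data is a list of (tags, words) pairs where tags[0] is BOS (the function name and data layout are mine, not the Hw interface):

from collections import defaultdict

# Minimal sketch: supervised ML estimation for a bigram (state-emission) HMM.
# data is a list of (tags, words) pairs, e.g. (["BOS", "N", "V"], ["time", "flies"]);
# tags[0] is assumed to be BOS, so tags[1:] aligns with words.
def mle_train(data):
    trans = defaultdict(lambda: defaultdict(int))    # trans[prev][cur] = count
    emit = defaultdict(lambda: defaultdict(int))     # emit[tag][word]  = count
    for tags, words in data:
        for prev, cur in zip(tags, tags[1:]):
            trans[prev][cur] += 1
        for tag, word in zip(tags[1:], words):
            emit[tag][word] += 1
    # relative frequencies: P(cur | prev) and P(word | tag)
    A = {p: {c: n / sum(row.values()) for c, n in row.items()} for p, row in trans.items()}
    B = {t: {w: n / sum(row.values()) for w, n in row.items()} for t, row in emit.items()}
    return A, B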
31
HMM as a parser
32
HMM as a parser: Finding the best state sequence
• Given the observation O1,T=o1…oT, find the state sequence X1,T+1=X1 … XT+1 that maximizes P(X1,T+1 | O1,T).
Viterbi algorithm
[Figure: states X1, X2, …, XT+1 with outputs o1, o2, …, oT.]
33
“time flies like an arrow”

\init
BOS 1.0

\transition
BOS N 0.5
BOS DT 0.4
BOS V 0.1
DT N 1.0
N N 0.2
N V 0.7
N P 0.1
V DT 0.4
V N 0.4
V P 0.1
V V 0.1
P DT 0.6
P N 0.4

\emission
N time 0.1
V time 0.1
N flies 0.1
V flies 0.2
V like 0.2
P like 0.1
DT an 0.3
N arrow 0.1
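One way to read a file in this format is to switch on the \init / \transition / \emission headers. The sketch below is my own and assumes one entry per line, so check it against the actual Hw7 file specification.

# Minimal sketch: read an HMM file with \init, \transition, and \emission sections.
def read_hmm(path):
    pi, A, B = {}, {}, {}
    section = None
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            if fields[0].startswith("\\"):            # \init, \transition, or \emission
                section = fields[0]
                fields = fields[1:]                   # a header may share its line with an entry
                if not fields:
                    continue
            if section == "\\init":
                state, prob = fields
                pi[state] = float(prob)
            elif section == "\\transition":
                s_from, s_to, prob = fields
                A.setdefault(s_from, {})[s_to] = float(prob)
            elif section == "\\emission":
                state, symbol, prob = fields
                B.setdefault(state, {})[symbol] = float(prob)
    return pi, A, B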
34
Finding all the paths: to build the trellis
[Figure: the trellis for “time flies like an arrow” — a BOS start node followed by one column of candidate tags {N, V, P, DT} per word.]
35
Finding all the paths (cont)
[Figure: the same trellis as on the previous slide.]
36
Viterbi algorithm
The probability of the best path that produces O_{1,t-1} while ending up in state s_j:

\delta_j(t) \stackrel{def}{=} \max_{X_{1,t-1}} P(O_{1,t-1}, X_{1,t-1}, X_t = s_j)

Initialization: \delta_j(1) = \pi_j

Induction: \delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{ij o_t}

Modify it to allow ε-emission.
37
Proof of the recursive function
\delta_j(t+1) = \max_{X_{1,t}} P(O_{1,t}, X_{1,t}, X_{t+1} = s_j)

= \max_i \max_{X_{1,t-1}} P(O_{1,t-1}, o_t, X_{1,t-1}, X_t = s_i, X_{t+1} = s_j)

= \max_i \max_{X_{1,t-1}} P(O_{1,t-1}, X_{1,t-1}, X_t = s_i)\, P(o_t, X_{t+1} = s_j \mid O_{1,t-1}, X_{1,t-1}, X_t = s_i)

= \max_i \max_{X_{1,t-1}} P(O_{1,t-1}, X_{1,t-1}, X_t = s_i)\, a_{ij}\, b_{ij o_t}

= \max_i \left( a_{ij}\, b_{ij o_t} \max_{X_{1,t-1}} P(O_{1,t-1}, X_{1,t-1}, X_t = s_i) \right)

= \max_i \delta_i(t)\, a_{ij}\, b_{ij o_t}
38
Viterbi algorithm: calculating δ_j(t)

# N is the number of states in the HMM structure
# observ is the observation O, and leng is the length of observ

Initialize viterbi[0..leng][0..N-1] to 0
for each state j
    viterbi[0][j] = π[j]
    back-pointer[0][j] = -1              # dummy

for (t=0; t<leng; t++)
    for (j=0; j<N; j++)
        k = observ[t]                    # the symbol at time t
        viterbi[t+1][j] = max_i viterbi[t][i] * a_ij * b_jk
        back-pointer[t+1][j] = arg max_i viterbi[t][i] * a_ij * b_jk
39
Viterbi algorithm: retrieving the best path
# find the best path
best_final_state = arg max_j viterbi[leng][j]

# start with the last state in the sequence
j = best_final_state
push(arr, j)

for (t=leng; t>0; t--)
    i = back-pointer[t][j]
    push(arr, i)
    j = i

return reverse(arr)
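Putting slides 38 and 39 together, here is a runnable Python sketch of the same procedure over string-keyed parameter dicts; the function signature and variable names are my own, and the real Hw8 interface may differ (it also ignores zero-probability corner cases and uses raw probabilities rather than logprobs).

# Minimal sketch: Viterbi decoding for a state-emission HMM.
# pi[s], A[s][s2], B[s][w] are probability dicts; missing entries count as 0.
def viterbi(observ, states, pi, A, B):
    # delta[t][s]: prob of the best path producing observ[:t] and ending in state s
    delta = [{s: pi.get(s, 0.0) for s in states}]
    backptr = [{s: None for s in states}]
    for t, w in enumerate(observ):
        delta.append({})
        backptr.append({})
        for j in states:
            best_i, best_p = None, 0.0
            for i in states:
                p = delta[t][i] * A.get(i, {}).get(j, 0.0) * B.get(j, {}).get(w, 0.0)
                if p > best_p:
                    best_i, best_p = i, p
            delta[t + 1][j] = best_p
            backptr[t + 1][j] = best_i
    # follow the back-pointers from the best final state
    j = max(states, key=lambda s: delta[-1][s])
    best_prob = delta[-1][j]
    path = [j]
    for t in range(len(observ), 0, -1):
        j = backptr[t][j]
        path.append(j)
    return list(reversed(path)), best_prob

With the example grammar from slide 33, a call might look like viterbi("time flies like an arrow".split(), ["BOS", "N", "V", "P", "DT"], pi, A, B).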
40
Hw7 and Hw8
• Hw7: write an HMM “class”:
  – Read HMM input file
  – Output HMM
• Hw8: implement the algorithms for two HMM tasks:
  – HMM as parser: Viterbi algorithm
  – HMM as LM: the prob of an observation
41
Implementation issue: storing the HMM

Approach #1:
• π_i: pi{state_str}
• a_ij: a{from_state_str}{to_state_str}
• b_jk: b{state_str}{symbol}

Approach #2:
• state2idx{state_str} = state_idx
• symbol2idx{symbol_str} = symbol_idx
• π_i: pi[state_idx] = prob
• a_ij: a[from_state_idx][to_state_idx] = prob
• b_jk: b[state_idx][symbol_idx] = prob
• idx2state[state_idx] = state_str
• idx2symbol[symbol_idx] = symbol_str
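A minimal Python sketch of Approach #2, using dicts and lists for the maps and arrays (the concrete states, symbols, and values come from the example grammar on slide 33; everything else is my own naming):

# Minimal sketch: map state/symbol strings to integer indices (Approach #2).
states = ["BOS", "N", "V", "P", "DT"]
symbols = ["time", "flies", "like", "an", "arrow"]

state2idx = {s: i for i, s in enumerate(states)}      # state_str  -> state_idx
symbol2idx = {w: k for k, w in enumerate(symbols)}    # symbol_str -> symbol_idx
idx2state, idx2symbol = states, symbols               # idx -> str (just the lists)

N, M = len(states), len(symbols)
pi = [0.0] * N                                        # pi[state_idx] = prob
a = [[0.0] * N for _ in range(N)]                     # a[from_idx][to_idx] = prob
b = [[0.0] * M for _ in range(N)]                     # b[state_idx][symbol_idx] = prob

pi[state2idx["BOS"]] = 1.0
a[state2idx["BOS"]][state2idx["N"]] = 0.5
b[state2idx["N"]][symbol2idx["time"]] = 0.1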
42
Storing HMM: sparse matrix
Full matrix:
• a_ij: a[i][j] = prob
• b_jk: b[j][k] = prob

Sparse representation (store only the nonzero entries, indexed by row or by column):
• a_ij: a[i] = “j1 p1 j2 p2 …”  or  a[j] = “i1 p1 i2 p2 …”
• b_jk: b[j] = “k1 p1 k2 p2 …”  or  b[k] = “j1 p1 j2 p2 …”
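In Python, nested dicts are a natural stand-in for the packed strings above; a minimal sketch (values again borrowed from the example grammar, storing only nonzero entries):

# Minimal sketch: a sparse, row-indexed representation using nested dicts
# (only nonzero entries are stored; missing entries mean probability 0).
a_sparse = {
    "BOS": {"N": 0.5, "DT": 0.4, "V": 0.1},
    "DT":  {"N": 1.0},
}
b_sparse = {
    "N":  {"time": 0.1, "flies": 0.1, "arrow": 0.1},
    "DT": {"an": 0.3},
}

prob = a_sparse.get("DT", {}).get("V", 0.0)   # 0.0: no DT -> V arc is stored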
43
Other implementation issues
• Index starts from 0 in programming, but often starts from 1 in algorithms
• The sum of log probabilities is used in practice to replace the product of probabilities (see the sketch after this list).
• Check constraints and print out warning if the constraints are not met.
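A minimal sketch of the log-probability point, with made-up numbers, showing that summing logs gives the same answer as multiplying probabilities while staying in a safe numeric range:

import math

# Minimal sketch of the log-prob trick: sum log probabilities instead of
# multiplying probabilities, so long sequences do not underflow to 0.
probs = [0.5, 0.1, 0.2, 0.3]

product = 1.0
for p in probs:
    product *= p                                  # underflows for long sequences

logsum = sum(math.log10(p) for p in probs)        # stays in a safe numeric range
print(product, 10 ** logsum)                      # both are 0.003 (up to rounding)

# Note: log(0) is undefined, so zero-probability entries need special handling
# (e.g., treating them as float("-inf") or skipping the path).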
44
HMM as LM
45
HMM as an LM: computing P(o1, …, oT)
1st try:
– enumerate all possible paths
– add up the probabilities of all paths
– Problem: the number of paths grows exponentially with the length of the observation
46
Forward probabilities
• Forward probability: the probability of producing O1,t-1 while ending up in state si:
\alpha_i(t) \stackrel{def}{=} P(O_{1,t-1}, X_t = s_i)

P(O_{1,T}) = \sum_{i=1}^{N} \alpha_i(T+1)
47
Calculating forward probability
Initialization: \alpha_j(1) = \pi_j

Induction:

\alpha_j(t+1) = P(O_{1,t}, X_{t+1} = s_j) = \sum_i \alpha_i(t)\, a_{ij}\, b_{ij o_t}
48
Derivation of the induction step:

\alpha_j(t+1) = P(O_{1,t}, X_{t+1} = s_j)

= \sum_i P(O_{1,t-1}, o_t, X_t = s_i, X_{t+1} = s_j)

= \sum_i P(O_{1,t-1}, X_t = s_i)\, P(o_t, X_{t+1} = s_j \mid O_{1,t-1}, X_t = s_i)

= \sum_i P(O_{1,t-1}, X_t = s_i)\, P(o_t, X_{t+1} = s_j \mid X_t = s_i)

= \sum_i \alpha_i(t)\, a_{ij}\, b_{ij o_t}
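The forward computation is the same trellis sweep as Viterbi with max replaced by sum. A minimal runnable sketch for the state-emission parameters used in the tagging example (the function name is mine; for long inputs one would switch to log probabilities or scaling):

# Minimal sketch: forward algorithm, i.e. HMM as LM computing P(O).
# pi[s], A[s][s2], B[s][w] are probability dicts; missing entries count as 0.
def forward_prob(observ, states, pi, A, B):
    alpha = {s: pi.get(s, 0.0) for s in states}            # alpha_j(1) = pi_j
    for w in observ:
        alpha = {
            j: B.get(j, {}).get(w, 0.0)
               * sum(alpha[i] * A.get(i, {}).get(j, 0.0) for i in states)
            for j in states
        }
    return sum(alpha.values())                             # P(O) = sum_j alpha_j(T+1)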
49
Summary
• Definition: hidden states, output symbols
• Properties: Markov assumption
• Applications: POS-tagging, etc.
• Three basic questions in HMM
  – Find the probability of an observation: forward probability
  – Find the best sequence: Viterbi algorithm
  – Estimate probability: MLE
• Bigram POS tagger: decoding with Viterbi algorithm