1
HMM (I)
LING 570
Fei Xia
Week 7: 11/5-11/7/07
2
HMM
• Definition and properties of HMM
  – Two types of HMM
• Three basic questions in HMM
3
Definition of HMM
4
Hidden Markov Models

• There are n states s1, …, sn in an HMM, and the states are connected.
• The output symbols are produced by the states or edges in the HMM.
• An observation O = (o1, …, oT) is a sequence of output symbols.
• Given an observation, we want to recover the hidden state sequence.
• An example: POS tagging
  – States are POS tags
  – Output symbols are words
  – Given an observation (i.e., a sentence), we want to discover the tag sequence.
5
Same observation, different state sequences
N     V      P     DT   N
time  flies  like  an   arrow

N     N      V     DT   N
time  flies  like  an   arrow
6
Two types of HMMs
• State-emission HMM (Moore machine):
  – The output symbol is produced by the states:
    • by the from-state
    • by the to-state
• Arc-emission HMM (Mealy machine):
  – The output symbol is produced by the edges, i.e., by the (from-state, to-state) pairs.
7
PFA recap
8
Formal definition of PFA
A PFA is a tuple (Q, Σ, δ, I, F, P):
• Q: a finite set of N states
• Σ: a finite set of input symbols
• δ ⊆ Q × (Σ ∪ {ε}) × Q: the transition relation between states
• I: Q → R+ (initial-state probabilities)
• F: Q → R+ (final-state probabilities)
• P: δ → R+ (transition probabilities)
9
Constraints on the functions:

\sum_{q \in Q} I(q) = 1

\forall q \in Q: \; F(q) + \sum_{a \in \Sigma \cup \{\epsilon\},\, q' \in Q} P(q, a, q') = 1

Probability of a string:

P(w_{1,n}, q_{1,n+1}) = I(q_1) \cdot F(q_{n+1}) \cdot \prod_{i=1}^{n} P(q_i, w_i, q_{i+1})

P(w_{1,n}) = \sum_{q_{1,n+1}} P(w_{1,n}, q_{1,n+1})
10
An example of PFA
[Figure: a two-state PFA. An arc labeled a:1.0 goes from q0 to q1, and q1 has a self-loop labeled b:0.8.]

I(q0) = 1.0    I(q1) = 0.0
F(q0) = 0      F(q1) = 0.2

P(ab^n) = I(q0) * P(q0, ab^n, q1) * F(q1)
        = 1.0 * 1.0 * 0.8^n * 0.2

\sum_{x} P(x) = \sum_{n \ge 0} P(ab^n) = \sum_{n \ge 0} 0.2 * 0.8^n = \frac{0.2}{1 - 0.8} = 1
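This example can be checked numerically; the following is a minimal sketch (Python, my own helper name p_ab_n, not part of the slides) that evaluates P(ab^n) and confirms that the truncated sum of string probabilities approaches 1.

# Minimal sketch: string probabilities in the example PFA above.
def p_ab_n(n: int) -> float:
    """P(a b^n) = I(q0) * 1.0 * 0.8^n * F(q1)."""
    return 1.0 * 1.0 * (0.8 ** n) * 0.2

# The total mass is a geometric series: sum_n 0.2 * 0.8^n = 0.2 / (1 - 0.8) = 1.
total = sum(p_ab_n(n) for n in range(200))     # truncated sum, converges to 1
print(p_ab_n(0), p_ab_n(3), round(total, 6))   # 0.2  0.1024  1.0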
11
Arc-emission HMM
12
Definition of arc-emission HMM
• An HMM is a tuple (S, Σ, π, A, B):
– A set of states S = {s1, s2, …, sN}.
– A set of output symbols Σ = {w1, …, wM}.
– Initial state probabilities π = {π_i}.
– Transition prob: A = {a_ij}.
– Emission prob: B = {b_ijk}.
13
Constraints in an arc-emission HMM
\sum_{i=1}^{N} \pi_i = 1

\forall i: \sum_{j=1}^{N} a_{ij} = 1

\forall i, j: \sum_{k=1}^{M} b_{ijk} = 1

For any integer n and any HMM:

\sum_{O: |O| = n} P(O \mid HMM) = 1
14
An example: HMM structure
[Figure: N states s1, s2, …, sN connected by arcs; each arc emits output symbols (e.g., w1, …, w5).]

Same kinds of parameters, but the emission probabilities depend on both states: P(w_k | s_i, s_j).

# of parameters: O(N^2 M + N^2).
15
A path in an arc-emission HMM
[Figure: a path X1 → X2 → … → Xn → Xn+1, where the arc from X_i to X_{i+1} emits o_i.]

State sequence: X_{1,n+1}
Output sequence: O_{1,n}

P(O_{1,n}, X_{1,n+1}) = \pi_{X_1} \prod_{i=1}^{n} P(X_{i+1} \mid X_i)\, P(o_i \mid X_i, X_{i+1})

P(O_{1,n}) = \sum_{X_{1,n+1}} P(O_{1,n}, X_{1,n+1})
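As a concrete illustration of this product, here is a minimal sketch (my own function and parameter layout, not from the slides) that multiplies out π, the transition probabilities, and the arc-emission probabilities along one given path; pi, A, and B are hypothetical dictionaries of trained parameters.

# Minimal sketch: P(O_{1,n}, X_{1,n+1}) for one concrete path in an arc-emission HMM.
# pi[s], A[s][s2], and B[(s, s2)][w] are hypothetical parameter tables.
def joint_prob(states, symbols, pi, A, B):
    assert len(states) == len(symbols) + 1        # n+1 states emit n symbols
    prob = pi[states[0]]
    for i, w in enumerate(symbols):
        s_from, s_to = states[i], states[i + 1]
        prob *= A[s_from][s_to] * B[(s_from, s_to)][w]
    return prob

Summing joint_prob over all state sequences of length n+1 gives P(O_{1,n}), which is what the forward algorithm later computes efficiently.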
16
PFA vs. Arc-emission HMM

A PFA is a tuple (Q, Σ, δ, I, F, P):
• Q: a finite set of N states
• Σ: a finite set of input symbols
• δ ⊆ Q × (Σ ∪ {ε}) × Q: the transition relation between states
• I: Q → R+ (initial-state probabilities)
• F: Q → R+ (final-state probabilities)
• P: δ → R+ (transition probabilities)

An HMM is a tuple (S, Σ, π, A, B):
– A set of states S = {s1, s2, …, sN}
– A set of output symbols Σ = {w1, …, wM}
– Initial state probabilities π = {π_i}
– Transition prob: A = {a_ij}
– Emission prob: B = {b_ijk}
17
State-emission HMM
18
Definition of state-emission HMM
• An HMM is a tuple (S, Σ, π, A, B):
– A set of states S = {s1, s2, …, sN}.
– A set of output symbols Σ = {w1, …, wM}.
– Initial state probabilities π = {π_i}.
– Transition prob: A = {a_ij}.
– Emission prob: B = {b_jk}.
• We use s_i and w_k to refer to what is in the HMM structure.
• We use X_i and O_i to refer to what is in a particular HMM path and its output.
19
Constraints in a state-emission HMM
\sum_{i=1}^{N} \pi_i = 1

\forall i: \sum_{j=1}^{N} a_{ij} = 1

\forall j: \sum_{k=1}^{M} b_{jk} = 1

For any integer n and any HMM:

\sum_{O: |O| = n} P(O \mid HMM) = 1
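These constraints are what slide 43's "check constraints and print out a warning" refers to. A minimal sketch of such a check (Python, my own function name, assuming π, A, and B are stored as dicts keyed by state and symbol strings):

# Minimal sketch: warn if the state-emission HMM constraints above are violated.
# pi is {state: prob}; A and B are nested dicts keyed by state (and symbol).
EPS = 1e-6

def check_hmm(pi, A, B):
    if abs(sum(pi.values()) - 1.0) > EPS:
        print("warning: initial state probabilities do not sum to 1")
    for i, row in A.items():
        if abs(sum(row.values()) - 1.0) > EPS:
            print(f"warning: transition probs out of state {i} do not sum to 1")
    for j, row in B.items():
        if abs(sum(row.values()) - 1.0) > EPS:
            print(f"warning: emission probs of state {j} do not sum to 1")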
20
An example: the HMM structure
[Figure: N states s1, s2, …, sN connected by arcs; each state emits output symbols (e.g., w1, w2, w3, w5).]

Two kinds of parameters:
• Transition probability: P(s_j | s_i)
• Emission probability: P(w_k | s_i)

# of parameters: O(NM + N^2)
21
Output symbols are generated by the from-states
• State sequence: X_{1,n}
• Output sequence: O_{1,n}

[Figure: a path X1 → X2 → … → Xn, where state X_i emits o_i.]

P(O_{1,n}, X_{1,n}) = \pi_{X_1} \left( \prod_{i=1}^{n-1} P(X_{i+1} \mid X_i) \right) \left( \prod_{i=1}^{n} P(o_i \mid X_i) \right)

P(O_{1,n}) = \sum_{X_{1,n}} P(O_{1,n}, X_{1,n})
22
Output symbols are generated by the to-states
• State sequence: X_{1,n+1}
• Output sequence: O_{1,n}

[Figure: a path X1 → X2 → … → Xn+1, where state X_{i+1} emits o_i.]

P(O_{1,n}, X_{1,n+1}) = \pi_{X_1} \prod_{i=1}^{n} P(X_{i+1} \mid X_i)\, P(o_i \mid X_{i+1})

P(O_{1,n}) = \sum_{X_{1,n+1}} P(O_{1,n}, X_{1,n+1})
23
A path in a state-emission HMM
Output symbols are produced by the from-states:
[Figure: a path X1 → X2 → … → Xn, where state X_i emits o_i.]

Output symbols are produced by the to-states:
[Figure: a path X1 → X2 → … → Xn+1, where state X_{i+1} emits o_i.]
24
Arc-emission vs. state-emission
[Figure: a state-emission path (states emit the output symbols) next to an arc-emission path (arcs emit the output symbols).]
25
Properties of HMM
• Markov assumption (Limited horizon):

  P(X_{t+1} \mid X_1, \ldots, X_t) = P(X_{t+1} \mid X_t)

• Stationary distribution (Time invariance): the probabilities do not change over time:

  P(X_{t+1} \mid X_t) = P(X_{t+m+1} \mid X_{t+m})

• The states are hidden because we know the structure of the machine (i.e., S and Σ), but we don’t know which state sequences generate a particular output.
26
Are the two types of HMMs equivalent?
• For each state-emission HMM1, there is an arc-emission HMM2, such that for any sequence O, P(O|HMM1)=P(O|HMM2).
• The reverse is also true.
• How to prove that?
27
Applications of HMM
• N-gram POS tagging
  – Bigram tagger: o_i is a word, and s_i is a POS tag.
• Other tagging problems:
  – Word segmentation
  – Chunking
  – NE tagging
  – Punctuation prediction
  – …
• Other applications: ASR, ….
28
Three HMM questions
29
Three fundamental questions for HMMs
• Training an HMM: given a set of observation sequences, learn the model’s distribution, i.e., the transition and emission probabilities
• HMM as a parser: Finding the best state sequence for a given observation
• HMM as an LM: compute the probability of a given observation
30
Training an HMM: estimating the probabilities
• Supervised learning:
  – The state sequences in the training data are known
  – ML estimation (see the sketch below)
• Unsupervised learning:
  – The state sequences in the training data are unknown
  – Forward-backward algorithm
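For the supervised case, ML estimation reduces to relative-frequency counting over the tagged training data. A minimal sketch, assuming the training data is a list of (tags, words) pairs where tags[0] is BOS (the function name and data layout are mine, not the Hw interface):

from collections import defaultdict

# Minimal sketch: supervised ML estimation for a bigram (state-emission) HMM.
# data is a list of (tags, words) pairs, e.g. (["BOS", "N", "V"], ["time", "flies"]);
# tags[0] is assumed to be BOS, so tags[1:] aligns with words.
def mle_train(data):
    trans = defaultdict(lambda: defaultdict(int))    # trans[prev][cur] = count
    emit = defaultdict(lambda: defaultdict(int))     # emit[tag][word]  = count
    for tags, words in data:
        for prev, cur in zip(tags, tags[1:]):
            trans[prev][cur] += 1
        for tag, word in zip(tags[1:], words):
            emit[tag][word] += 1
    # relative frequencies: P(cur | prev) and P(word | tag)
    A = {p: {c: n / sum(row.values()) for c, n in row.items()} for p, row in trans.items()}
    B = {t: {w: n / sum(row.values()) for w, n in row.items()} for t, row in emit.items()}
    return A, B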
31
HMM as a parser
32
HMM as a parser: Finding the best state sequence
• Given the observation O1,T=o1…oT, find the state sequence X1,T+1=X1 … XT+1 that maximizes P(X1,T+1 | O1,T).
Viterbi algorithm
[Figure: states X1, X2, …, XT+1 with outputs o1, o2, …, oT.]
33
“time flies like an arrow”

\init
BOS 1.0

\transition
BOS N 0.5
BOS DT 0.4
BOS V 0.1
DT N 1.0
N N 0.2
N V 0.7
N P 0.1
V DT 0.4
V N 0.4
V P 0.1
V V 0.1
P DT 0.6
P N 0.4

\emission
N time 0.1
V time 0.1
N flies 0.1
V flies 0.2
V like 0.2
P like 0.1
DT an 0.3
N arrow 0.1
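One way to read a file in this format is to switch on the \init / \transition / \emission headers. The sketch below is my own and assumes one entry per line, so check it against the actual Hw7 file specification.

# Minimal sketch: read an HMM file with \init, \transition, and \emission sections.
def read_hmm(path):
    pi, A, B = {}, {}, {}
    section = None
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            if fields[0].startswith("\\"):            # \init, \transition, or \emission
                section = fields[0]
                fields = fields[1:]                   # a header may share its line with an entry
                if not fields:
                    continue
            if section == "\\init":
                state, prob = fields
                pi[state] = float(prob)
            elif section == "\\transition":
                s_from, s_to, prob = fields
                A.setdefault(s_from, {})[s_to] = float(prob)
            elif section == "\\emission":
                state, symbol, prob = fields
                B.setdefault(state, {})[symbol] = float(prob)
    return pi, A, B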
34
Finding all the paths: to build the trellis
[Figure: the trellis for “time flies like an arrow” — a BOS start node followed by one column of candidate tags {N, V, P, DT} per word.]
35
Finding all the paths (cont)
[Figure: the same trellis as on the previous slide.]
36
Viterbi algorithm
The probability of the best path that produces O_{1,t-1} while ending up in state s_j:

\delta_j(t) \stackrel{def}{=} \max_{X_{1,t-1}} P(O_{1,t-1}, X_{1,t-1}, X_t = s_j)

Initialization: \delta_j(1) = \pi_j

Induction: \delta_j(t+1) = \max_i \delta_i(t)\, a_{ij}\, b_{ij o_t}

Modify it to allow ε-emission.
37
Proof of the recursive function
\delta_j(t+1) = \max_{X_{1,t}} P(O_{1,t}, X_{1,t}, X_{t+1} = s_j)

= \max_i \max_{X_{1,t-1}} P(O_{1,t-1}, o_t, X_{1,t-1}, X_t = s_i, X_{t+1} = s_j)

= \max_i \max_{X_{1,t-1}} P(O_{1,t-1}, X_{1,t-1}, X_t = s_i)\, P(o_t, X_{t+1} = s_j \mid O_{1,t-1}, X_{1,t-1}, X_t = s_i)

= \max_i \max_{X_{1,t-1}} P(O_{1,t-1}, X_{1,t-1}, X_t = s_i)\, a_{ij}\, b_{ij o_t}

= \max_i \left( a_{ij}\, b_{ij o_t} \max_{X_{1,t-1}} P(O_{1,t-1}, X_{1,t-1}, X_t = s_i) \right)

= \max_i \delta_i(t)\, a_{ij}\, b_{ij o_t}
38
Viterbi algorithm: calculating δ_j(t)

# N is the number of states in the HMM structure
# observ is the observation O, and leng is the length of observ

Initialize viterbi[0..leng][0..N-1] to 0
for each state j
    viterbi[0][j] = π[j]
    back-pointer[0][j] = -1              # dummy

for (t=0; t<leng; t++)
    for (j=0; j<N; j++)
        k = observ[t]                    # the symbol at time t
        viterbi[t+1][j] = max_i viterbi[t][i] * a_ij * b_jk
        back-pointer[t+1][j] = arg max_i viterbi[t][i] * a_ij * b_jk
39
Viterbi algorithm: retrieving the best path
# find the best path
best_final_state = arg max_j viterbi[leng][j]

# start with the last state in the sequence
j = best_final_state
push(arr, j)

for (t=leng; t>0; t--)
    i = back-pointer[t][j]
    push(arr, i)
    j = i

return reverse(arr)
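Putting slides 38 and 39 together, here is a runnable Python sketch of the same procedure over string-keyed parameter dicts; the function signature and variable names are my own, and the real Hw8 interface may differ (it also ignores zero-probability corner cases and uses raw probabilities rather than logprobs).

# Minimal sketch: Viterbi decoding for a state-emission HMM.
# pi[s], A[s][s2], B[s][w] are probability dicts; missing entries count as 0.
def viterbi(observ, states, pi, A, B):
    # delta[t][s]: prob of the best path producing observ[:t] and ending in state s
    delta = [{s: pi.get(s, 0.0) for s in states}]
    backptr = [{s: None for s in states}]
    for t, w in enumerate(observ):
        delta.append({})
        backptr.append({})
        for j in states:
            best_i, best_p = None, 0.0
            for i in states:
                p = delta[t][i] * A.get(i, {}).get(j, 0.0) * B.get(j, {}).get(w, 0.0)
                if p > best_p:
                    best_i, best_p = i, p
            delta[t + 1][j] = best_p
            backptr[t + 1][j] = best_i
    # follow the back-pointers from the best final state
    j = max(states, key=lambda s: delta[-1][s])
    best_prob = delta[-1][j]
    path = [j]
    for t in range(len(observ), 0, -1):
        j = backptr[t][j]
        path.append(j)
    return list(reversed(path)), best_prob

With the example grammar from slide 33, a call might look like viterbi("time flies like an arrow".split(), ["BOS", "N", "V", "P", "DT"], pi, A, B).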
40
Hw7 and Hw8
• Hw7: write an HMM “class”:
  – Read HMM input file
  – Output HMM
• Hw8: implement the algorithms for two HMM tasks:
  – HMM as parser: Viterbi algorithm
  – HMM as LM: the prob of an observation
41
Implementation issue: storing the HMM

Approach #1:
• π_i: pi{state_str}
• a_ij: a{from_state_str}{to_state_str}
• b_jk: b{state_str}{symbol}

Approach #2:
• state2idx{state_str} = state_idx
• symbol2idx{symbol_str} = symbol_idx
• π_i: pi[state_idx] = prob
• a_ij: a[from_state_idx][to_state_idx] = prob
• b_jk: b[state_idx][symbol_idx] = prob
• idx2state[state_idx] = state_str
• idx2symbol[symbol_idx] = symbol_str
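A minimal Python sketch of Approach #2, using dicts and lists for the maps and arrays (the concrete states, symbols, and values come from the example grammar on slide 33; everything else is my own naming):

# Minimal sketch: map state/symbol strings to integer indices (Approach #2).
states = ["BOS", "N", "V", "P", "DT"]
symbols = ["time", "flies", "like", "an", "arrow"]

state2idx = {s: i for i, s in enumerate(states)}      # state_str  -> state_idx
symbol2idx = {w: k for k, w in enumerate(symbols)}    # symbol_str -> symbol_idx
idx2state, idx2symbol = states, symbols               # idx -> str (just the lists)

N, M = len(states), len(symbols)
pi = [0.0] * N                                        # pi[state_idx] = prob
a = [[0.0] * N for _ in range(N)]                     # a[from_idx][to_idx] = prob
b = [[0.0] * M for _ in range(N)]                     # b[state_idx][symbol_idx] = prob

pi[state2idx["BOS"]] = 1.0
a[state2idx["BOS"]][state2idx["N"]] = 0.5
b[state2idx["N"]][symbol2idx["time"]] = 0.1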
42
Storing HMM: sparse matrix
Full matrix:
• a_ij: a[i][j] = prob
• b_jk: b[j][k] = prob

Sparse representation (store only the nonzero entries, indexed by row or by column):
• a_ij: a[i] = “j1 p1 j2 p2 …”  or  a[j] = “i1 p1 i2 p2 …”
• b_jk: b[j] = “k1 p1 k2 p2 …”  or  b[k] = “j1 p1 j2 p2 …”
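In Python, nested dicts are a natural stand-in for the packed strings above; a minimal sketch (values again borrowed from the example grammar, storing only nonzero entries):

# Minimal sketch: a sparse, row-indexed representation using nested dicts
# (only nonzero entries are stored; missing entries mean probability 0).
a_sparse = {
    "BOS": {"N": 0.5, "DT": 0.4, "V": 0.1},
    "DT":  {"N": 1.0},
}
b_sparse = {
    "N":  {"time": 0.1, "flies": 0.1, "arrow": 0.1},
    "DT": {"an": 0.3},
}

prob = a_sparse.get("DT", {}).get("V", 0.0)   # 0.0: no DT -> V arc is stored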
43
Other implementation issues
• Index starts from 0 in programming, but often starts from 1 in algorithms
• The sum of log probabilities is used in practice to replace the product of probabilities (see the sketch after this list).
• Check constraints and print out warning if the constraints are not met.
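A minimal sketch of the log-probability point, with made-up numbers, showing that summing logs gives the same answer as multiplying probabilities while staying in a safe numeric range:

import math

# Minimal sketch of the log-prob trick: sum log probabilities instead of
# multiplying probabilities, so long sequences do not underflow to 0.
probs = [0.5, 0.1, 0.2, 0.3]

product = 1.0
for p in probs:
    product *= p                                  # underflows for long sequences

logsum = sum(math.log10(p) for p in probs)        # stays in a safe numeric range
print(product, 10 ** logsum)                      # both are 0.003 (up to rounding)

# Note: log(0) is undefined, so zero-probability entries need special handling
# (e.g., treating them as float("-inf") or skipping the path).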
44
HMM as LM
45
HMM as an LM: computing P(o1, …, oT)
1st try:
– enumerate all possible paths
– add up the probabilities of all paths
– Problem: the number of paths grows exponentially with the length of the observation
46
Forward probabilities
• Forward probability: the probability of producing O1,t-1 while ending up in state si:
\alpha_i(t) \stackrel{def}{=} P(O_{1,t-1}, X_t = s_i)

P(O_{1,T}) = \sum_{i=1}^{N} \alpha_i(T+1)
47
Calculating forward probability
Initialization: \alpha_j(1) = \pi_j

Induction:

\alpha_j(t+1) = P(O_{1,t}, X_{t+1} = s_j) = \sum_i \alpha_i(t)\, a_{ij}\, b_{ij o_t}
48
Derivation of the induction step:

\alpha_j(t+1) = P(O_{1,t}, X_{t+1} = s_j)

= \sum_i P(O_{1,t-1}, o_t, X_t = s_i, X_{t+1} = s_j)

= \sum_i P(O_{1,t-1}, X_t = s_i)\, P(o_t, X_{t+1} = s_j \mid O_{1,t-1}, X_t = s_i)

= \sum_i P(O_{1,t-1}, X_t = s_i)\, P(o_t, X_{t+1} = s_j \mid X_t = s_i)

= \sum_i \alpha_i(t)\, a_{ij}\, b_{ij o_t}
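The forward computation is the same trellis sweep as Viterbi with max replaced by sum. A minimal runnable sketch for the state-emission parameters used in the tagging example (the function name is mine; for long inputs one would switch to log probabilities or scaling):

# Minimal sketch: forward algorithm, i.e. HMM as LM computing P(O).
# pi[s], A[s][s2], B[s][w] are probability dicts; missing entries count as 0.
def forward_prob(observ, states, pi, A, B):
    alpha = {s: pi.get(s, 0.0) for s in states}            # alpha_j(1) = pi_j
    for w in observ:
        alpha = {
            j: B.get(j, {}).get(w, 0.0)
               * sum(alpha[i] * A.get(i, {}).get(j, 0.0) for i in states)
            for j in states
        }
    return sum(alpha.values())                             # P(O) = sum_j alpha_j(T+1)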
49
Summary
• Definition: hidden states, output symbols
• Properties: Markov assumption
• Applications: POS-tagging, etc.
• Three basic questions in HMM
  – Find the probability of an observation: forward probability
  – Find the best sequence: Viterbi algorithm
  – Estimate probability: MLE
• Bigram POS tagger: decoding with Viterbi algorithm