1
Hidden Markov Model
Instructor : Saeed Shiry
CHAPTER 13, Ethem Alpaydin, © The MIT Press, 2004
2
Introduction
In a word, successive letters are dependent; in English, 'h' is very likely to follow 't' but not 'x'. Such processes, where there is a sequence of observations (for example, letters in a word, or base pairs in a DNA sequence), cannot be modeled as simple probability distributions.
A similar example is speech recognition where speech utterances are composed of speech primitives called phonemes; only certain sequences of phonemes are allowed, which are the words of the language.
At a higher level, words can be written or spoken in certain sequences to form a sentence as defined by the syntactic and semantic rules of the language.
A sequence can be characterized as being generated by a parametric random process. In this chapter, we discuss how this modeling is done and also how the parameters of such a model can be learned from a training sample of example sequences.
3
Objectives
Modeling dependencies in input sequences:
Temporal: in speech, phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language); in handwriting, pen movements.
Spatial: in a DNA sequence, base pairs.
4
History
Markov chain theory was developed around 1900. Hidden Markov Models were developed in the late 1960s, used extensively in speech recognition in the 1960s-70s, and introduced to computer science in 1989.
Applications:
Bioinformatics
Signal processing
Data analysis and pattern recognition
5
HMMs and their Usage
HMMs are very common in computational linguistics:
Speech recognition (observed: acoustic signal; hidden: words)
Handwriting recognition (observed: image; hidden: words)
Part-of-speech tagging (observed: words; hidden: part-of-speech tags)
Machine translation (observed: foreign words; hidden: words in target language)
6
Discrete Markov Processes
Consider a system that at any time is in one of a set of N distinct states:
S = {S1, S2, ..., SN}
The state at time t is denoted qt, t = 1, 2, ...; so, for example, qt = Si means that at time t the system is in state Si.
7
Discrete Markov Processes
At regularly spaced discrete times, the system moves to a state with a given probability, depending on the values of the previous states:
P(qt+1=Sj | qt=Si, qt-1=Sk, ...)
For the special case of a first-order Markov model, the state at time t+1 depends only on the state at time t, regardless of the states at previous times:
P(qt+1=Sj | qt=Si, qt-1=Sk, ...) = P(qt+1=Sj | qt=Si)
Today is the first day of the rest of your life.
8
Discrete Markov Processes
Transition probabilities
aij ≡ P(qt+1=Sj | qt=Si) aij ≥ 0 and Σj=1N aij=1
Initial probabilities
πi ≡ P(q1=Si) Σi=1N πi=1
Going from Si to Sj has the same probability no matter when it happens, or where it happens in the observation sequence. A = [aij] is an N × N matrix whose rows sum to 1.
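As a quick illustration, a first-order chain can be sampled directly from Π and A. This is a minimal Python sketch; the 3-state values below are assumed for illustration and are not prescribed by this slide.

```python
import random

# Hypothetical 3-state chain (values chosen for illustration only).
PI = [0.5, 0.2, 0.3]             # initial probabilities pi_i = P(q1 = S_i)
A = [[0.4, 0.3, 0.3],            # a_ij = P(q_{t+1} = S_j | q_t = S_i);
     [0.2, 0.6, 0.2],            # each row sums to 1
     [0.1, 0.1, 0.8]]

def sample_state(dist):
    """Draw an index from a discrete probability distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1         # guard against floating-point round-off

def simulate(T):
    """Generate a state sequence q_1..q_T; the next state depends only on the current one."""
    q = [sample_state(PI)]
    for _ in range(1, T):
        q.append(sample_state(A[q[-1]]))
    return q
```

Row i of A is the conditional distribution over next states given the current state Si; each call to simulate draws a fresh state sequence.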
9
Stochastic Automaton
This can be seen as a stochastic automaton
10
Observable Markov model
In an observable Markov model, the states are observable. At any time t, we know qt, and as the system moves from one state to another, we get an observation sequence that is a sequence of states.
The output of the process is the set of states at each instant of time where each state corresponds to a physical observable event.
We have an observation sequence O that is the state sequence O = Q = {q1 q2 ... qT}, whose probability is given as

P(O = Q | A, Π) = P(q1) ∏t=2..T P(qt | qt-1) = πq1 aq1q2 ··· aqT-1qT
11
Example: Balls and Urns
Three urns, each full of balls of one color:
S1: red, S2: blue, S3: green

Π = [0.5, 0.2, 0.3]

A = | 0.4 0.3 0.3 |
    | 0.2 0.6 0.2 |
    | 0.1 0.1 0.8 |

Given the observation sequence O = {S1, S1, S3, S3}:

P(O | A, Π) = P(S1) · P(S1 | S1) · P(S3 | S1) · P(S3 | S3)
            = π1 · a11 · a13 · a33
            = 0.5 · 0.4 · 0.3 · 0.8 = 0.048
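The arithmetic in the balls-and-urns example is just a chained product of the initial and transition probabilities, which is easy to check in a few lines of Python (states are 0-based: S1 → 0, S2 → 1, S3 → 2):

```python
# Model from the balls-and-urns slide: Pi = [0.5, 0.2, 0.3] and the 3x3
# transition matrix A.
PI = [0.5, 0.2, 0.3]
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def seq_prob(states, PI, A):
    """P(q1..qT | A, Pi) for an observable Markov model."""
    p = PI[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

O = [0, 0, 2, 2]                       # O = {S1, S1, S3, S3}
p = seq_prob(O, PI, A)                 # 0.5 * 0.4 * 0.3 * 0.8 ≈ 0.048
```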
12
Balls and Urns: Learning
Given K example sequences of length T:

π̂i = #{sequences starting with Si} / K
   = Σk 1(q1^k = Si) / K

âij = #{transitions from Si to Sj} / #{transitions from Si}
    = Σk Σt=1..T-1 1(qt^k = Si and qt+1^k = Sj) / Σk Σt=1..T-1 1(qt^k = Si)

where 1(·) is the indicator function and q^k denotes the k-th training sequence.
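These counting estimators can be sketched directly in Python. Sequences are lists of 0-based state indices; the function name is mine, not the slides'.

```python
from collections import Counter

def estimate_mle(sequences, N):
    """Maximum-likelihood estimates for an observable Markov model:
    pi_i = #(sequences starting with S_i) / K
    a_ij = #(transitions S_i -> S_j) / #(transitions out of S_i)."""
    K = len(sequences)
    pi = [sum(1 for q in sequences if q[0] == i) / K for i in range(N)]
    trans, out = Counter(), Counter()
    for q in sequences:
        for s, t in zip(q, q[1:]):      # count consecutive state pairs
            trans[(s, t)] += 1
            out[s] += 1
    A = [[trans[(i, j)] / out[i] if out[i] else 0.0 for j in range(N)]
         for i in range(N)]
    return pi, A
```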
13
Hidden Markov Models
States are not observable, but when we visit a state, an observation is recorded that is a probabilistic function of the state. We assume a discrete observation in each state from the set {v1, v2, ..., vM}.

Emission probabilities: bj(m) ≡ P(Ot=vm | qt=Sj)

bj(m) is the observation, or emission, probability: the probability that we observe vm, m = 1, ..., M, in state Sj.

The state sequence Q is not observed, which is what makes the model "hidden," but it should be inferred from the observation sequence O.
14
For each observation sequence, there are multiple state sequences
In this case of a hidden Markov model, there are two sources of randomness: in addition to randomly moving from one state to another, the observation in a state is also random.
15
Example: Balls and Urns
The hidden case: each urn contains balls of different colors. Let bj(m) be the probability of drawing a ball of color m from urn j. We observe a sequence of ball colors but without knowing the sequence of urns from which the balls were drawn. The number of ball colors may be different from the number of urns. For example, let us say we have three urns and the observation sequence is
O = {red, red, green, blue, yellow}
In the case of a hidden model, a ball could have been picked from any urn. For the same observation sequence O, there may be many possible state sequences Q that could have generated it.
16
HMM Unfolded in Time
17
Elements of an HMM
N: number of states
M: number of observation symbols
A = [aij]: N × N state transition probability matrix
B = [bj(m)]: N × M observation (emission) probability matrix
Π = [πi]: N × 1 initial state probability vector
λ = (A, B, Π): parameter set of the HMM
18
Three Basic Problems of HMMs
(Rabiner, 1989)
Given a number of sequences of observations, we are interested in three problems:
Evaluation: Given a model λ, evaluate the probability of any given observation sequence, O = {O1O2 .. OT}, namely, P (O | λ)
State sequence: Given λ and O, find the state sequence Q = {q1 q2 ... qT} that has the highest probability of generating O; that is, find Q* that maximizes P(Q | O, λ)
Learning: Given a training set of observation sequences, X={Ok}k, find λ* such that
P ( X | λ* )=maxλ P ( X | λ )
19
Evaluation
Given an observation sequence O = {O1 O2 ... OT} and a state sequence Q = {q1 q2 ... qT}, the probability of observing O given the state sequence Q is simply

P(O | Q, λ) = ∏t=1..T P(Ot | qt, λ) = bq1(O1) bq2(O2) ··· bqT(OT)

The probability of the state sequence Q is

P(Q | λ) = P(q1) ∏t=2..T P(qt | qt-1) = πq1 aq1q2 ··· aqT-1qT
20
Forward variable:
We define the forward variable αt(i) as the probability of observing the partial sequence {O1 ... Ot} until time t and being in Si at time t, given the model λ:

αt(i) ≡ P(O1 ··· Ot, qt = Si | λ)

Initialization: α1(i) = πi bi(O1)
Recursion: αt+1(j) = [Σi=1N αt(i) aij] bj(Ot+1)
21
Forward variable:
When we have calculated the forward variables, it is easy to calculate the probability of the observation sequence:

P(O | λ) = Σi=1N αT(i)

αT(i) is the probability of generating the full observation sequence and ending up in state Si, so we sum over all such possible final states.
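The forward initialization and recursion translate directly into code. A minimal Python sketch (0-based indices; B[i][m] = bi(vm)):

```python
def forward(O, PI, A, B):
    """alpha[t][i] = P(O_1..O_{t+1}, q_{t+1} = S_i | lambda), with 0-based t."""
    N = len(PI)
    alpha = [[PI[i] * B[i][O[0]] for i in range(N)]]               # initialization
    for t in range(1, len(O)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])                           # recursion
    return alpha

def likelihood(O, PI, A, B):
    """P(O | lambda) = sum_i alpha_T(i): total probability over final states."""
    return sum(forward(O, PI, A, B)[-1])
```

In practice the alpha values underflow for long sequences, so real implementations scale each time step or work in log space.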
22
Backward variable:
We define the backward variable βt(i) as the probability of observing the partial sequence {Ot+1 ... OT} after time t, given that we are in Si at time t and given the model λ:

βt(i) ≡ P(Ot+1 ··· OT | qt = Si, λ)

Initialization: βT(i) = 1
Recursion: βt(i) = Σj=1N aij bj(Ot+1) βt+1(j)
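A matching sketch of the backward pass, under the same conventions as the forward pass (0-based indices, B[j][m] = bj(vm)):

```python
def backward(O, A, B):
    """beta[t][i] = P(O_{t+2}..O_T | q_{t+1} = S_i, lambda), with 0-based t."""
    N, T = len(A), len(O)
    beta = [[1.0] * N for _ in range(T)]                 # beta_T(i) = 1
    for t in range(T - 2, -1, -1):                       # recursion, backwards in time
        beta[t] = [sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    return beta
```

A useful consistency check: Σi πi bi(O1) β1(i) must equal P(O | λ) from the forward pass.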
23
caution
24
Finding the State Sequence
Let us define γt(i) as the probability of being in state Si at time t, given O and λ, which can be computed as

γt(i) ≡ P(qt = Si | O, λ) = αt(i) βt(i) / Σj=1N αt(j) βt(j)

Choose the state that has the highest probability, for each time step:
qt* = arg maxi γt(i)

No! Picking the individually most probable state at each time step does not in general give the most probable state sequence; the resulting path can even contain a transition with aij = 0.
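Given forward and backward tables, γ is a per-time-step normalization. A short sketch (alpha and beta are lists of per-time rows, as produced by any forward/backward pass):

```python
def gamma_from(alpha, beta):
    """gamma[t][i] = alpha_t(i) * beta_t(i) / sum_j alpha_t(j) * beta_t(j)."""
    gamma = []
    for a_t, b_t in zip(alpha, beta):
        w = [a * b for a, b in zip(a_t, b_t)]
        s = sum(w)
        gamma.append([x / s for x in w])   # normalize into a posterior over states
    return gamma
```

Each row of gamma sums to 1, since it is the posterior distribution over states at that time step.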
25
Viterbi’s Algorithm
δt(i) ≡ maxq1q2∙∙∙qt-1 P(q1 q2 ∙∙∙ qt-1, qt = Si, O1 ∙∙∙ Ot | λ)

Initialization: δ1(i) = πi bi(O1), ψ1(i) = 0
Recursion: δt(j) = [maxi δt-1(i) aij] bj(Ot), ψt(j) = arg maxi δt-1(i) aij
Termination: p* = maxi δT(i), qT* = arg maxi δT(i)
Path backtracking: qt* = ψt+1(qt+1*), t = T-1, T-2, ..., 1
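The four steps above map directly to code. A minimal sketch returning the most probable path and its probability (0-based indices):

```python
def viterbi(O, PI, A, B):
    """Most probable state path by dynamic programming."""
    N = len(PI)
    delta = [PI[i] * B[i][O[0]] for i in range(N)]       # initialization
    psi = []                                             # backpointers psi_t(j)
    for t in range(1, len(O)):
        new, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new.append(delta[best_i] * A[best_i][j] * B[j][O[t]])
        delta = new
        psi.append(back)
    q = [max(range(N), key=lambda i: delta[i])]          # termination
    for back in reversed(psi):                           # path backtracking
        q.insert(0, back[q[0]])
    return q, max(delta)
```

Like the forward pass, this should be run in log space for long sequences (products of probabilities become sums of logs, and the argmax is unchanged).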
26
Learning
Baum-Welch (EM) algorithm

We define ξt(i, j) as the probability of being in Si at time t and in Sj at time t+1, given the whole observation sequence O and λ:

ξt(i, j) ≡ P(qt = Si, qt+1 = Sj | O, λ) = αt(i) aij bj(Ot+1) βt+1(j) / Σk Σl αt(k) akl bl(Ot+1) βt+1(l)

Define the indicator variables:
zt^i = 1 if qt = Si, and 0 otherwise
zt^(ij) = 1 if qt = Si and qt+1 = Sj, and 0 otherwise
27
Baum-Welch (EM)

E-step: using the current λ, compute the expected counts
E[zt^i] = γt(i),  E[zt^(ij)] = ξt(i, j)

M-step: re-estimate the parameters from these expected counts over the K training sequences:

π̂i = Σk γ1^k(i) / K

âij = Σk Σt=1..T-1 ξt^k(i, j) / Σk Σt=1..T-1 γt^k(i)

b̂j(m) = Σk Σt=1..T γt^k(j) 1(Ot^k = vm) / Σk Σt=1..T γt^k(j)
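One full EM iteration can be sketched end to end. For brevity this version handles a single sequence (K = 1); extending to K sequences just sums the expected counts over k as in the formulas above. Function and variable names are mine, not the slides'.

```python
def baum_welch_step(O, PI, A, B):
    """One Baum-Welch (EM) iteration on a single observation sequence O
    (0-based symbols). E-step: gamma and xi from alpha/beta.
    M-step: re-estimate pi, a_ij, b_j(m) from expected counts."""
    N, M, T = len(PI), len(B[0]), len(O)
    # Forward pass
    alpha = [[PI[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    # Backward pass
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
                   for i in range(N)]
    PO = sum(alpha[-1])                                  # P(O | lambda)
    # E-step: expected state and transition counts
    gamma = [[alpha[t][i] * beta[t][i] / PO for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: re-estimate parameters from the expected counts
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T-1)) /
              sum(gamma[t][i] for t in range(T-1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if O[t] == m) /
              sum(gamma[t][j] for t in range(T)) for m in range(M)]
             for j in range(N)]
    return new_pi, new_A, new_B
```

Iterating this step never decreases P(O | λ) (the standard EM guarantee); as with the forward pass, a practical version needs scaling or log-space arithmetic for long sequences.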
28
References
Concept: Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.
Application: Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS'04), 193-204.