1
Hidden Markov Model
Instructor : Saeed Shiry
CHAPTER 13, Ethem Alpaydin, © The MIT Press, 2004
2
Introduction
In a word, successive letters are dependent; in English, 'h' is very likely to follow 't' but not 'x'. Such processes, where there is a sequence of observations (for example, letters in a word, or base pairs in a DNA sequence), cannot be modeled as simple probability distributions.
A similar example is speech recognition where speech utterances are composed of speech primitives called phonemes; only certain sequences of phonemes are allowed, which are the words of the language.
At a higher level, words can be written or spoken in certain sequences to form a sentence as defined by the syntactic and semantic rules of the language.
A sequence can be characterized as being generated by a parametric random process. In this chapter, we discuss how this modeling is done and also how the parameters of such a model can be learned from a training sample of example sequences.
3
Objectives
Modeling dependencies in input sequences:
Temporal: in speech, phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language); in handwriting, pen movements.
Spatial: in a DNA sequence, base pairs.
4
History
Markov chain theory was developed around 1900. Hidden Markov Models were developed in the late 1960s, used extensively in speech recognition in the 1960s-70s, and introduced to computer science in 1989.
Applications:
Bioinformatics
Signal processing
Data analysis and pattern recognition
5
HMMs and their Usage
HMMs are very common in computational linguistics:
Speech recognition (observed: acoustic signal; hidden: words)
Handwriting recognition (observed: image; hidden: words)
Part-of-speech tagging (observed: words; hidden: part-of-speech tags)
Machine translation (observed: foreign words; hidden: words in target language)
6
Discrete Markov Processes
Consider a system that at any time is in one of a set of N distinct states:
S = {S1, S2, ..., SN}
The state at time t is denoted qt, t = 1, 2, ...; so, for example, qt = Si means that at time t the system is in state Si.
7
Discrete Markov Processes
At regularly spaced discrete times, the system moves to a state with a given probability, depending on the values of the previous states:
P(qt+1=Sj | qt=Si, qt-1=Sk, ...)
For the special case of a first-order Markov model, the state at time t+1 depends only on the state at time t, regardless of the states at previous times:
P(qt+1=Sj | qt=Si, qt-1=Sk, ...) = P(qt+1=Sj | qt=Si)
Today is the first day of the rest of your life.
8
Discrete Markov Processes
Transition probabilities
aij ≡ P(qt+1=Sj | qt=Si) aij ≥ 0 and Σj=1N aij=1
Initial probabilities
πi ≡ P(q1=Si) Σi=1N πi=1
Going from Si to Sj has the same probability no matter when it happens, or where it happens in the observation sequence. A = [aij] is an N × N matrix whose rows sum to 1.
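As a quick illustration, a first-order chain can be sampled directly from Π and A. This is a minimal Python sketch; the 3-state values below are assumed for illustration and are not prescribed by this slide.

```python
import random

# Hypothetical 3-state chain (values chosen for illustration only).
PI = [0.5, 0.2, 0.3]             # initial probabilities pi_i = P(q1 = S_i)
A = [[0.4, 0.3, 0.3],            # a_ij = P(q_{t+1} = S_j | q_t = S_i);
     [0.2, 0.6, 0.2],            # each row sums to 1
     [0.1, 0.1, 0.8]]

def sample_state(dist):
    """Draw an index from a discrete probability distribution."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1         # guard against floating-point round-off

def simulate(T):
    """Generate a state sequence q_1..q_T; the next state depends only on the current one."""
    q = [sample_state(PI)]
    for _ in range(1, T):
        q.append(sample_state(A[q[-1]]))
    return q
```

Row i of A is the conditional distribution over next states given the current state Si; each call to simulate draws a fresh state sequence.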
9
Stochastic Automaton
This can be seen as a stochastic automaton
10
Observable Markov model
In an observable Markov model, the states are observable. At any time t, we know qt, and as the system moves from one state to another, we get an observation sequence that is a sequence of states.
The output of the process is the set of states at each instant of time where each state corresponds to a physical observable event.
We have an observation sequence O that is the state sequence O = Q = {q1 q2 ... qT}, whose probability is given as

P(O = Q | A, Π) = P(q1) ∏t=2..T P(qt | qt-1) = πq1 aq1q2 ··· aqT-1qT
11
Example: Balls and Urns
Three urns, each full of balls of one color:
S1: red, S2: blue, S3: green

Π = [0.5, 0.2, 0.3]

A = | 0.4 0.3 0.3 |
    | 0.2 0.6 0.2 |
    | 0.1 0.1 0.8 |

Given the observation sequence O = {S1, S1, S3, S3}:

P(O | A, Π) = P(S1) · P(S1 | S1) · P(S3 | S1) · P(S3 | S3)
            = π1 · a11 · a13 · a33
            = 0.5 · 0.4 · 0.3 · 0.8 = 0.048
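The arithmetic in the balls-and-urns example is just a chained product of the initial and transition probabilities, which is easy to check in a few lines of Python (states are 0-based: S1 → 0, S2 → 1, S3 → 2):

```python
# Model from the balls-and-urns slide: Pi = [0.5, 0.2, 0.3] and the 3x3
# transition matrix A.
PI = [0.5, 0.2, 0.3]
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

def seq_prob(states, PI, A):
    """P(q1..qT | A, Pi) for an observable Markov model."""
    p = PI[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

O = [0, 0, 2, 2]                       # O = {S1, S1, S3, S3}
p = seq_prob(O, PI, A)                 # 0.5 * 0.4 * 0.3 * 0.8 ≈ 0.048
```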
12
Balls and Urns: Learning
Given K example sequences of length T:

π̂i = #{sequences starting with Si} / K
   = Σk 1(q1^k = Si) / K

âij = #{transitions from Si to Sj} / #{transitions from Si}
    = Σk Σt=1..T-1 1(qt^k = Si and qt+1^k = Sj) / Σk Σt=1..T-1 1(qt^k = Si)

where 1(·) is the indicator function and q^k denotes the k-th training sequence.
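These counting estimators can be sketched directly in Python. Sequences are lists of 0-based state indices; the function name is mine, not the slides'.

```python
from collections import Counter

def estimate_mle(sequences, N):
    """Maximum-likelihood estimates for an observable Markov model:
    pi_i = #(sequences starting with S_i) / K
    a_ij = #(transitions S_i -> S_j) / #(transitions out of S_i)."""
    K = len(sequences)
    pi = [sum(1 for q in sequences if q[0] == i) / K for i in range(N)]
    trans, out = Counter(), Counter()
    for q in sequences:
        for s, t in zip(q, q[1:]):      # count consecutive state pairs
            trans[(s, t)] += 1
            out[s] += 1
    A = [[trans[(i, j)] / out[i] if out[i] else 0.0 for j in range(N)]
         for i in range(N)]
    return pi, A
```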
13
Hidden Markov Models
States are not observable, but when we visit a state, an observation is recorded that is a probabilistic function of the state. We assume a discrete observation in each state from the set {v1, v2, ..., vM}.

Emission probabilities: bj(m) ≡ P(Ot=vm | qt=Sj)

bj(m) is the observation, or emission, probability: the probability that we observe vm, m = 1, ..., M, in state Sj.

The state sequence Q is not observed, which is what makes the model "hidden," but it should be inferred from the observation sequence O.
14
For each observation sequence, there are multiple state sequences
In this case of a hidden Markov model, there are two sources of randomness: in addition to randomly moving from one state to another, the observation in a state is also random.
15
Example: Balls and Urns
The hidden case: each urn contains balls of different colors. Let bj(m) be the probability of drawing a ball of color m from urn j. We observe a sequence of ball colors but without knowing the sequence of urns from which the balls were drawn. The number of ball colors may be different from the number of urns. For example, let us say we have three urns and the observation sequence is
O = {red, red, green, blue, yellow}
In the case of a hidden model, a ball could have been picked from any urn. For the same observation sequence O, there may be many possible state sequences Q that could have generated it.
16
HMM Unfolded in Time
17
Elements of an HMM
N: number of states
M: number of observation symbols
A = [aij]: N × N state transition probability matrix
B = [bj(m)]: N × M observation (emission) probability matrix
Π = [πi]: N × 1 initial state probability vector
λ = (A, B, Π): parameter set of the HMM
18
Three Basic Problems of HMMs
(Rabiner, 1989)
Given a number of sequences of observations, we are interested in three problems:
Evaluation: Given a model λ, evaluate the probability of any given observation sequence, O = {O1O2 .. OT}, namely, P (O | λ)
State sequence: Given λ and O, find the state sequence Q = {q1 q2 ... qT} that has the highest probability of generating O; that is, find Q* that maximizes P(Q | O, λ)
Learning: Given a training set of observation sequences, X={Ok}k, find λ* such that
P ( X | λ* )=maxλ P ( X | λ )
19
Evaluation
Given an observation sequence O = {O1 O2 ... OT} and a state sequence Q = {q1 q2 ... qT}, the probability of observing O given the state sequence Q is simply

P(O | Q, λ) = ∏t=1..T P(Ot | qt, λ) = bq1(O1) bq2(O2) ··· bqT(OT)

The probability of the state sequence Q is

P(Q | λ) = P(q1) ∏t=2..T P(qt | qt-1) = πq1 aq1q2 ··· aqT-1qT
20
Forward variable:
We define the forward variable αt(i) as the probability of observing the partial sequence {O1 ... Ot} until time t and being in Si at time t, given the model λ:

αt(i) ≡ P(O1 ··· Ot, qt = Si | λ)

Initialization: α1(i) = πi bi(O1)
Recursion: αt+1(j) = [Σi=1N αt(i) aij] bj(Ot+1)
21
Forward variable:
When we have calculated the forward variables, it is easy to calculate the probability of the observation sequence:

P(O | λ) = Σi=1N αT(i)

αT(i) is the probability of generating the full observation sequence and ending up in state Si, so we sum over all such possible final states.
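The forward initialization and recursion translate directly into code. A minimal Python sketch (0-based indices; B[i][m] = bi(vm)):

```python
def forward(O, PI, A, B):
    """alpha[t][i] = P(O_1..O_{t+1}, q_{t+1} = S_i | lambda), with 0-based t."""
    N = len(PI)
    alpha = [[PI[i] * B[i][O[0]] for i in range(N)]]               # initialization
    for t in range(1, len(O)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])                           # recursion
    return alpha

def likelihood(O, PI, A, B):
    """P(O | lambda) = sum_i alpha_T(i): total probability over final states."""
    return sum(forward(O, PI, A, B)[-1])
```

In practice the alpha values underflow for long sequences, so real implementations scale each time step or work in log space.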
22
Backward variable:
We define the backward variable βt(i) as the probability of observing the partial sequence {Ot+1 ... OT} after time t, given that we are in Si at time t and given the model λ:

βt(i) ≡ P(Ot+1 ··· OT | qt = Si, λ)

Initialization: βT(i) = 1
Recursion: βt(i) = Σj=1N aij bj(Ot+1) βt+1(j)
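A matching sketch of the backward pass, under the same conventions as the forward pass (0-based indices, B[j][m] = bj(vm)):

```python
def backward(O, A, B):
    """beta[t][i] = P(O_{t+2}..O_T | q_{t+1} = S_i, lambda), with 0-based t."""
    N, T = len(A), len(O)
    beta = [[1.0] * N for _ in range(T)]                 # beta_T(i) = 1
    for t in range(T - 2, -1, -1):                       # recursion, backwards in time
        beta[t] = [sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    return beta
```

A useful consistency check: Σi πi bi(O1) β1(i) must equal P(O | λ) from the forward pass.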
23
caution
24
Finding the State Sequence
Let us define γt(i) as the probability of being in state Si at time t, given O and λ, which can be computed as

γt(i) ≡ P(qt = Si | O, λ) = αt(i) βt(i) / Σj=1N αt(j) βt(j)

Choose the state that has the highest probability, for each time step:
qt* = arg maxi γt(i)

No! Picking the individually most probable state at each time step does not in general give the most probable state sequence; the resulting path can even contain a transition with aij = 0.
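Given forward and backward tables, γ is a per-time-step normalization. A short sketch (alpha and beta are lists of per-time rows, as produced by any forward/backward pass):

```python
def gamma_from(alpha, beta):
    """gamma[t][i] = alpha_t(i) * beta_t(i) / sum_j alpha_t(j) * beta_t(j)."""
    gamma = []
    for a_t, b_t in zip(alpha, beta):
        w = [a * b for a, b in zip(a_t, b_t)]
        s = sum(w)
        gamma.append([x / s for x in w])   # normalize into a posterior over states
    return gamma
```

Each row of gamma sums to 1, since it is the posterior distribution over states at that time step.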
25
Viterbi’s Algorithm
δt(i) ≡ maxq1q2∙∙∙qt-1 P(q1 q2 ∙∙∙ qt-1, qt = Si, O1 ∙∙∙ Ot | λ)

Initialization: δ1(i) = πi bi(O1), ψ1(i) = 0
Recursion: δt(j) = [maxi δt-1(i) aij] bj(Ot), ψt(j) = arg maxi δt-1(i) aij
Termination: p* = maxi δT(i), qT* = arg maxi δT(i)
Path backtracking: qt* = ψt+1(qt+1*), t = T-1, T-2, ..., 1
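The four steps above map directly to code. A minimal sketch returning the most probable path and its probability (0-based indices):

```python
def viterbi(O, PI, A, B):
    """Most probable state path by dynamic programming."""
    N = len(PI)
    delta = [PI[i] * B[i][O[0]] for i in range(N)]       # initialization
    psi = []                                             # backpointers psi_t(j)
    for t in range(1, len(O)):
        new, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new.append(delta[best_i] * A[best_i][j] * B[j][O[t]])
        delta = new
        psi.append(back)
    q = [max(range(N), key=lambda i: delta[i])]          # termination
    for back in reversed(psi):                           # path backtracking
        q.insert(0, back[q[0]])
    return q, max(delta)
```

Like the forward pass, this should be run in log space for long sequences (products of probabilities become sums of logs, and the argmax is unchanged).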
26
Learning
Baum-Welch (EM) algorithm

We define ξt(i, j) as the probability of being in Si at time t and in Sj at time t+1, given the whole observation sequence O and λ:

ξt(i, j) ≡ P(qt = Si, qt+1 = Sj | O, λ) = αt(i) aij bj(Ot+1) βt+1(j) / Σk Σl αt(k) akl bl(Ot+1) βt+1(l)

Define the indicator variables:
zt^i = 1 if qt = Si, and 0 otherwise
zt^(ij) = 1 if qt = Si and qt+1 = Sj, and 0 otherwise
27
Baum-Welch (EM)

E-step: using the current λ, compute the expected counts
E[zt^i] = γt(i),  E[zt^(ij)] = ξt(i, j)

M-step: re-estimate the parameters from these expected counts over the K training sequences:

π̂i = Σk γ1^k(i) / K

âij = Σk Σt=1..T-1 ξt^k(i, j) / Σk Σt=1..T-1 γt^k(i)

b̂j(m) = Σk Σt=1..T γt^k(j) 1(Ot^k = vm) / Σk Σt=1..T γt^k(j)
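One full EM iteration can be sketched end to end. For brevity this version handles a single sequence (K = 1); extending to K sequences just sums the expected counts over k as in the formulas above. Function and variable names are mine, not the slides'.

```python
def baum_welch_step(O, PI, A, B):
    """One Baum-Welch (EM) iteration on a single observation sequence O
    (0-based symbols). E-step: gamma and xi from alpha/beta.
    M-step: re-estimate pi, a_ij, b_j(m) from expected counts."""
    N, M, T = len(PI), len(B[0]), len(O)
    # Forward pass
    alpha = [[PI[i] * B[i][O[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][O[t]]
                      for j in range(N)])
    # Backward pass
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][O[t+1]] * beta[t+1][j] for j in range(N))
                   for i in range(N)]
    PO = sum(alpha[-1])                                  # P(O | lambda)
    # E-step: expected state and transition counts
    gamma = [[alpha[t][i] * beta[t][i] / PO for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t+1]] * beta[t+1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: re-estimate parameters from the expected counts
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T-1)) /
              sum(gamma[t][i] for t in range(T-1)) for j in range(N)]
             for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if O[t] == m) /
              sum(gamma[t][j] for t in range(T)) for m in range(M)]
             for j in range(N)]
    return new_pi, new_A, new_B
```

Iterating this step never decreases P(O | λ) (the standard EM guarantee); as with the forward pass, a practical version needs scaling or log-space arithmetic for long sequences.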
28
References
Concept: Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.
Application: Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS'04), 193-204.