Markov Models & Hidden Markov Models

Uploaded by digvijay-singh on 05-Dec-2014

TRANSCRIPT

Page 1: Hmm viterbi

Markov Models & Hidden Markov Models

Page 2: Hmm viterbi

Time-based Models

• Simple parametric distributions are typically based on what is called the "independence assumption": each data point is independent of the others, and there is no time-sequencing or ordering.

• What if the data has correlations based on its order, like a time-series?

Page 3: Hmm viterbi

States

• An atomic event is an assignment to every random variable in the domain.
• States are atomic events that can transfer from one to another.
• Suppose a model has n states.
• A state-transition diagram describes how the model behaves.

Page 4: Hmm viterbi

State-transition

Following assumptions:
- Transition probabilities are stationary
- The event space does not change over time
- Probability distribution over next states depends only on the current state

Page 5: Hmm viterbi

State-transition

Following assumptions:
- Transition probabilities are stationary
- The event space does not change over time
- Probability distribution over next states depends only on the current state

Markov Assumption

Page 6: Hmm viterbi

Markov random processes

• A random sequence has the Markov property if the distribution of its next state is determined solely by its current state.

• Any random process having this property is called a Markov random process.

• A system with states that obey the Markov assumption is called a Markov Model.

• A sequence of states resulting from such a model is called a Markov Chain.

Page 7: Hmm viterbi

Chain Rule & Markov Property

By the chain rule (repeated application of Bayes' rule):

  P(q1, ..., qt) = P(qt | q1, ..., qt-1) P(q1, ..., qt-1)
                 = P(qt | q1, ..., qt-1) P(qt-1 | q1, ..., qt-2) P(q1, ..., qt-2)
                 = P(q1) ∏i=2..t P(qi | q1, ..., qi-1)

Markov property:

  P(qi | q1, ..., qi-1) = P(qi | qi-1)   for i > 1

Combining the two:

  P(q1, ..., qt) = P(q1) P(q2 | q1) P(q3 | q2) ... P(qt | qt-1) = P(q1) ∏i=2..t P(qi | qi-1)

Page 8: Hmm viterbi

Markov Assumption

• The Markov assumption states that the probability of the occurrence of word wi at time t depends only on the occurrence of word wi-1 at time t-1.

– Chain rule:

  P(w1, ..., wn) = P(w1) ∏i=2..n P(wi | w1, ..., wi-1)

– Markov assumption:

  P(w1, ..., wn) ≈ P(w1) ∏i=2..n P(wi | wi-1)

Page 9: Hmm viterbi

Andrei Andreyevich Markov

Born: 14 June 1856 in Ryazan, Russia. Died: 20 July 1922 in Petrograd (now St Petersburg), Russia.

Markov is particularly remembered for his study of Markov chains: sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.

Page 10: Hmm viterbi

[Diagram: states s1, s2, s3]

Has N states, called s1, s2 .. sN

There are discrete timesteps, t=0, t=1, …

N = 3

t=0

A Markov System

Page 11: Hmm viterbi

[Diagram: states s1, s2, s3]

Has N states, called s1, s2 .. sN

There are discrete timesteps, t=0, t=1, …

On the t’th timestep the system is in exactly one of the available states. Call it qt

Note: qt ∈ {s1, s2 .. sN}

N = 3

t=0

qt=q0=s3

Current State

A Markov System

Page 12: Hmm viterbi

A Markov System

[Diagram: states s1, s2, s3]

Has N states, called s1, s2 .. sN

There are discrete timesteps, t=0, t=1, …

On the t’th timestep the system is in exactly one of the available states. Call it qt

Note: qt ∈ {s1, s2 .. sN}

Between each timestep, the next state is chosen randomly.

N = 3

t=1

qt=q1=s2

Current State

Page 13: Hmm viterbi

[Diagram: states s1, s2, s3]

Has N states, called s1, s2 .. sN

There are discrete timesteps, t=0, t=1, …

On the t’th timestep the system is in exactly one of the available states. Call it qt

Note: qt ∈ {s1, s2 .. sN}

Between each timestep, the next state is chosen randomly.

The current state determines the probability distribution for the next state.

N = 3

t=1

qt=q1=s2

P(qt+1=s1|qt=s3) = 1/3

P(qt+1=s2|qt=s3) = 2/3

P(qt+1=s3|qt=s3) = 0

P(qt+1=s1|qt=s1) = 0

P(qt+1=s2|qt=s1) = 0

P(qt+1=s3|qt=s1) = 1

P(qt+1=s1|qt=s2) = 1/2

P(qt+1=s2|qt=s2) = 1/2

P(qt+1=s3|qt=s2) = 0

Page 14: Hmm viterbi

[Diagram: states s1, s2, s3]

Has N states, called s1, s2 .. sN

There are discrete timesteps, t=0, t=1, …

On the t’th timestep the system is in exactly one of the available states. Call it qt

Note: qt ∈ {s1, s2 .. sN}

Between each timestep, the next state is chosen randomly.

The current state determines the probability distribution for the next state.

N = 3

t=1

qt=q1=s2

P(qt+1=s1|qt=s3) = 1/3

P(qt+1=s2|qt=s3) = 2/3

P(qt+1=s3|qt=s3) = 0

P(qt+1=s1|qt=s1) = 0

P(qt+1=s2|qt=s1) = 0

P(qt+1=s3|qt=s1) = 1

P(qt+1=s1|qt=s2) = 1/2

P(qt+1=s2|qt=s2) = 1/2

P(qt+1=s3|qt=s2) = 0

Often notated with arcs between states, each arc labeled with its transition probability (1/2, 1/2, 1/3, 2/3, 1).
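The transition table above is easy to sanity-check in code. This is a minimal sketch (not from the deck) that encodes the slide's matrix with rows and columns ordered s1, s2, s3 and multiplies one-step transition probabilities along a path, which is exactly what the Markov property licenses.

```python
# The 3-state Markov system from the slides.
# A[i][j] = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}); rows/cols in order s1, s2, s3.
A = [
    [0.0, 0.0, 1.0],   # from s1: always go to s3
    [0.5, 0.5, 0.0],   # from s2
    [1/3, 2/3, 0.0],   # from s3
]

def path_prob(path):
    """P(path | first state) = product of one-step transition probabilities."""
    p = 1.0
    for prev, nxt in zip(path, path[1:]):
        p *= A[prev][nxt]
    return p

# Starting in s3 (index 2), the path s3 -> s2 -> s1 -> s3
# has probability 2/3 * 1/2 * 1 = 1/3.
print(path_prob([2, 1, 0, 2]))
```

Each row of A sums to 1, as a transition matrix must.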

Page 15: Hmm viterbi

πi = P(q1 = si)

[Diagram: states s1, s2, s3]

qt+1 is conditionally independent of { qt-1, qt-2, … q1, q0 } given qt.

In other words:

P(qt+1 = sj |qt = si ) =

P(qt+1 = sj |qt = si ,any earlier history)

Notation:

N = 3

t=1

qt=q1=s2

P(qt+1=s1|qt=s3) = 1/3

P(qt+1=s2|qt=s3) = 2/3

P(qt+1=s3|qt=s3) = 0

P(qt+1=s1|qt=s1) = 0

P(qt+1=s2|qt=s1) = 0

P(qt+1=s3|qt=s1) = 1

P(qt+1=s1|qt=s2) = 1/2

P(qt+1=s2|qt=s2) = 1/2

P(qt+1=s3|qt=s2) = 0

aij = P(qt+1 = sj | qt = si)

Markov Property

Page 16: Hmm viterbi

πi = P(q1 = si)

[Diagram: states s1, s2, s3]

qt+1 is conditionally independent of { qt-1, qt-2, … q1, q0 } given qt.

In other words:

P(qt+1 = sj |qt = si ) =

P(qt+1 = sj |qt = si ,any earlier history)

Notation:

N = 3

t=1

qt=q1=s2

P(qt+1=s1|qt=s3) = 1/3

P(qt+1=s2|qt=s3) = 2/3

P(qt+1=s3|qt=s3) = 0

P(qt+1=s1|qt=s1) = 0

P(qt+1=s2|qt=s1) = 0

P(qt+1=s3|qt=s1) = 1

P(qt+1=s1|qt=s2) = 1/2

P(qt+1=s2|qt=s2) = 1/2

P(qt+1=s3|qt=s2) = 0

Markov Property

Transition probability: aij = P(qt+1 = sj | qt = si)

Initial probability: πi = P(q1 = si)

Page 17: Hmm viterbi

Example: A Simple Markov Model For Weather Prediction

• Any given day, the weather can be described as being in one of three states:
– State 1: precipitation (rain, snow, hail, etc.)
– State 2: cloudy
– State 3: sunny

Transitions between states are described by the transition matrix.

This model can then be described by the following directed graph

Page 18: Hmm viterbi

Basic Calculations

• Example: What is the probability that the weather for eight consecutive days is "sun-sun-sun-rain-rain-sun-cloudy-sun"?

• Solution: O = (sun, sun, sun, rain, rain, sun, cloudy, sun) = (3, 3, 3, 1, 1, 3, 2, 3), so

  P(O | Model) = P(q1 = 3) · a33 · a33 · a31 · a11 · a13 · a32 · a23
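The transcript does not reproduce the numeric transition matrix, so the sketch below borrows the values of Rabiner's classic weather example; treat them as an assumption, not this deck's data. It carries the solution through to a number.

```python
# Hypothetical transition matrix (Rabiner's weather example, NOT given in
# this transcript): A[i][j] = P(tomorrow = j | today = i);
# 1 = rain/precipitation, 2 = cloudy, 3 = sunny.
A = {
    1: {1: 0.4, 2: 0.3, 3: 0.3},
    2: {1: 0.2, 2: 0.6, 3: 0.2},
    3: {1: 0.1, 2: 0.1, 3: 0.8},
}

O = [3, 3, 3, 1, 1, 3, 2, 3]   # sun sun sun rain rain sun cloudy sun

# P(O | Model) = P(q1 = 3) * a33*a33*a31*a11*a13*a32*a23,
# conditioning on day 1 being sunny, i.e. P(q1 = 3) = 1.
p = 1.0
for today, tomorrow in zip(O, O[1:]):
    p *= A[today][tomorrow]
print(round(p, 7))  # 0.0001536
```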

Page 19: Hmm viterbi

From Markov To Hidden Markov

• The previous model assumes that each state can be uniquely associated with an observable event
– Once an observation is made, the state of the system is then trivially retrieved
– This model, however, is too restrictive to be of practical use for most realistic problems

• To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state
– Each state can produce a number of outputs according to a unique probability distribution, and each distinct output can potentially be generated at any state
– These are known as Hidden Markov Models (HMM), because the state sequence is not directly observable; it can only be approximated from the sequence of observations produced by the system

Page 20: Hmm viterbi

The coin-toss problem

• To illustrate the concept of an HMM, consider the following scenario:
– Assume that you are placed in a room with a curtain
– Behind the curtain there is a person performing a coin-toss experiment
– This person selects one of several coins, and tosses it: heads (H) or tails (T)
– The person tells you the outcome (H, T), but not which coin was used each time

• Your goal is to build a probabilistic model that best explains a sequence of observations O = {o1, o2, o3, o4, …} = {H, T, T, H, …}
– The coins represent the states; these are hidden because you do not know which coin was tossed each time
– The outcome of each toss represents an observation
– A "likely" sequence of coins may be inferred from the observations, but this state sequence will not be unique

Page 21: Hmm viterbi

The Coin Toss Example – 2 coins

Page 22: Hmm viterbi

From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins

Page 23: Hmm viterbi

Hidden model

• As spectators, we cannot tell which coin is being used; all we can observe is the output (head/tail)

• We assume the outputs depend on each coin's tendencies, i.e., its output probabilities

Page 24: Hmm viterbi

Coin Toss Example

[Diagram: hidden state variables C1, C2, …, CL-1, CL (the coins) emit the observed data ("output") P1, P2, …, PL-1, PL (heads/tails), for a sequence of length L]

Page 25: Hmm viterbi

Hidden Markov Models

• Used when states cannot be directly observed; good for noisy data

• Requirements:
– A finite number of states, each with an output probability distribution
– State transition probabilities
– Observed phenomenon, which can be randomly generated given state-associated probabilities

Page 26: Hmm viterbi

HMM Notation (from Rabiner's survey*)

The states are labeled S1, S2, .., SN

For a particular trial…

Let T be the number of observations; T is also the number of states passed through.

O = O1 O2 .. OT is the sequence of observations

Q = q1 q2 .. qT is the notation for a path of states

λ = (N, M, πi, aij, bi(j)) is the specification of an HMM

*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257–286, 1989.

Page 27: Hmm viterbi

HMM Formal Definition

An HMM, λ, is a 5-tuple consisting of:

• N, the number of states

• M, the number of possible observations

• {π1, π2, .. πN}, the starting state probabilities: P(q0 = Si) = πi

• The state transition probabilities P(qt+1 = Sj | qt = Si) = aij, arranged as a matrix:

  a11 a12 … a1N
  a21 a22 … a2N
   :   :     :
  aN1 aN2 … aNN

• The observation probabilities P(Ot = k | qt = Si) = bi(k), arranged as a matrix:

  b1(1) b1(2) … b1(M)
  b2(1) b2(2) … b2(M)
   :     :       :
  bN(1) bN(2) … bN(M)

Page 28: Hmm viterbi

Assumptions

• Markov assumption
– States depend only on the previous state

• Stationary assumption
– Transition probabilities are independent of time ("memoryless")

• Output independence
– Observations are independent of previous observations

Page 29: Hmm viterbi

The three main questions on HMMs

• Evaluation
– What is the probability that the observations were generated by a given model?

• Decoding
– Given a model and a sequence of observations, what is the most likely state sequence?

• Learning
– Given a model and a sequence of observations, how should we modify the model parameters to maximize P(observations | model)?

Page 30: Hmm viterbi

The three main questions on HMMs

1. Evaluation

GIVEN an HMM M and a sequence x,
FIND Prob[ x | M ]

2. Decoding

GIVEN an HMM M and a sequence x,
FIND the sequence π of states that maximizes P[ x, π | M ]

3. Learning

GIVEN an HMM M with unspecified transition/emission probabilities, and a sequence x,
FIND parameters θ = (bi(.), aij) that maximize P[ x | θ ]

Page 31: Hmm viterbi

Let’s not be confused by notation

P[ x | M ]: The probability that sequence x was generated by the model

The model is: architecture (# of states, etc.) + parameters θ = (aij, ei(.))

So P[ x | θ ] and P[ x ] are the same as P[ x | M ], when the architecture and the entire model, respectively, are implied

Similarly, P[ x, π | M ] and P[ x, π ] are the same

In the LEARNING problem we always write P[ x | θ ] to emphasize that we are seeking the θ that maximizes P[ x | θ ]

Page 32: Hmm viterbi

HMMs


Page 34: Hmm viterbi

Specification of an HMM

• N - number of states
– Q = {q1, q2, …, qT} - set of states

• M - the number of symbols (observables)
– O = {o1, o2, …, oT} - set of symbols

Page 35: Hmm viterbi

Specification of an HMM

• A - the state transition probability matrix
– aij = P(qt+1 = j | qt = i)

• B - the observation probability distribution
– bj(k) = P(ot = k | qt = j), 1 ≤ k ≤ M

• π - the initial state distribution


Page 39: Hmm viterbi

Central problems in HMM modelling

• Problem 1, Evaluation:
– Probability of occurrence of a particular observation sequence, O = {o1,…,ok}, given the model: P(O|λ)
– Complicated because of the hidden states
– Useful in sequence classification

Page 40: Hmm viterbi

Central problems in HMM modelling

• Problem 2, Decoding:
– Optimal state sequence to produce the given observations, O = {o1,…,ok}, given the model
– Requires an optimality criterion
– Useful in recognition problems

Page 41: Hmm viterbi

Central problems in HMM modelling

• Problem 3, Learning:
– Determine the optimum model, given a training set of observations
– Find λ such that P(O|λ) is maximal

Page 42: Hmm viterbi

Task: Part-Of-Speech Tagging

• Goal: Assign the correct part-of-speech to each word (and punctuation) in a text.

• Example:

  Two  old  men  bet  on    the  game  .
  CRD  ADJ  NN   VBD  Prep  Det  NN    SYM

• Learn a local model of POS dependencies, usually from pretagged data

Page 43: Hmm viterbi

Hidden Markov Models

• Assume: POS generated as random process, and each POS randomly generates a word

[Diagram: an HMM over POS states Det, NN, NNS, ADJ; arcs between the states carry transition probabilities (0.1–0.9), and each state emits words with output probabilities, e.g. "the" and "a" from Det, "cat" and "bet" from NN, "cats" and "men" from NNS]

Page 44: Hmm viterbi

HMMs For Tagging

• First-order (bigram) Markov assumptions:
– Limited Horizon: Tag depends only on the previous tag
  P(ti+1 = tk | t1 = tj1, …, ti = tji) = P(ti+1 = tk | ti = tj)
– Time invariance: No change over time
  P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj→tk)

• Output probabilities:
– Probability of getting word wk for tag tj: P(wk | tj)
– Assumption: not dependent on other tags or words!

Page 45: Hmm viterbi

Combining Probabilities

• Probability of a tag sequence:

  P(t1 t2 … tn) = P(t1) P(t1→t2) P(t2→t3) … P(tn-1→tn)

  Assuming a starting tag t0:

  = P(t0→t1) P(t1→t2) P(t2→t3) … P(tn-1→tn)

• Probability of a word sequence and tag sequence:

  P(W, T) = ∏i P(ti-1→ti) P(wi | ti)

Page 46: Hmm viterbi

Training from labeled training

• Labeled training = each word has a POS tag

• Thus:

  π(tj) = PMLE(tj) = C(tj) / N

  a(tj→tk) = PMLE(tk | tj) = C(tj, tk) / C(tj)

  b(wk | tj) = PMLE(wk | tj) = C(tj:wk) / C(tj)

• Smoothing applies as usual
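A minimal sketch of these counts-based estimates on a tiny made-up tagged corpus (the corpus, the artificial `<t0>` start tag, and the helper names are all illustrative, not from the slides). The emission estimate follows the form C(tj:wk) / C(tj).

```python
from collections import Counter

# Toy labeled corpus: each sentence is a list of (word, tag) pairs.
corpus = [[("the", "Det"), ("cat", "NN"), ("sleeps", "VB")],
          [("the", "Det"), ("men", "NNS"), ("bet", "VBD")]]

c_tag, c_bigram, c_emit = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "<t0>"                      # artificial starting tag t0
    for word, tag in sent:
        c_tag[tag] += 1
        c_bigram[(prev, tag)] += 1
        c_emit[(tag, word)] += 1
        prev = tag
c_tag["<t0>"] = len(corpus)            # t0 occurs once per sentence

def a(tj, tk):
    """a(tj -> tk) = C(tj, tk) / C(tj)"""
    return c_bigram[(tj, tk)] / c_tag[tj]

def b(word, tj):
    """b(word | tj) = C(tj:word) / C(tj)"""
    return c_emit[(tj, word)] / c_tag[tj]

print(a("<t0>", "Det"), b("the", "Det"))  # 1.0 1.0
```

Smoothing (as the slide notes) would adjust these raw relative frequencies before use.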

Page 47: Hmm viterbi

Three Basic Problems

• Compute the probability of a text: Pλ(W1,N)

• Compute the maximum probability tag sequence: arg maxT1,N Pλ(T1,N | W1,N)

• Compute the maximum likelihood model: arg maxλ Pλ(W1,N)

Page 48: Hmm viterbi

Problem 1: Naïve solution

• State sequence Q = (q1, …, qT)

• Assume independent observations:

  P(O | q, λ) = ∏t=1..T P(ot | qt, λ) = bq1(o1) bq2(o2) … bqT(oT)

NB: Observations are mutually independent, given the hidden states. (The joint distribution of independent variables factorises into the marginal distributions of the independent variables.)

Page 49: Hmm viterbi

Problem 1: Naïve solution

• Observe that:

  P(q | λ) = πq1 aq1q2 aq2q3 … aqT-1qT

• And that:

  P(O | λ) = Σq P(O | q, λ) P(q | λ)

Page 50: Hmm viterbi

Problem 1: Naïve solution

• Finally get:

  P(O | λ) = Σq P(O | q, λ) P(q | λ)

NB:
- The above sum is over all state paths
- There are N^T state paths, each 'costing' O(T) calculations, leading to O(T·N^T) time complexity.
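The O(T·N^T) cost is easy to see by writing the sum literally: enumerate every state path, score it, and add. The toy 2-state model below is made up for illustration (it is not from the slides), and the enumeration is only feasible because T is tiny.

```python
from itertools import product

# Illustrative 2-state, 2-symbol HMM (numbers are not from the slides).
pi = [0.6, 0.4]                       # initial state probabilities
A  = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(q_{t+1}=j | q_t=i)
B  = [[0.9, 0.1], [0.2, 0.8]]         # B[i][k] = P(o_t=k | q_t=i)

def naive_eval(O):
    """P(O | lambda) = sum over all N^T paths q of P(O | q) P(q)."""
    total = 0.0
    for q in product(range(len(pi)), repeat=len(O)):   # all state paths
        p = pi[q[0]] * B[q[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][O[t]]
        total += p
    return total

print(naive_eval([0, 1, 0]))
```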

Page 51: Hmm viterbi

Problem 1: Efficient solution

• Define the auxiliary forward variable α:

  αt(i) = P(o1, …, ot, qt = i | λ)

αt(i) is the probability of observing the partial sequence of observables o1, …, ot such that at time t, state qt = i.

Forward algorithm:

Page 52: Hmm viterbi

Problem 1: Efficient solution

• Recursive algorithm:

– Initialise:

  α1(i) = πi bi(o1)

– Calculate:

  αt+1(j) = [ Σi=1..N αt(i) aij ] bj(ot+1)

  (partial obs seq to t AND state i at t) × (transition to j at t+1) × (sensor). The sum arises because state j can be reached from any preceding state; αt(i) incorporates the partial observation sequence up to t.

– Obtain:

  P(O | λ) = Σi=1..N αT(i)

  (the sum over i covers the different ways of getting the obs seq)

Complexity is O(N²T)

Page 53: Hmm viterbi

Forward Algorithm

Define αk(i) = P(w1,k, tk = ti)

1. For i = 1 To N: α1(i) = a(t0→ti) b(w1 | ti)
2. For k = 2 To T; For j = 1 To N:
   αk(j) = [Σi αk-1(i) a(ti→tj)] b(wk | tj)
3. Then: Pλ(W1,T) = Σi αT(i)

Complexity = O(N²T)
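The recursion above as a short sketch, using an illustrative 2-state model (the numbers are made up, not from the deck). It computes in O(N²T) what the naive sum computes in O(T·N^T).

```python
# Forward algorithm: alpha_t(i) = P(o_1..o_t, q_t = i | lambda).
# Illustrative 2-state, 2-symbol model; numbers are not from the slides.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]         # transition probabilities
B  = [[0.9, 0.1], [0.2, 0.8]]         # emission probabilities

def forward(O):
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]            # initialise
    for t in range(1, len(O)):                                # calculate
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)                                         # obtain

print(forward([0, 1, 0]))  # ≈ 0.10893
```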

Page 54: Hmm viterbi

Forward Algorithm

[Trellis diagram: for each word w1, w2, w3 there is a column of tag nodes t1 … t5 holding the forward values αk(i); the start arcs carry a(t0→ti), arcs between columns carry a(ti→tj), and summing the final column gives Pλ(W1,3)]

Page 55: Hmm viterbi

Problem 1: Alternative solution

• Define the auxiliary backward variable β:

  βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)

βt(i) is the probability of observing the sequence of observables ot+1, …, oT given state qt = i at time t.

Backward algorithm:

Page 56: Hmm viterbi

Problem 1: Alternative solution

• Recursive algorithm:

– Initialise:

  βT(i) = 1

– Calculate:

  βt(i) = Σj=1..N aij bj(ot+1) βt+1(j),   for t = T-1, …, 1

– Terminate:

  P(O | λ) = Σi=1..N πi bi(o1) β1(i)

Complexity is O(N²T)

Page 57: Hmm viterbi

Backward Algorithm

Define βk(i) = P(wk+1,N | tk = ti) -- note the difference!

1. For i = 1 To N: βT(i) = 1
2. For k = T-1 To 1; For j = 1 To N:
   βk(j) = [Σi a(tj→ti) b(wk+1 | ti) βk+1(i)]
3. Then: Pλ(W1,T) = Σi a(t0→ti) b(w1 | ti) β1(i)

Complexity = O(N²T)
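And the backward recursion as a sketch, on the same style of illustrative 2-state model (numbers made up, not from the deck); for any observation sequence it must return exactly the same probability as the forward algorithm.

```python
# Backward algorithm: beta_t(i) = P(o_{t+1}..o_T | q_t = i, lambda).
# Illustrative 2-state, 2-symbol model; numbers are not from the slides.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]

def backward(O):
    N = len(pi)
    beta = [1.0] * N                                          # beta_T(i) = 1
    for t in range(len(O) - 2, -1, -1):                       # t = T-1 .. 1
        beta = [sum(A[i][j] * B[j][O[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
    # terminate: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[i][O[0]] * beta[i] for i in range(N))

print(backward([0, 1, 0]))  # ≈ 0.10893, matching the forward algorithm
```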

Page 58: Hmm viterbi

Backward Algorithm

[Trellis diagram: the same three-column lattice as for the forward algorithm, with the tag nodes t1 … t5 now holding backward values βk(i); arcs carry a(ti→tj), and combining a(t0→ti), b(w1 | ti) and β1(i) over the first column gives Pλ(W1,3)]

Page 59: Hmm viterbi

Viterbi Algorithm (Decoding)

• Most probable tag sequence given text:

  T* = arg maxT Pλ(T | W)
     = arg maxT Pλ(W | T) Pλ(T) / Pλ(W)   (Bayes' Theorem)
     = arg maxT Pλ(W | T) Pλ(T)           (W is constant for all T)
     = arg maxT ∏i [ a(ti-1→ti) b(wi | ti) ]
     = arg maxT Σi log[ a(ti-1→ti) b(wi | ti) ]

Page 60: Hmm viterbi

[Trellis diagram: columns of tag nodes t1, t2, t3 for words w1, w2, w3, starting from t0]

A(,)  t1     t2     t3
t0    0.005  0.02   0.1
t1    0.02   0.1    0.005
t2    0.5    0.0005 0.0005
t3    0.05   0.05   0.005

B(,)  w1     w2     w3
t1    0.2    0.005  0.005
t2    0.02   0.2    0.0005
t3    0.02   0.02   0.05

Page 61: Hmm viterbi

-log A  t1   t2   t3
t0      2.3  1.7  1
t1      1.7  1    2.3
t2      0.3  3.3  3.3
t3      1.3  1.3  2.3

-log B  w1   w2   w3
t1      0.7  2.3  2.3
t2      1.7  0.7  3.3
t3      1.7  1.7  1.3

[Trellis diagram: running the Viterbi recursion with these -log scores gives accumulated node scores (-3, -3.4, -2.7) for (t1, t2, t3) after w1, (-6, -4.7, -6.7) after w2, and (-7.3, -9.3, -10.3) after w3]

Page 62: Hmm viterbi

Viterbi Algorithm

1. D(0, START) = 0
2. for each tag t ≠ START do: D(0, t) = -∞
3. for i ← 1 to N do:
   a. for each tag tj do:
      D(i, tj) ← maxk [ D(i-1, tk) + log b(wi | tj) + log a(tk→tj) ]
4. log P(W, T) = maxj D(N, tj)

(Equivalently, work multiplicatively: D(i, tj) ← maxk D(i-1, tk) · b(wi | tj) · a(tk→tj), taking the log at the end.)
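A sketch of this recursion in log space, with backpointers so the best tag sequence can be recovered; the two-tag model, its probabilities, and the word names below are made up for illustration.

```python
import math

# Illustrative two-tag model (t0 is the start tag); all numbers are made up.
tags = ["t1", "t2"]
a = {("t0", "t1"): 0.7, ("t0", "t2"): 0.3,
     ("t1", "t1"): 0.6, ("t1", "t2"): 0.4,
     ("t2", "t1"): 0.2, ("t2", "t2"): 0.8}
b = {("t1", "w1"): 0.9, ("t1", "w2"): 0.1,
     ("t2", "w1"): 0.3, ("t2", "w2"): 0.7}

def viterbi(words):
    D = {"t0": 0.0}                            # D(0, START) = 0
    back = []                                  # backpointers per position
    for w in words:
        D_new, ptr = {}, {}
        for tj in tags:
            # D(i, tj) = max_k D(i-1, tk) + log b(wi|tj) + log a(tk -> tj)
            tk, score = max(((tk, D[tk] + math.log(b[(tj, w)]) +
                              math.log(a[(tk, tj)])) for tk in D),
                            key=lambda kv: kv[1])
            D_new[tj], ptr[tj] = score, tk
        D, back = D_new, back + [ptr]
    best = max(D, key=D.get)                   # argmax_j D(N, tj)
    path = [best]
    for ptr in reversed(back[1:]):             # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path)), D[best]

path, logp = viterbi(["w1", "w2", "w1"])
print(path)  # ['t1', 't2', 't2']
```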

Page 64: Hmm viterbi

[Diagram: a start state picks the fair or the loaded coin with probability 0.5 each; at each toss the current coin is kept with probability 0.9 or swapped for the other with probability 0.1, and each coin emits Heads or Tails]

Question: Suppose the sequence of our game is: HHHTHHHTTHHTH.

What is the probability of the sequence given the model?
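The question can be answered with the forward algorithm. The 0.5/0.5 start and 0.9/0.1 stay/switch probabilities are read off the diagram; the loaded coin's 0.75/0.25 Heads/Tails bias is an assumed value, since the transcript does not give its emission probabilities.

```python
# Fair/loaded coin HMM. Start and transition probabilities are from the
# diagram; the loaded coin's bias (0.75 / 0.25) is an ASSUMED value.
pi = {"fair": 0.5, "loaded": 0.5}
A  = {"fair":   {"fair": 0.9, "loaded": 0.1},
      "loaded": {"fair": 0.1, "loaded": 0.9}}
B  = {"fair":   {"H": 0.5,  "T": 0.5},
      "loaded": {"H": 0.75, "T": 0.25}}

def seq_prob(seq):
    # Forward recursion: alpha[s] = P(o_1..o_t, q_t = s)
    alpha = {s: pi[s] * B[s][seq[0]] for s in pi}
    for o in seq[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in alpha) * B[j][o]
                 for j in A}
    return sum(alpha.values())

print(seq_prob("HHHTHHHTTHHTH"))
```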

Page 65: Hmm viterbi

Decoding

• Suppose we have a text written by Shakespeare and a monkey. Can we tell who wrote what?

• Text: Shakespeare or Monkey?

• Case 1: – Fehwufhweuromeojulietpoisonjigjreijge

• Case 2:– mmmmbananammmmmmmbananammm