Markov Models & Hidden Markov Models

Uploaded by digvijay-singh on 05-Dec-2014

TRANSCRIPT

Page 1: Hmm viterbi

Markov Models & Hidden Markov Models

Page 2: Hmm viterbi

Time-based Models

• Simple parametric distributions are typically based on what is called the "independence assumption": each data point is independent of the others, and there is no time-sequencing or ordering.

• What if the data has correlations based on its order, like a time-series?

Page 3: Hmm viterbi

States

• An atomic event is an assignment to every random variable in the domain.
• States are atomic events that can transfer from one to another.
• Suppose a model has n states.
• A state-transition diagram describes how the model behaves.

Page 4: Hmm viterbi

State-transition

Following assumptions:
- Transition probabilities are stationary
- The event space does not change over time
- Probability distribution over next states depends only on the current state

Page 5: Hmm viterbi

State-transition

Following assumptions:
- Transition probabilities are stationary
- The event space does not change over time
- Probability distribution over next states depends only on the current state

Markov Assumption

Page 6: Hmm viterbi

Markov random processes

• A random sequence has the Markov property if the distribution of its next state is determined solely by its current state.

• Any random process having this property is called a Markov random process.

• A system with states that obey the Markov assumption is called a Markov Model.

• A sequence of states resulting from such a model is called a Markov Chain.

Page 7: Hmm viterbi

Chain Rule & Markov Property

By the chain rule (repeated application of Bayes' rule):

  P(q1, ..., qt) = P(qt | q1, ..., qt-1) P(q1, ..., qt-1)
                 = P(qt | q1, ..., qt-1) P(qt-1 | q1, ..., qt-2) P(q1, ..., qt-2)
                 = P(q1) ∏i=2..t P(qi | q1, ..., qi-1)

Markov property:

  P(qi | q1, ..., qi-1) = P(qi | qi-1)   for i > 1

Combining the two:

  P(q1, ..., qt) = P(q1) P(q2 | q1) P(q3 | q2) ... P(qt | qt-1) = P(q1) ∏i=2..t P(qi | qi-1)

Page 8: Hmm viterbi

Markov Assumption

• The Markov assumption states that the probability of the occurrence of word wi at time t depends only on the occurrence of word wi-1 at time t-1.

– Chain rule:

  P(w1, ..., wn) = P(w1) ∏i=2..n P(wi | w1, ..., wi-1)

– Markov assumption:

  P(w1, ..., wn) ≈ P(w1) ∏i=2..n P(wi | wi-1)

Page 9: Hmm viterbi

Andrei Andreyevich Markov

Born: 14 June 1856 in Ryazan, Russia. Died: 20 July 1922 in Petrograd (now St Petersburg), Russia.

Markov is particularly remembered for his study of Markov chains: sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes.

Page 10: Hmm viterbi

[Diagram: states s1, s2, s3]

Has N states, called s1, s2 .. sN

There are discrete timesteps, t=0, t=1, …

N = 3

t=0

A Markov System

Page 11: Hmm viterbi

[Diagram: states s1, s2, s3]

Has N states, called s1, s2 .. sN

There are discrete timesteps, t=0, t=1, …

On the t’th timestep the system is in exactly one of the available states. Call it qt

Note: qt ∈ {s1, s2 .. sN}

N = 3

t=0

qt=q0=s3

Current State

A Markov System

Page 12: Hmm viterbi

A Markov System

[Diagram: states s1, s2, s3]

Has N states, called s1, s2 .. sN

There are discrete timesteps, t=0, t=1, …

On the t’th timestep the system is in exactly one of the available states. Call it qt

Note: qt ∈ {s1, s2 .. sN}

Between each timestep, the next state is chosen randomly.

N = 3

t=1

qt=q1=s2

Current State

Page 13: Hmm viterbi

[Diagram: states s1, s2, s3]

Has N states, called s1, s2 .. sN

There are discrete timesteps, t=0, t=1, …

On the t’th timestep the system is in exactly one of the available states. Call it qt

Note: qt ∈ {s1, s2 .. sN}

Between each timestep, the next state is chosen randomly.

The current state determines the probability distribution for the next state.

N = 3

t=1

qt=q1=s2

P(qt+1=s1|qt=s3) = 1/3

P(qt+1=s2|qt=s3) = 2/3

P(qt+1=s3|qt=s3) = 0

P(qt+1=s1|qt=s1) = 0

P(qt+1=s2|qt=s1) = 0

P(qt+1=s3|qt=s1) = 1

P(qt+1=s1|qt=s2) = 1/2

P(qt+1=s2|qt=s2) = 1/2

P(qt+1=s3|qt=s2) = 0

Page 14: Hmm viterbi

[Diagram: states s1, s2, s3]

Has N states, called s1, s2 .. sN

There are discrete timesteps, t=0, t=1, …

On the t’th timestep the system is in exactly one of the available states. Call it qt

Note: qt ∈ {s1, s2 .. sN}

Between each timestep, the next state is chosen randomly.

The current state determines the probability distribution for the next state.

N = 3

t=1

qt=q1=s2

P(qt+1=s1|qt=s3) = 1/3

P(qt+1=s2|qt=s3) = 2/3

P(qt+1=s3|qt=s3) = 0

P(qt+1=s1|qt=s1) = 0

P(qt+1=s2|qt=s1) = 0

P(qt+1=s3|qt=s1) = 1

P(qt+1=s1|qt=s2) = 1/2

P(qt+1=s2|qt=s2) = 1/2

P(qt+1=s3|qt=s2) = 0

Often notated with arcs between states, each arc labeled with its transition probability (1/2, 1/2, 1/3, 2/3, 1).
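The transition table above is easy to sanity-check in code. This is a minimal sketch (not from the deck) that encodes the slide's matrix with rows and columns ordered s1, s2, s3 and multiplies one-step transition probabilities along a path, which is exactly what the Markov property licenses.

```python
# The 3-state Markov system from the slides.
# A[i][j] = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}); rows/cols in order s1, s2, s3.
A = [
    [0.0, 0.0, 1.0],   # from s1: always go to s3
    [0.5, 0.5, 0.0],   # from s2
    [1/3, 2/3, 0.0],   # from s3
]

def path_prob(path):
    """P(path | first state) = product of one-step transition probabilities."""
    p = 1.0
    for prev, nxt in zip(path, path[1:]):
        p *= A[prev][nxt]
    return p

# Starting in s3 (index 2), the path s3 -> s2 -> s1 -> s3
# has probability 2/3 * 1/2 * 1 = 1/3.
print(path_prob([2, 1, 0, 2]))
```

Each row of A sums to 1, as a transition matrix must.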

Page 15: Hmm viterbi

πi = P(q1 = si)

[Diagram: states s1, s2, s3]

qt+1 is conditionally independent of { qt-1, qt-2, … q1, q0 } given qt.

In other words:

P(qt+1 = sj |qt = si ) =

P(qt+1 = sj |qt = si ,any earlier history)

Notation:

N = 3

t=1

qt=q1=s2

P(qt+1=s1|qt=s3) = 1/3

P(qt+1=s2|qt=s3) = 2/3

P(qt+1=s3|qt=s3) = 0

P(qt+1=s1|qt=s1) = 0

P(qt+1=s2|qt=s1) = 0

P(qt+1=s3|qt=s1) = 1

P(qt+1=s1|qt=s2) = 1/2

P(qt+1=s2|qt=s2) = 1/2

P(qt+1=s3|qt=s2) = 0

aij = P(qt+1 = sj | qt = si)

Markov Property

Page 16: Hmm viterbi

πi = P(q1 = si)

[Diagram: states s1, s2, s3]

qt+1 is conditionally independent of { qt-1, qt-2, … q1, q0 } given qt.

In other words:

P(qt+1 = sj |qt = si ) =

P(qt+1 = sj |qt = si ,any earlier history)

Notation:

N = 3

t=1

qt=q1=s2

P(qt+1=s1|qt=s3) = 1/3

P(qt+1=s2|qt=s3) = 2/3

P(qt+1=s3|qt=s3) = 0

P(qt+1=s1|qt=s1) = 0

P(qt+1=s2|qt=s1) = 0

P(qt+1=s3|qt=s1) = 1

P(qt+1=s1|qt=s2) = 1/2

P(qt+1=s2|qt=s2) = 1/2

P(qt+1=s3|qt=s2) = 0

Markov Property

Transition probability: aij = P(qt+1 = sj | qt = si)

Initial probability: πi = P(q1 = si)

Page 17: Hmm viterbi

Example: A Simple Markov Model For Weather Prediction

• Any given day, the weather can be described as being in one of three states:
– State 1: precipitation (rain, snow, hail, etc.)
– State 2: cloudy
– State 3: sunny

Transitions between states are described by the transition matrix.

This model can then be described by the following directed graph

Page 18: Hmm viterbi

Basic Calculations

• Example: What is the probability that the weather for eight consecutive days is "sun-sun-sun-rain-rain-sun-cloudy-sun"?

• Solution: O = (sun, sun, sun, rain, rain, sun, cloudy, sun) = (3, 3, 3, 1, 1, 3, 2, 3), so

  P(O | Model) = P(q1 = 3) · a33 · a33 · a31 · a11 · a13 · a32 · a23
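The transcript does not reproduce the numeric transition matrix, so the sketch below borrows the values of Rabiner's classic weather example; treat them as an assumption, not this deck's data. It carries the solution through to a number.

```python
# Hypothetical transition matrix (Rabiner's weather example, NOT given in
# this transcript): A[i][j] = P(tomorrow = j | today = i);
# 1 = rain/precipitation, 2 = cloudy, 3 = sunny.
A = {
    1: {1: 0.4, 2: 0.3, 3: 0.3},
    2: {1: 0.2, 2: 0.6, 3: 0.2},
    3: {1: 0.1, 2: 0.1, 3: 0.8},
}

O = [3, 3, 3, 1, 1, 3, 2, 3]   # sun sun sun rain rain sun cloudy sun

# P(O | Model) = P(q1 = 3) * a33*a33*a31*a11*a13*a32*a23,
# conditioning on day 1 being sunny, i.e. P(q1 = 3) = 1.
p = 1.0
for today, tomorrow in zip(O, O[1:]):
    p *= A[today][tomorrow]
print(round(p, 7))  # 0.0001536
```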

Page 19: Hmm viterbi

From Markov To Hidden Markov

• The previous model assumes that each state can be uniquely associated with an observable event
– Once an observation is made, the state of the system is then trivially retrieved
– This model, however, is too restrictive to be of practical use for most realistic problems

• To make the model more flexible, we will assume that the outcomes or observations of the model are a probabilistic function of each state
– Each state can produce a number of outputs according to a unique probability distribution, and each distinct output can potentially be generated at any state
– These are known as Hidden Markov Models (HMM), because the state sequence is not directly observable; it can only be approximated from the sequence of observations produced by the system

Page 20: Hmm viterbi

The coin-toss problem

• To illustrate the concept of an HMM, consider the following scenario:
– Assume that you are placed in a room with a curtain
– Behind the curtain there is a person performing a coin-toss experiment
– This person selects one of several coins, and tosses it: heads (H) or tails (T)
– The person tells you the outcome (H, T), but not which coin was used each time

• Your goal is to build a probabilistic model that best explains a sequence of observations O = {o1, o2, o3, o4, …} = {H, T, T, H, …}
– The coins represent the states; these are hidden because you do not know which coin was tossed each time
– The outcome of each toss represents an observation
– A "likely" sequence of coins may be inferred from the observations, but this state sequence will not be unique

Page 21: Hmm viterbi

The Coin Toss Example – 2 coins

Page 22: Hmm viterbi

From Markov to Hidden Markov Model: The Coin Toss Example – 3 coins

Page 23: Hmm viterbi

Hidden model

• As spectators, we cannot tell which coin is being used; all we can observe is the output (head/tail)

• We assume the outputs depend on each coin's tendencies, i.e., its output probabilities

Page 24: Hmm viterbi

Coin Toss Example

[Diagram: hidden state variables C1, C2, …, CL-1, CL (the coins) emit the observed data ("output") P1, P2, …, PL-1, PL (heads/tails), for a sequence of length L]

Page 25: Hmm viterbi

Hidden Markov Models

• Used when states cannot be directly observed; good for noisy data

• Requirements:
– A finite number of states, each with an output probability distribution
– State transition probabilities
– Observed phenomenon, which can be randomly generated given state-associated probabilities

Page 26: Hmm viterbi

HMM Notation (from Rabiner's survey*)

The states are labeled S1, S2, .., SN

For a particular trial…

Let T be the number of observations; T is also the number of states passed through.

O = O1 O2 .. OT is the sequence of observations

Q = q1 q2 .. qT is the notation for a path of states

λ = (N, M, πi, aij, bi(j)) is the specification of an HMM

*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257–286, 1989.

Page 27: Hmm viterbi

HMM Formal Definition

An HMM, λ, is a 5-tuple consisting of:

• N, the number of states

• M, the number of possible observations

• {π1, π2, .. πN}, the starting state probabilities: P(q0 = Si) = πi

• The state transition probabilities P(qt+1 = Sj | qt = Si) = aij, arranged as a matrix:

  a11 a12 … a1N
  a21 a22 … a2N
   :   :     :
  aN1 aN2 … aNN

• The observation probabilities P(Ot = k | qt = Si) = bi(k), arranged as a matrix:

  b1(1) b1(2) … b1(M)
  b2(1) b2(2) … b2(M)
   :     :       :
  bN(1) bN(2) … bN(M)

Page 28: Hmm viterbi

Assumptions

• Markov assumption
– States depend only on the previous state

• Stationary assumption
– Transition probabilities are independent of time ("memoryless")

• Output independence
– Observations are independent of previous observations

Page 29: Hmm viterbi

The three main questions on HMMs

• Evaluation
– What is the probability that the observations were generated by a given model?

• Decoding
– Given a model and a sequence of observations, what is the most likely state sequence?

• Learning
– Given a model and a sequence of observations, how should we modify the model parameters to maximize P(observations | model)?

Page 30: Hmm viterbi

The three main questions on HMMs

1. Evaluation

GIVEN an HMM M and a sequence x,
FIND Prob[ x | M ]

2. Decoding

GIVEN an HMM M and a sequence x,
FIND the sequence π of states that maximizes P[ x, π | M ]

3. Learning

GIVEN an HMM M with unspecified transition/emission probabilities, and a sequence x,
FIND parameters θ = (bi(.), aij) that maximize P[ x | θ ]

Page 31: Hmm viterbi

Let’s not be confused by notation

P[ x | M ]: The probability that sequence x was generated by the model

The model is: architecture (# of states, etc.) + parameters θ = (aij, ei(.))

So P[ x | θ ] and P[ x ] are the same as P[ x | M ], when the architecture and the entire model, respectively, are implied

Similarly, P[ x, π | M ] and P[ x, π ] are the same

In the LEARNING problem we always write P[ x | θ ] to emphasize that we are seeking the θ that maximizes P[ x | θ ]

Page 32: Hmm viterbi

HMMs


Page 34: Hmm viterbi

Specification of an HMM

• N - number of states
– Q = {q1, q2, …, qT} - set of states

• M - the number of symbols (observables)
– O = {o1, o2, …, oT} - set of symbols

Page 35: Hmm viterbi

Specification of an HMM

• A - the state transition probability matrix
– aij = P(qt+1 = j | qt = i)

• B - the observation probability distribution
– bj(k) = P(ot = k | qt = j), 1 ≤ k ≤ M

• π - the initial state distribution


Page 39: Hmm viterbi

Central problems in HMM modelling

• Problem 1, Evaluation:
– Probability of occurrence of a particular observation sequence, O = {o1,…,ok}, given the model: P(O|λ)
– Complicated because of the hidden states
– Useful in sequence classification

Page 40: Hmm viterbi

Central problems in HMM modelling

• Problem 2, Decoding:
– Optimal state sequence to produce the given observations, O = {o1,…,ok}, given the model
– Requires an optimality criterion
– Useful in recognition problems

Page 41: Hmm viterbi

Central problems in HMM modelling

• Problem 3, Learning:
– Determine the optimum model, given a training set of observations
– Find λ such that P(O|λ) is maximal

Page 42: Hmm viterbi

Task: Part-Of-Speech Tagging

• Goal: Assign the correct part-of-speech to each word (and punctuation) in a text.

• Example:

  Two  old  men  bet  on    the  game  .
  CRD  ADJ  NN   VBD  Prep  Det  NN    SYM

• Learn a local model of POS dependencies, usually from pretagged data

Page 43: Hmm viterbi

Hidden Markov Models

• Assume: POS generated as random process, and each POS randomly generates a word

[Diagram: an HMM over POS states Det, NN, NNS, ADJ; arcs between the states carry transition probabilities (0.1–0.9), and each state emits words with output probabilities, e.g. "the" and "a" from Det, "cat" and "bet" from NN, "cats" and "men" from NNS]

Page 44: Hmm viterbi

HMMs For Tagging

• First-order (bigram) Markov assumptions:
– Limited Horizon: Tag depends only on the previous tag
  P(ti+1 = tk | t1 = tj1, …, ti = tji) = P(ti+1 = tk | ti = tj)
– Time invariance: No change over time
  P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj→tk)

• Output probabilities:
– Probability of getting word wk for tag tj: P(wk | tj)
– Assumption: not dependent on other tags or words!

Page 45: Hmm viterbi

Combining Probabilities

• Probability of a tag sequence:

  P(t1 t2 … tn) = P(t1) P(t1→t2) P(t2→t3) … P(tn-1→tn)

  Assuming a starting tag t0:

  = P(t0→t1) P(t1→t2) P(t2→t3) … P(tn-1→tn)

• Probability of a word sequence and tag sequence:

  P(W, T) = ∏i P(ti-1→ti) P(wi | ti)

Page 46: Hmm viterbi

Training from labeled training

• Labeled training = each word has a POS tag

• Thus:

  π(tj) = PMLE(tj) = C(tj) / N

  a(tj→tk) = PMLE(tk | tj) = C(tj, tk) / C(tj)

  b(wk | tj) = PMLE(wk | tj) = C(tj:wk) / C(tj)

• Smoothing applies as usual
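A minimal sketch of these counts-based estimates on a tiny made-up tagged corpus (the corpus, the artificial `<t0>` start tag, and the helper names are all illustrative, not from the slides). The emission estimate follows the form C(tj:wk) / C(tj).

```python
from collections import Counter

# Toy labeled corpus: each sentence is a list of (word, tag) pairs.
corpus = [[("the", "Det"), ("cat", "NN"), ("sleeps", "VB")],
          [("the", "Det"), ("men", "NNS"), ("bet", "VBD")]]

c_tag, c_bigram, c_emit = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "<t0>"                      # artificial starting tag t0
    for word, tag in sent:
        c_tag[tag] += 1
        c_bigram[(prev, tag)] += 1
        c_emit[(tag, word)] += 1
        prev = tag
c_tag["<t0>"] = len(corpus)            # t0 occurs once per sentence

def a(tj, tk):
    """a(tj -> tk) = C(tj, tk) / C(tj)"""
    return c_bigram[(tj, tk)] / c_tag[tj]

def b(word, tj):
    """b(word | tj) = C(tj:word) / C(tj)"""
    return c_emit[(tj, word)] / c_tag[tj]

print(a("<t0>", "Det"), b("the", "Det"))  # 1.0 1.0
```

Smoothing (as the slide notes) would adjust these raw relative frequencies before use.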

Page 47: Hmm viterbi

Three Basic Problems

• Compute the probability of a text: Pλ(W1,N)

• Compute the maximum probability tag sequence: arg maxT1,N Pλ(T1,N | W1,N)

• Compute the maximum likelihood model: arg maxλ Pλ(W1,N)

Page 48: Hmm viterbi

Problem 1: Naïve solution

• State sequence Q = (q1, …, qT)

• Assume independent observations:

  P(O | q, λ) = ∏t=1..T P(ot | qt, λ) = bq1(o1) bq2(o2) … bqT(oT)

NB: Observations are mutually independent, given the hidden states. (The joint distribution of independent variables factorises into the marginal distributions of the independent variables.)

Page 49: Hmm viterbi

Problem 1: Naïve solution

• Observe that:

  P(q | λ) = πq1 aq1q2 aq2q3 … aqT-1qT

• And that:

  P(O | λ) = Σq P(O | q, λ) P(q | λ)

Page 50: Hmm viterbi

Problem 1: Naïve solution

• Finally get:

  P(O | λ) = Σq P(O | q, λ) P(q | λ)

NB:
- The above sum is over all state paths
- There are N^T state paths, each 'costing' O(T) calculations, leading to O(T·N^T) time complexity.
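The O(T·N^T) cost is easy to see by writing the sum literally: enumerate every state path, score it, and add. The toy 2-state model below is made up for illustration (it is not from the slides), and the enumeration is only feasible because T is tiny.

```python
from itertools import product

# Illustrative 2-state, 2-symbol HMM (numbers are not from the slides).
pi = [0.6, 0.4]                       # initial state probabilities
A  = [[0.7, 0.3], [0.4, 0.6]]         # A[i][j] = P(q_{t+1}=j | q_t=i)
B  = [[0.9, 0.1], [0.2, 0.8]]         # B[i][k] = P(o_t=k | q_t=i)

def naive_eval(O):
    """P(O | lambda) = sum over all N^T paths q of P(O | q) P(q)."""
    total = 0.0
    for q in product(range(len(pi)), repeat=len(O)):   # all state paths
        p = pi[q[0]] * B[q[0]][O[0]]
        for t in range(1, len(O)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][O[t]]
        total += p
    return total

print(naive_eval([0, 1, 0]))
```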

Page 51: Hmm viterbi

Problem 1: Efficient solution

• Define the auxiliary forward variable α:

  αt(i) = P(o1, …, ot, qt = i | λ)

αt(i) is the probability of observing the partial sequence of observables o1, …, ot such that at time t, state qt = i.

Forward algorithm:

Page 52: Hmm viterbi

Problem 1: Efficient solution

• Recursive algorithm:

– Initialise:

  α1(i) = πi bi(o1)

– Calculate:

  αt+1(j) = [ Σi=1..N αt(i) aij ] bj(ot+1)

  (partial obs seq to t AND state i at t) × (transition to j at t+1) × (sensor). The sum arises because state j can be reached from any preceding state; αt(i) incorporates the partial observation sequence up to t.

– Obtain:

  P(O | λ) = Σi=1..N αT(i)

  (the sum over i covers the different ways of getting the obs seq)

Complexity is O(N²T)

Page 53: Hmm viterbi

Forward Algorithm

Define αk(i) = P(w1,k, tk = ti)

1. For i = 1 To N: α1(i) = a(t0→ti) b(w1 | ti)
2. For k = 2 To T; For j = 1 To N:
   αk(j) = [Σi αk-1(i) a(ti→tj)] b(wk | tj)
3. Then: Pλ(W1,T) = Σi αT(i)

Complexity = O(N²T)
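The recursion above as a short sketch, using an illustrative 2-state model (the numbers are made up, not from the deck). It computes in O(N²T) what the naive sum computes in O(T·N^T).

```python
# Forward algorithm: alpha_t(i) = P(o_1..o_t, q_t = i | lambda).
# Illustrative 2-state, 2-symbol model; numbers are not from the slides.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]         # transition probabilities
B  = [[0.9, 0.1], [0.2, 0.8]]         # emission probabilities

def forward(O):
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]            # initialise
    for t in range(1, len(O)):                                # calculate
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                 for j in range(N)]
    return sum(alpha)                                         # obtain

print(forward([0, 1, 0]))  # ≈ 0.10893
```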

Page 54: Hmm viterbi

Forward Algorithm

[Trellis diagram: for each word w1, w2, w3 there is a column of tag nodes t1 … t5 holding the forward values αk(i); the start arcs carry a(t0→ti), arcs between columns carry a(ti→tj), and summing the final column gives Pλ(W1,3)]

Page 55: Hmm viterbi

Problem 1: Alternative solution

• Define the auxiliary backward variable β:

  βt(i) = P(ot+1, ot+2, …, oT | qt = i, λ)

βt(i) is the probability of observing the sequence of observables ot+1, …, oT given state qt = i at time t.

Backward algorithm:

Page 56: Hmm viterbi

Problem 1: Alternative solution

• Recursive algorithm:

– Initialise:

  βT(i) = 1

– Calculate:

  βt(i) = Σj=1..N aij bj(ot+1) βt+1(j),   for t = T-1, …, 1

– Terminate:

  P(O | λ) = Σi=1..N πi bi(o1) β1(i)

Complexity is O(N²T)

Page 57: Hmm viterbi

Backward Algorithm

Define βk(i) = P(wk+1,N | tk = ti) -- note the difference!

1. For i = 1 To N: βT(i) = 1
2. For k = T-1 To 1; For j = 1 To N:
   βk(j) = [Σi a(tj→ti) b(wk+1 | ti) βk+1(i)]
3. Then: Pλ(W1,T) = Σi a(t0→ti) b(w1 | ti) β1(i)

Complexity = O(N²T)
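And the backward recursion as a sketch, on the same style of illustrative 2-state model (numbers made up, not from the deck); for any observation sequence it must return exactly the same probability as the forward algorithm.

```python
# Backward algorithm: beta_t(i) = P(o_{t+1}..o_T | q_t = i, lambda).
# Illustrative 2-state, 2-symbol model; numbers are not from the slides.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]

def backward(O):
    N = len(pi)
    beta = [1.0] * N                                          # beta_T(i) = 1
    for t in range(len(O) - 2, -1, -1):                       # t = T-1 .. 1
        beta = [sum(A[i][j] * B[j][O[t + 1]] * beta[j] for j in range(N))
                for i in range(N)]
    # terminate: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
    return sum(pi[i] * B[i][O[0]] * beta[i] for i in range(N))

print(backward([0, 1, 0]))  # ≈ 0.10893, matching the forward algorithm
```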

Page 58: Hmm viterbi

Backward Algorithm

[Trellis diagram: the same three-column lattice as for the forward algorithm, with the tag nodes t1 … t5 now holding backward values βk(i); arcs carry a(ti→tj), and combining a(t0→ti), b(w1 | ti) and β1(i) over the first column gives Pλ(W1,3)]

Page 59: Hmm viterbi

Viterbi Algorithm (Decoding)

• Most probable tag sequence given text:

  T* = arg maxT Pλ(T | W)
     = arg maxT Pλ(W | T) Pλ(T) / Pλ(W)   (Bayes' Theorem)
     = arg maxT Pλ(W | T) Pλ(T)           (W is constant for all T)
     = arg maxT ∏i [ a(ti-1→ti) b(wi | ti) ]
     = arg maxT Σi log[ a(ti-1→ti) b(wi | ti) ]

Page 60: Hmm viterbi

[Trellis diagram: columns of tag nodes t1, t2, t3 for words w1, w2, w3, starting from t0]

A(,)  t1     t2     t3
t0    0.005  0.02   0.1
t1    0.02   0.1    0.005
t2    0.5    0.0005 0.0005
t3    0.05   0.05   0.005

B(,)  w1     w2     w3
t1    0.2    0.005  0.005
t2    0.02   0.2    0.0005
t3    0.02   0.02   0.05

Page 61: Hmm viterbi

-log A  t1   t2   t3
t0      2.3  1.7  1
t1      1.7  1    2.3
t2      0.3  3.3  3.3
t3      1.3  1.3  2.3

-log B  w1   w2   w3
t1      0.7  2.3  2.3
t2      1.7  0.7  3.3
t3      1.7  1.7  1.3

[Trellis diagram: running the Viterbi recursion with these -log scores gives accumulated node scores (-3, -3.4, -2.7) for (t1, t2, t3) after w1, (-6, -4.7, -6.7) after w2, and (-7.3, -9.3, -10.3) after w3]

Page 62: Hmm viterbi

Viterbi Algorithm

1. D(0, START) = 0
2. for each tag t ≠ START do: D(0, t) = -∞
3. for i ← 1 to N do:
   a. for each tag tj do:
      D(i, tj) ← maxk [ D(i-1, tk) + log b(wi | tj) + log a(tk→tj) ]
4. log P(W, T) = maxj D(N, tj)

(Equivalently, work multiplicatively: D(i, tj) ← maxk D(i-1, tk) · b(wi | tj) · a(tk→tj), taking the log at the end.)
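A sketch of this recursion in log space, with backpointers so the best tag sequence can be recovered; the two-tag model, its probabilities, and the word names below are made up for illustration.

```python
import math

# Illustrative two-tag model (t0 is the start tag); all numbers are made up.
tags = ["t1", "t2"]
a = {("t0", "t1"): 0.7, ("t0", "t2"): 0.3,
     ("t1", "t1"): 0.6, ("t1", "t2"): 0.4,
     ("t2", "t1"): 0.2, ("t2", "t2"): 0.8}
b = {("t1", "w1"): 0.9, ("t1", "w2"): 0.1,
     ("t2", "w1"): 0.3, ("t2", "w2"): 0.7}

def viterbi(words):
    D = {"t0": 0.0}                            # D(0, START) = 0
    back = []                                  # backpointers per position
    for w in words:
        D_new, ptr = {}, {}
        for tj in tags:
            # D(i, tj) = max_k D(i-1, tk) + log b(wi|tj) + log a(tk -> tj)
            tk, score = max(((tk, D[tk] + math.log(b[(tj, w)]) +
                              math.log(a[(tk, tj)])) for tk in D),
                            key=lambda kv: kv[1])
            D_new[tj], ptr[tj] = score, tk
        D, back = D_new, back + [ptr]
    best = max(D, key=D.get)                   # argmax_j D(N, tj)
    path = [best]
    for ptr in reversed(back[1:]):             # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path)), D[best]

path, logp = viterbi(["w1", "w2", "w1"])
print(path)  # ['t1', 't2', 't2']
```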

Page 64: Hmm viterbi

[Diagram: a start state picks the fair or the loaded coin with probability 0.5 each; at each toss the current coin is kept with probability 0.9 or swapped for the other with probability 0.1, and each coin emits Heads or Tails]

Question: Suppose the sequence of our game is: HHHTHHHTTHHTH.

What is the probability of the sequence given the model?
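The question can be answered with the forward algorithm. The 0.5/0.5 start and 0.9/0.1 stay/switch probabilities are read off the diagram; the loaded coin's 0.75/0.25 Heads/Tails bias is an assumed value, since the transcript does not give its emission probabilities.

```python
# Fair/loaded coin HMM. Start and transition probabilities are from the
# diagram; the loaded coin's bias (0.75 / 0.25) is an ASSUMED value.
pi = {"fair": 0.5, "loaded": 0.5}
A  = {"fair":   {"fair": 0.9, "loaded": 0.1},
      "loaded": {"fair": 0.1, "loaded": 0.9}}
B  = {"fair":   {"H": 0.5,  "T": 0.5},
      "loaded": {"H": 0.75, "T": 0.25}}

def seq_prob(seq):
    # Forward recursion: alpha[s] = P(o_1..o_t, q_t = s)
    alpha = {s: pi[s] * B[s][seq[0]] for s in pi}
    for o in seq[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in alpha) * B[j][o]
                 for j in A}
    return sum(alpha.values())

print(seq_prob("HHHTHHHTTHHTH"))
```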

Page 65: Hmm viterbi

Decoding

• Suppose we have a text written by Shakespeare and a monkey. Can we tell who wrote what?

• Text: Shakespeare or Monkey?

• Case 1: – Fehwufhweuromeojulietpoisonjigjreijge

• Case 2:– mmmmbananammmmmmmbananammm