Hidden Markov Models
read.pudn.com/.../555264/lecture9-hidden-markov-models.pdf · 2008-04-12
TRANSCRIPT
Hidden Markov Models

Reference: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, L. R. Rabiner, Proceedings of the IEEE, Vol. 77, No. 2, 1989
Outline
- Introduction
- Markov Models
- Hidden Markov Models
- Forward/Backward Algorithms
- Viterbi Algorithm
- Baum-Welch estimation algorithm
Introduction
- Input consists of a sequence of signals
- Types of signal models
  - Deterministic models: sine wave, sum of exponentials
  - Statistical models: Gaussian process, Markov process, hidden Markov process
- Examples of applications
  - Speech recognition
  - Word-sense disambiguation
  - DNA sequence modeling
  - Text modeling and information extraction
Markov Models
Markov Models
- States are observable
Markov Models
Weather Model
- States
  - R: Rainy, C: Cloudy, S: Sunny
- State transition probability matrix
- What is the probability of observing O = SSRRSCS given that today is S?
Weather Model
- Basic rule: P(A, B) = P(A|B)·P(B)
- Markov chain rule:
  P(q1,q2,…,qT) = P(q1)·P(q2|q1)·P(q3|q2)·…·P(qT|qT-1)
Weather Model
- Observation sequence O = (S, S, S, R, R, S, C, S)
- By the chain rule:
  P(O|Model) = P(S)·P(S|S)·P(S|S)·P(R|S)·P(R|R)·P(S|R)·P(C|S)·P(S|C)
             = πS·aSS·aSS·aSR·aRR·aRS·aSC·aCS
- initial probability: πi = P(q1=i)
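The chain-rule computation of a sequence probability can be sketched in code. The transition probabilities below are assumed example values; the slide's actual matrix was in a figure that did not survive extraction.

```python
# Markov chain probability of a weather sequence via the chain rule.
# The transition matrix is an assumed example, not the slide's actual numbers.
states = {"S": 0, "C": 1, "R": 2}          # Sunny, Cloudy, Rainy
A = [[0.8, 0.1, 0.1],                      # A[i][j] = P(next=j | current=i)
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]

def sequence_prob(obs, first_prob=1.0):
    """P(o1,...,oT) = P(o1) * product of A[o_{t-1}][o_t]."""
    p = first_prob
    for prev, cur in zip(obs, obs[1:]):
        p *= A[states[prev]][states[cur]]
    return p

# "Given that today is S" conditions on the first state, so P(o1) = 1.
print(sequence_prob("SSRRSCS"))
```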
Duration
- What is the probability that the sequence remains in state i for exactly d time units?
  pi(d) = (aii)^(d-1)·(1-aii)
- The duration density decays exponentially (a geometric distribution)
- Expected duration in state i:
  E[di] = Σd=1..∞ d·pi(d) = 1/(1-aii)
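The duration density pi(d) = aii^(d-1)·(1-aii) and its mean 1/(1-aii) can be checked numerically; the self-transition probability aii = 0.8 below is an assumed example value.

```python
# Numeric check of the state-duration formulas; a_ii = 0.8 is an assumed example.
a_ii = 0.8

def p_duration(d):
    # p_i(d) = a_ii^(d-1) * (1 - a_ii): stay d-1 more steps, then leave
    return a_ii ** (d - 1) * (1.0 - a_ii)

total = sum(p_duration(d) for d in range(1, 1000))     # density sums to ~1
mean = sum(d * p_duration(d) for d in range(1, 1000))  # mean is ~1/(1-a_ii) = 5
print(total, mean)
```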
Duration
Hidden Markov Models
- States are not observable
- Observations are probabilistic functions of states
- State transitions are still probabilistic
Coin Toss Models
- Scenario: You are in a room with a barrier through which you cannot see what is happening. On the other side of the barrier another person is performing a coin (or multiple-coin) tossing experiment. He will only tell you the result of each coin flip.
- The problem: how do we build a model to explain the observed sequence of heads and tails?
Coin Toss Models
- Observation: a sequence of heads and tails
- Build an HMM to explain the observed sequence
  - What do the states correspond to?
  - How many states? (How many coins?)
  - What are the parameters?
One-coin Model
- Each state corresponds to a side of the coin (observation generator)
- Observable Markov model
- Corresponds to a 1-state HMM
Two-coin Model
- Each state corresponds to a biased coin
- Hidden Markov model
- The transition matrix is estimated from a set of independent coin tosses
Three-coin Model
Model Selection
- Which model best matches the actual observations?
  - 1-coin model: 1 unknown parameter
  - 2-coin model: 4 unknown parameters
  - 3-coin model: 9 unknown parameters
- Larger HMMs will match the observations better than smaller HMMs
- Therefore impose strong limitations on the size of the model
Urn and Ball Model
Headers of Scientific Papers
- Citation index
- Citation database
- Each state corresponds to one component of the paper header.
- Application: information extraction
DNA Sequence Modeling
- Each state corresponds to one position.
- Application: profile HMM
Elements of an HMM
- Q = {1,2,…,N}: set of hidden states
- V = {1,2,…,M}: set of observation symbols
- A: state transition probability matrix, aij = P(qt+1=j | qt=i)
- B: observation symbol probabilities, bj(k) = P(ot=k | qt=j)
- π: initial state distribution, πi = P(q1=i)
- λ = (A, B, π): the entire model
Sequence Generator
Generate a sequence of T observations O = (o1, o2, …, oT):
1. Choose an initial state q1 = Si according to the initial state distribution π, and set t = 1
2. Choose ot = vk according to the symbol probability distribution in state Si, i.e. bi(k)
3. Transit to a new state qt+1 = Sj according to the state transition probability distribution for state Si, i.e. aij
4. Set t = t+1; go to step 2 if t < T; otherwise terminate the procedure
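The generation procedure above can be sketched as follows. The two-state model parameters are illustrative values, not from the slides.

```python
import random

# Sampling an observation sequence from an HMM; toy parameters assumed.
pi = [0.6, 0.4]                        # initial state distribution
A  = [[0.7, 0.3], [0.4, 0.6]]          # A[i][j] = P(q_{t+1}=j | q_t=i)
B  = [[0.9, 0.1], [0.2, 0.8]]          # B[i][k] = P(o_t=k | q_t=i)

def generate(T, seed=None):
    rng = random.Random(seed)
    states, obs = [], []
    q = rng.choices(range(len(pi)), weights=pi)[0]            # step 1
    for _ in range(T):
        states.append(q)
        o = rng.choices(range(len(B[q])), weights=B[q])[0]    # step 2
        obs.append(o)
        q = rng.choices(range(len(A[q])), weights=A[q])[0]    # step 3
    return states, obs

states, obs = generate(10, seed=42)
print(states, obs)
```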
Execution of HMM
A state sequence corresponds to a path in the grid
Three Basic Problems
1. Compute the probability that the model generates the observation sequence
2. Find the optimal state sequence that generates the observation sequence
3. Learn an HMM that best fits the observation sequences
Basic Problem 1
- Given observation O = (o1,o2,…,oT) and model λ = (A,B,π), efficiently compute P(O|λ)
- P(O|λ) is the probability that O is produced by λ
- Hidden states complicate the probability evaluation
- Given two models λ1 and λ2, the probability (score) can be used to choose the better one
  - e.g. each λi models some protein family, O denotes a protein: find the most probable protein family for O
  - speech recognition
  - on-line handwritten character recognition
Basic Problem 2
- Given observation O = (o1,o2,…,oT) and model λ = (A,B,π), find the optimal state sequence q = (q1,q2,…,qT)
  - to uncover the hidden part of the model
- An optimality criterion has to be chosen (e.g. maximum likelihood)
- Find an "explanation" for the data
  - O is the header of some scientific paper: find the title, author, publication date, … of the paper
  - a fundamental problem in citation index generation
  - word-sense disambiguation, gene finding
Basic Problem 3
- Given observation O = (o1,o2,…,oT), estimate the model parameters λ = (A,B,π) that maximize P(O|λ)
  - to train the model
  - find the best topology
  - find the best parameters
Word Speech Recognizer
- The speech signal of each word is represented as a time sequence of coded spectral vectors
- Build an HMM for each word; the training data consist of codebook sequences from one or more talkers
- An unknown word is recognized by choosing the word whose model gives the highest score (i.e. the highest likelihood)
Solution to Problem 1
- Problem: compute P(o1,o2,…,oT | λ)
- Consider a state sequence q = (q1,q2,…,qT)
- Assume observations are independent given the states:
  P(O|q,λ) = Πt=1..T P(ot|qt,λ) = bq1(o1)·bq2(o2)·…·bqT(oT)
- P(q|λ) = πq1·aq1q2·aq2q3·…·aqT-1qT
- P(O|λ) = Σq P(O|q,λ)·P(q|λ)
- There are N^T state sequences, each requiring O(T) multiplications
- Complexity: O(T·N^T); for N=5, T=100, T·N^T = 100·5^100 ≈ 10^72
Forward Algorithm: Intuition
the probability of observing the partial sequence (o1,o2,… ,ot) such that state qt is i
αt(i) = P(o1,o2,… ,ot,qt=i|λ)
Forward Algorithm
- Forward variable: αt(i) = P(o1,o2,…,ot, qt=i | λ)
- αt(i) is the probability of observing the partial sequence (o1,o2,…,ot) such that state qt is Si
- Initialization: α1(i) = πi·bi(o1)
- Induction: αt+1(j) = [Σi=1..N αt(i)·aij]·bj(ot+1)
- Termination: P(O|λ) = Σi=1..N αT(i)
- Complexity: O(N²T)
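The forward recursion can be sketched as a minimal implementation; the model parameters below are assumed toy values.

```python
# Minimal forward pass for a discrete HMM; toy parameters assumed.
def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1..o_t, q_t=i | model); returns (alpha, P(O))."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]          # initialization
    for o in obs[1:]:                                           # induction
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha, sum(alpha[-1])                                # termination

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j] = P(q_{t+1}=j | q_t=i)
B = [[0.9, 0.1], [0.2, 0.8]]   # B[i][k] = P(o_t=k | q_t=i)
alpha, prob = forward([0, 1, 0], pi, A, B)
print(prob)
```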
Backward Algorithm: Intuition
The probability of observing the partial sequence (ot+1,ot+2,…,oT) given that state qt is i
βt(i) = P(ot+1,ot+2,… ,oT|qt=i,λ)
βt(i) = Σj=1..N aij·bj(ot+1)·βt+1(j)

P(O|λ) = Σi=1..N πi·bi(o1)·β1(i)
Backward Algorithm
- Backward variable: βt(i) = P(ot+1,ot+2,…,oT | qt=i, λ)
- βt(i) is the probability of observing the partial sequence (ot+1,ot+2,…,oT) given that state qt is i
- Initialization: βT(i) = 1
- Induction: βt(i) = Σj=1..N aij·bj(ot+1)·βt+1(j)
- Termination: P(O|λ) = Σi=1..N πi·bi(o1)·β1(i)
- Complexity: O(N²T)
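A matching backward pass can be sketched the same way; on the same toy model (assumed parameters, not from the slides) its termination step yields the same P(O|λ) as the forward algorithm.

```python
# Minimal backward pass for a discrete HMM; toy parameters assumed.
def backward(obs, pi, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | q_t=i, model); returns (beta, P(O))."""
    N, T = len(pi), len(obs)
    beta = [[1.0] * N]                           # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):               # induction, backwards in time
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * nxt[j] for j in range(N))
                        for i in range(N)])
    prob = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(N))  # termination
    return beta, prob

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
beta, prob = backward([0, 1, 0], pi, A, B)
print(prob)
```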
Combining Forward and Backward

P(O, qt=i | λ) = P(o1,…,oT, qt=i | λ)
             = P(o1,…,ot, qt=i | λ)·P(ot+1,…,oT | qt=i, o1,…,ot, λ)
             = P(o1,…,ot, qt=i | λ)·P(ot+1,…,oT | qt=i, λ)
             = αt(i)·βt(i)

P(O|λ) = Σi=1..N αt(i)·βt(i), for any 1 ≤ t ≤ T
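The identity P(O|λ) = Σi αt(i)·βt(i) holds at every time step, which can be checked numerically; the parameters below are assumed toy values.

```python
# Check that sum_i alpha_t(i)*beta_t(i) is the same P(O) at every t.
# Toy parameters assumed; forward/backward recursions as on the earlier slides.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1, 0]
N, T = len(pi), len(obs)

alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])

beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
               for i in range(N)]

pO = sum(alpha[T - 1])
for t in range(T):
    combined = sum(alpha[t][i] * beta[t][i] for i in range(N))
    print(t, combined)   # the same value at every t
```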
Solution to Problem 2
- Find the most likely path
- Find the path that maximizes the likelihood P(q1,q2,…,qT | O, λ), which is equivalent to maximizing P(q1,q2,…,qT, O | λ)
- Define δt(i) = max over q1,…,qt-1 of P(q1,…,qt-1, qt=i, o1,o2,…,ot | λ)
- δt(i) is the highest probability along a single path ending in state i at time t
- By induction: δt+1(j) = [maxi δt(i)·aij]·bj(ot+1)
Viterbi Algorithm

δt(i) = max over q1,…,qt-1 of P(q1,…,qt-1, qt=i, o1,o2,…,ot | λ)

δt+1(j) = [maxi δt(i)·aij]·bj(ot+1)

P* = max 1≤i≤N δT(i)
Viterbi Algorithm
- Initialization: δ1(i) = πi·bi(o1), ψ1(i) = 0
- Recursion: δt(j) = [max 1≤i≤N δt-1(i)·aij]·bj(ot), ψt(j) = argmax 1≤i≤N δt-1(i)·aij
- Termination: P* = max 1≤i≤N δT(i), qT* = argmax 1≤i≤N δT(i)
- Path (state sequence) backtracking: qt* = ψt+1(q*t+1), t = T-1, T-2, …, 1
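The four steps (initialization, recursion, termination, backtracking) can be sketched as follows; the model parameters are assumed toy values, and ψ is kept as a table of backpointers.

```python
# Viterbi decoding with backpointers; toy parameters assumed.
def viterbi(obs, pi, A, B):
    N, T = len(pi), len(obs)
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]       # initialization
    psi = [[0] * N]
    for t in range(1, T):                                    # recursion
        d_prev = delta[-1]
        row, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: d_prev[i] * A[i][j])
            row.append(d_prev[best_i] * A[best_i][j] * B[j][obs[t]])
            back.append(best_i)
        delta.append(row)
        psi.append(back)
    q = [max(range(N), key=lambda i: delta[-1][i])]          # termination
    for t in range(T - 1, 0, -1):                            # backtracking
        q.insert(0, psi[t][q[0]])
    return q, max(delta[-1])

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
path, p_star = viterbi([0, 0, 1, 1], pi, A, B)
print(path, p_star)
```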
Solution to Problem 3
- Estimate λ = (A,B,π) to maximize P(O|λ)
- No analytic solution is known because of the complexity; an iterative method is used
- ξt(i,j) is the probability of being in state i at time t, and in state j at time t+1:

ξt(i,j) = αt(i)·aij·bj(ot+1)·βt+1(j) / P(O|λ)
        = αt(i)·aij·bj(ot+1)·βt+1(j) / Σk=1..N Σl=1..N αt(k)·akl·bl(ot+1)·βt+1(l)
Expectation Maximization
- a′ij = (expected number of transitions from state i to state j) / (expected number of transitions from state i)
- b′j(k) = (expected number of times in state j observing symbol k) / (expected number of times in state j)
- Each re-estimation does not decrease the likelihood: P(O|λ′) ≥ P(O|λ)
Expectation Maximization

ξt(i,j) = P(qt=i, qt+1=j | O, λ) = P(qt=i, qt+1=j, O | λ) / P(O|λ)

a′ij = Σt=1..T-1 ξt(i,j) / Σt=1..T-1 Σk=1..N ξt(i,k)
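The re-estimation of a′ij via ξ can be sketched as follows. The forward and backward passes are inlined so the block is self-contained, and the model parameters are assumed toy values.

```python
# One Baum-Welch re-estimation of the transition matrix A via xi.
# Toy parameters assumed; forward/backward inlined for self-containment.
def reestimate_A(obs, pi, A, B):
    N, T = len(pi), len(obs)
    # forward pass
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    # backward pass
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
                   for i in range(N)]
    pO = sum(alpha[T - 1])

    # xi_t(i,j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P(O)
    def xi(t, i, j):
        return alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / pO

    # a'_ij = expected transitions i->j / expected transitions out of i
    return [[sum(xi(t, i, j) for t in range(T - 1)) /
             sum(xi(t, i, k) for t in range(T - 1) for k in range(N))
             for j in range(N)] for i in range(N)]

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(reestimate_A([0, 1, 0, 0, 1], pi, A, B))
```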
Part-of-Speech Tagging
POS Tagging
- Labeling each word in a sentence with its appropriate part of speech, e.g. noun, verb, adjective, …
- The-AT representative-NN put-VBD chairs-NNS on-IN the-AT table-NN.
- The-AT representative-JJ put-NN chairs-VBZ on-IN the-AT table-NN.
Information Sources in Tagging
- Context information
  - "a new play" vs. "play football"
- Syntagmatic information
  - AT JJ NN is common
  - AT JJ VBP is extremely rare
- Lexical information
  - the tag distribution of a word is extremely uneven
  - basic tag vs. derived tags
  - a dumb tagger achieves 90% accuracy
Summary
- A rule-based tagger using syntagmatic patterns achieves about 77% accuracy (Greene and Rubin, 1971)
- A dumb tagger (basic tag) achieves 90% (Charniak, 1993)
- An HMM tagger achieves about 97%
HMM Taggers
- The states of the HMM are tags
- Transition probability: P(tk | tj) = C(tj, tk) / C(tj)
- Emission probability: P(wl | tj) = C(wl, tj) / C(tj)
- Tag sequence: argmax over t1,n of P(t1,n | w1,n)
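These count-based estimates can be computed from any tagged corpus; the tiny corpus below is invented purely for illustration.

```python
from collections import Counter

# MLE estimates for an HMM tagger from a tiny hand-made tagged corpus.
# The corpus is illustrative only, not from the slides.
corpus = [[("the", "AT"), ("bear", "NN"), ("is", "BEZ"), ("on", "IN"),
           ("the", "AT"), ("move", "NN")]]

tag_count = Counter()   # C(t_j)
bigram = Counter()      # C(t_j, t_k): tag t_j followed by tag t_k
emit = Counter()        # C(w_l, t_j): word w_l tagged t_j
for sent in corpus:
    tags = [t for _, t in sent]
    for w, t in sent:
        tag_count[t] += 1
        emit[(w, t)] += 1
    for a, b in zip(tags, tags[1:]):
        bigram[(a, b)] += 1

def p_trans(tk, tj):    # P(t_k | t_j) = C(t_j, t_k) / C(t_j)
    return bigram[(tj, tk)] / tag_count[tj]

def p_emit(w, tj):      # P(w_l | t_j) = C(w_l, t_j) / C(t_j)
    return emit[(w, tj)] / tag_count[tj]

print(p_trans("NN", "AT"), p_emit("the", "AT"))
```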
Transition Probability

Tag bigram counts C(tj, tk) from the training corpus (row = tj, column = tk):

tj \ tk      AT    BEZ     IN     NN     VB    PRD
AT            0      0      0  48636      0     19
BEZ        1973      0    426    187      0     38
IN        43322      0   1325  17314      0    185
NN         1067   3720  42470  11773    614  21392
VB         6072     42   4758   1476    129   1522
PRD        8016     75   4656   1329    954      0

P(tk | tj) = C(tj, tk) / C(tj)
Emission Probability

Word-tag counts C(wl, tj) (row = wl, column = tj):

wl \ tj      AT    BEZ     IN     NN     VB    PRD
bear          0      0      0     10     43      0
is            0  10065      0      0      0      0
move          0      0      0     36    133      0
on            0      0   5484      0      0      0
president     0      0      0    382      0      0
progress      0      0      0    108      4      0
the       69016      0      0      0      0      0

P(wl | tj) = C(wl, tj) / C(tj)