
HMM Algorithms

Peter Bell

Automatic Speech Recognition — ASR Lecture 3
20 January 2020


Overview

HMM algorithms

HMM recap

HMM algorithms

Likelihood computation (forward algorithm)
Finding the most probable state sequence (Viterbi algorithm)
Estimating the parameters (forward-backward and EM algorithms)

Warning: the maths continues!


Recap: the HMM

[Figure: graphical model of the HMM. Hidden states qt−1 → qt → qt+1 are linked by transition probabilities P(qt|qt−1) and P(qt+1|qt); each state emits an observation xt with probability P(xt|qt)]

A generative model for the sequence X = (x1, . . . , xT )

Discrete states qt are unobserved

qt+1 is conditionally independent of q1, . . . , qt−1, given qt

Observations xt are conditionally independent of each other, given qt.


HMMs for ASR

The three-state left-to-right topology for phones:

[Figure: three-state left-to-right HMMs for the phones /r/, /ai/ and /t/ (states r1 r2 r3, ai1 ai2 ai3, t1 t2 t3), concatenated and generating the observation sequence x1 x2 x3 x4 ...]


Computing likelihoods with the HMM

Joint likelihood of X and Q = (q1, . . . , qT ):

P(X, Q|λ) = P(q1) P(x1|q1) P(q2|q1) P(x2|q2) ...    (1)

          = P(q1) P(x1|q1) ∏_{t=2}^{T} P(qt|qt−1) P(xt|qt)    (2)

P(q1) denotes the initial occupancy probability of each state
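As a concrete illustration of equation (2), here is a minimal sketch that evaluates P(X, Q|λ) for one known state sequence of a toy discrete-observation HMM (all parameter values and variable names, pi, A and B, are hypothetical and not from the lecture):

```python
import numpy as np

# Toy 2-state HMM with discrete observations {0, 1} (all numbers hypothetical)
pi = np.array([0.7, 0.3])                 # pi[j]   = P(q1 = j)
A  = np.array([[0.8, 0.2],                # A[i, j] = P(q_{t+1} = j | q_t = i)
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],                # B[j, v] = P(x_t = v | q_t = j)
               [0.2, 0.8]])

def joint_likelihood(X, Q):
    """P(X, Q | lambda) for a known state sequence Q, as in equation (2)."""
    p = pi[Q[0]] * B[Q[0], X[0]]                    # P(q1) P(x1|q1)
    for t in range(1, len(X)):
        p *= A[Q[t - 1], Q[t]] * B[Q[t], X[t]]      # P(qt|qt-1) P(xt|qt)
    return p

print(joint_likelihood(X=[0, 1, 1], Q=[0, 0, 1]))
```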


HMM parameters

The parameters of the model, λ, are given by:

Transition probabilities akj = P(qt+1 = j |qt = k)

Observation probabilities bj(x) = P(x|q = j)


The three problems of HMMs

Working with HMMs requires the solution of three problems:

1 Likelihood: Determine the overall likelihood of an observation sequence X = (x1, ..., xt, ..., xT) being generated by a known HMM topology M → the forward algorithm

2 Decoding and alignment: Given an observation sequence and an HMM, determine the most probable hidden state sequence → the Viterbi algorithm

3 Training: Given an observation sequence and an HMM, learn the state occupation probabilities, in order to find the best HMM parameters λ = {{ajk}, {bj(·)}} → the forward-backward and EM algorithms


Notes on the HMM topology

By the HMM topology, M, we can mean:

A restricted left-to-right topology based on a known word/sentence, leading to a “trellis-like” structure over time

A much less restricted topology based on a grammar or language model – or something in between

The forward/backward algorithms are not (generally) suitable for unrestricted topologies


Example: trellis for /k ae t/

[Figure: state–time trellis for /k ae t/. Three emitting states per phone (/k/, /ae/, /t/), plus non-emitting start and end states, on the vertical axis; the observations x1 ... x19 on the horizontal (time) axis]


1. Likelihood

Goal: determine p(X |M)

Sum over all possible state sequences Q = (q1, ..., qT) that could result in the observation sequence X:

p(X|M) = ∑_{Q} P(X, Q|M) = ∑_{Q} P(q1) P(x1|q1) ∏_{t=2}^{T} P(qt|qt−1) P(xt|qt)

How many paths Q do we have to calculate?

N × N × · · · × N (T times) = N^T    (N: number of HMM states, T: length of the observation sequence)

e.g. N^T ≈ 10^10 for N = 3, T = 20

Computational complexity of the direct calculation: O(2T · N^T) multiplications
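To make the cost of direct enumeration concrete, a sketch (same hypothetical toy model as above) that sums P(X, Q|M) over every one of the N^T state sequences; this is workable only for tiny T:

```python
import itertools
import numpy as np

pi = np.array([0.7, 0.3])
A  = np.array([[0.8, 0.2], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])

def brute_force_likelihood(X):
    """p(X|M) by summing over all N**T paths -- exponential in T."""
    N, total = len(pi), 0.0
    for Q in itertools.product(range(N), repeat=len(X)):   # N**T sequences
        p = pi[Q[0]] * B[Q[0], X[0]]
        for t in range(1, len(X)):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], X[t]]
        total += p
    return total

print(brute_force_likelihood([0, 1, 1, 0]))   # fine for T=4; hopeless for T=100
```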


Likelihood: The Forward algorithm

The Forward algorithm:

Rather than enumerating each sequence, compute the probabilities recursively (exploiting the Markov assumption)

Reduces the computational complexity to O(TN2)

Visualise the problem as a state-time trellis

[Figure: state–time trellis with states i, j, k at times t−1, t, t+1]


The forward probability

Define the forward probability αj(t): the probability of observing the observation sequence x1 ... xt and being in state j at time t:

αj(t) = p(x1, ..., xt, qt = j | M)

We can recursively compute this probability


Initial and final state probabilities

In what follows it is convenient to define:

an additional single initial state, SI = 0, with transition probabilities

a0j = P(q1 = j)

denoting the probability of starting in state j

a single final state, SE, with transition probabilities ajE denoting the probability of the model terminating in state j.

SI and SE are both non-emitting


1. Likelihood: The Forward recursion

Initialisation

αj(0) = 1   if j = 0
αj(0) = 0   if j ≠ 0

Recursion

αj(t) = ∑_{i=1}^{J} αi(t−1) aij bj(xt)    1 ≤ j ≤ J,  1 ≤ t ≤ T

Termination

p(X|M) = αE = ∑_{i=1}^{J} αi(T) aiE

(sI: initial state, sE: final state)
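A minimal sketch of the forward recursion under these conventions, with the non-emitting entry and exit states folded into the vectors a0 (entries a0j) and aE (entries ajE); all names and toy values are hypothetical:

```python
import numpy as np

def forward_likelihood(X, a0, A, aE, B):
    """p(X|M) via the forward recursion -- O(T N^2).

    a0[j] = P(q1 = j), A[i, j] = a_ij, aE[j] = a_jE,
    B[j, x] = b_j(x) for discrete observations.
    """
    T, J = len(X), len(a0)
    alpha = np.zeros((T, J))
    alpha[0] = a0 * B[:, X[0]]                      # initialisation via entry state
    for t in range(1, T):                           # recursion
        alpha[t] = (alpha[t - 1] @ A) * B[:, X[t]]  # sum_i alpha_i(t-1) a_ij b_j(x_t)
    return alpha[-1] @ aE                           # termination: alpha_E

# Toy left-to-right model (hypothetical numbers; each row of A plus aE sums to 1)
a0 = np.array([1.0, 0.0])
A  = np.array([[0.6, 0.3], [0.0, 0.7]])
aE = np.array([0.1, 0.3])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_likelihood([0, 0, 1], a0, A, aE, B))
```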


1. Likelihood: Forward Recursion

αj(t) = p(x1, ..., xt, qt = j | M) = ∑_{i=1}^{J} αi(t−1) aij bj(xt)

[Figure: trellis fragment showing αj(t) computed from αi(t−1), αj(t−1) and αk(t−1) via the transition probabilities aij, ajj, akj]


Viterbi algorithm

Instead of summing over all possible state sequences, consider just the most probable path:

P*(X|M) = max_{Q} P(X, Q|M)

Achieve this by changing the summation to a maximisation in the forward algorithm recursion:

Vj(t) = max_i Vi(t−1) aij bj(xt)

If we are performing decoding or forced alignment, then only the most likely path is needed

In training, it can be used as an approximation

We need to keep track of the states that make up this path by keeping a sequence of backpointers to enable a Viterbi backtrace: the backpointer for each state at each time indicates the previous state on the most probable path


Viterbi Recursion

Vj(t) = max_i Vi(t−1) aij bj(xt)

Likelihood of the most probable path

[Figure: trellis fragment showing Vj(t) computed as the max over Vi(t−1), Vj(t−1) and Vk(t−1), weighted by the transition probabilities aij, ajj, akj]


Viterbi Recursion

Backpointers to the previous state on the most probable path

[Figure: the same trellis fragment, with the backpointer Bj(t) = i recording which previous state gave the maximum]


2. Decoding: The Viterbi algorithm

Initialisation

V0(0) = 1
Vj(0) = 0   if j ≠ 0
Bj(0) = 0

Recursion

Vj(t) = max_{1≤i≤J} Vi(t−1) aij bj(xt)
Bj(t) = arg max_{1≤i≤J} Vi(t−1) aij bj(xt)

Termination

P* = VE = max_{1≤i≤J} Vi(T) aiE
s*_T = BE = arg max_{1≤i≤J} Vi(T) aiE
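A sketch of Viterbi decoding with backpointers and backtrace, using the same toy conventions as the forward sketch above (hypothetical names and values):

```python
import numpy as np

def viterbi_decode(X, a0, A, aE, B):
    """Most probable state sequence and its likelihood P*."""
    T, J = len(X), len(a0)
    V  = np.zeros((T, J))                 # V[t, j]  = V_j(t)
    bp = np.zeros((T, J), dtype=int)      # bp[t, j] = backpointer B_j(t)
    V[0] = a0 * B[:, X[0]]
    for t in range(1, T):
        # scores[i, j] = V_i(t-1) * a_ij * b_j(x_t)
        scores = V[t - 1][:, None] * A * B[:, X[t]][None, :]
        V[t]  = scores.max(axis=0)
        bp[t] = scores.argmax(axis=0)
    final = V[-1] * aE                    # include the transition to the exit state
    P_star, q_last = final.max(), int(final.argmax())
    path = [q_last]
    for t in range(T - 1, 0, -1):         # Viterbi backtrace
        path.append(int(bp[t, path[-1]]))
    return P_star, path[::-1]

a0 = np.array([1.0, 0.0])
A  = np.array([[0.6, 0.3], [0.0, 0.7]])
aE = np.array([0.1, 0.3])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi_decode([0, 0, 1], a0, A, aE, B))
```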


Viterbi Backtrace

Backtrace to find the state sequence of the most probable path

[Figure: trellis fragment illustrating the backtrace: following backpointers such as Bk(t+1) = i and Bi(t) = j backwards through the trellis recovers the most probable state sequence]


3. Training: Forward-Backward algorithm

Goal: Efficiently estimate the parameters of an HMM M from an observation sequence

Assume a single Gaussian output probability distribution

bj(x) = p(x|q = j) = N(x; μj, Σj)

The parameters of M:

Transition probabilities aij, with ∑_j aij = 1

Gaussian parameters for state j: mean vector μj; covariance matrix Σj


Viterbi Training

If we knew the state-time alignment, then each observation feature vector could be assigned to a specific state

A state-time alignment can be obtained using the most probable path obtained by Viterbi decoding

Maximum likelihood estimate of aij, if C(i → j) is the count of transitions from i to j:

aij = C(i → j) / ∑_k C(i → k)

Likewise, if Zj is the set of observed acoustic feature vectors assigned to state j, we can use the standard maximum likelihood estimates for the mean and the covariance:

μj = ( ∑_{x∈Zj} x ) / |Zj|

Σj = ( ∑_{x∈Zj} (x − μj)(x − μj)^T ) / |Zj|
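A sketch of the Viterbi-training updates given a hard state–time alignment (for example, the path returned by the decoder above); the array names and toy data are hypothetical, and transitions into the exit state are ignored for simplicity:

```python
import numpy as np

def viterbi_training_updates(X, align, J):
    """ML re-estimates from a hard alignment.
    X[t] is a feature vector, align[t] the state assigned to frame t."""
    X, align = np.asarray(X, dtype=float), np.asarray(align)
    T, D = X.shape
    # Transition probabilities from counts C(i -> j)
    C = np.zeros((J, J))
    for t in range(T - 1):
        C[align[t], align[t + 1]] += 1
    A = C / C.sum(axis=1, keepdims=True)
    # Gaussian parameters from the frames assigned to each state
    mu, Sigma = np.zeros((J, D)), np.zeros((J, D, D))
    for j in range(J):
        Zj = X[align == j]
        mu[j] = Zj.mean(axis=0)
        diff = Zj - mu[j]
        Sigma[j] = diff.T @ diff / len(Zj)
    return A, mu, Sigma

X = np.random.randn(6, 2)                 # 6 toy frames of 2-d features
A, mu, Sigma = viterbi_training_updates(X, align=[0, 0, 1, 1, 2, 2], J=3)
```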


EM Algorithm

Viterbi training is an approximation: we would like to consider all possible paths

In this case, rather than having a hard state-time alignment, we estimate a probability

State occupation probability: the probability γj(t) of occupying state j at time t given the sequence of observations. Compare with the component occupation probability in a GMM

We can use this for an iterative algorithm for HMM training: the EM algorithm (whose adaptation to HMMs is called the Baum-Welch algorithm)

Each iteration has two steps:

E-step: estimate the state occupation probabilities (Expectation)

M-step: re-estimate the HMM parameters based on the estimated state occupation probabilities (Maximisation)


Backward probabilities

To estimate the state occupation probabilities it is useful to define (recursively) another set of probabilities, the backward probabilities

βj(t) = p(xt+1, ..., xT | qt = j, M)

The probability of the future observations, given that the HMM is in state j at time t

These can be recursively computed (going backwards in time)

Initialisation

βi(T) = aiE

Recursion

βi(t) = ∑_{j=1}^{J} aij bj(xt+1) βj(t+1)    for t = T−1, ..., 1

Termination

p(X|M) = β0(0) = ∑_{j=1}^{J} a0j bj(x1) βj(1) = αE
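A sketch of the backward recursion under the same toy conventions as the forward sketch (hypothetical names and values); its termination value matches the forward likelihood:

```python
import numpy as np

def backward_probs(X, a0, A, aE, B):
    """beta[t, j] = p(x_{t+1}, ..., x_T | q_t = j, M)."""
    T, J = len(X), len(a0)
    beta = np.zeros((T, J))
    beta[-1] = aE                                    # initialisation: beta_j(T) = a_jE
    for t in range(T - 2, -1, -1):                   # recursion, backwards in time
        beta[t] = A @ (B[:, X[t + 1]] * beta[t + 1])
    return beta

a0 = np.array([1.0, 0.0])
A  = np.array([[0.6, 0.3], [0.0, 0.7]])
aE = np.array([0.1, 0.3])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
X  = [0, 0, 1]
beta = backward_probs(X, a0, A, aE, B)
print(a0 @ (B[:, X[0]] * beta[0]))     # termination: equals the forward likelihood
```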


Backward Recursion

βj(t) = p(xt+1, ..., xT | qt = j, M) = ∑_{i=1}^{J} aji bi(xt+1) βi(t+1)

[Figure: trellis fragment showing βj(t) computed by summing βi(t+1), βj(t+1) and βk(t+1), weighted by the transition probabilities aji, ajj, ajk and the observation probabilities at time t+1]


State Occupation Probability

The state occupation probability γj(t) is the probability of occupying state j at time t, given the sequence of observations

Express it in terms of the forward and backward probabilities:

γj(t) = P(qt = j | X, M) = (1/αE) αj(t) βj(t)

recalling that p(X|M) = αE

Since

αj(t) βj(t) = p(x1, ..., xt, qt = j | M) p(xt+1, ..., xT | qt = j, M)
            = p(x1, ..., xt, xt+1, ..., xT, qt = j | M)
            = p(X, qt = j | M)

we have

P(qt = j | X, M) = p(X, qt = j | M) / p(X|M)
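A small sketch of this computation, assuming the (T × J) alpha and beta matrices from the forward and backward sketches above are available:

```python
import numpy as np

def occupation_probs(alpha, beta):
    """gamma[t, j] = P(q_t = j | X, M) = alpha_j(t) beta_j(t) / alpha_E,
    where alpha and beta are the (T, J) forward and backward matrices."""
    alpha_E = (alpha * beta)[0].sum()    # alpha_j(t) beta_j(t) summed over j is p(X|M) at any t
    gamma = alpha * beta / alpha_E
    return gamma                         # each row sums to 1
```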


Re-estimation of Gaussian parameters

The sum of state occupation probabilities through time for a state may be regarded as a “soft” count

We can use this “soft” alignment to re-estimate the HMM parameters:

μj = ∑_{t=1}^{T} γj(t) xt / ∑_{t=1}^{T} γj(t)

Σj = ∑_{t=1}^{T} γj(t) (xt − μj)(xt − μj)^T / ∑_{t=1}^{T} γj(t)


Re-estimation of transition probabilities

Similarly to the state occupation probability, we can estimate ξt(i, j), the probability of being in state i at time t and state j at t+1, given the observations:

ξt(i, j) = P(qt = i, qt+1 = j | X, M)
         = p(qt = i, qt+1 = j, X | M) / p(X|M)
         = αi(t) aij bj(xt+1) βj(t+1) / αE

We can use this to re-estimate the transition probabilities

aij = ∑_{t=1}^{T} ξt(i, j) / ∑_{k=1}^{J} ∑_{t=1}^{T} ξt(i, k)
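A sketch computing ξt(i, j) and the corresponding transition re-estimates, again assuming X, A, B and the alpha/beta matrices from the earlier sketches (the non-emitting exit state is ignored in this sketch):

```python
import numpy as np

def reestimate_transitions(X, A, B, alpha, beta):
    """xi[t, i, j] = alpha_i(t) a_ij b_j(x_{t+1}) beta_j(t+1) / alpha_E,
    then a_ij = sum_t xi_t(i, j) / sum_k sum_t xi_t(i, k)."""
    T, J = alpha.shape
    alpha_E = (alpha * beta)[0].sum()
    xi = np.zeros((T - 1, J, J))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, X[t + 1]] * beta[t + 1])[None, :]
    xi /= alpha_E
    num = xi.sum(axis=0)                             # sum over t of xi_t(i, j)
    return num / num.sum(axis=1, keepdims=True)      # normalise each row over k
```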


Pulling it all together

Iterative estimation of HMM parameters using the EM algorithm. At each iteration:

E-step: For all time–state pairs
1 Recursively compute the forward probabilities αj(t) and backward probabilities βj(t)
2 Compute the state occupation probabilities γj(t) and ξt(i, j)

M-step: Based on the estimated state occupation probabilities, re-estimate the HMM parameters: mean vectors μj, covariance matrices Σj and transition probabilities aij

The application of the EM algorithm to HMM training is sometimes called the Forward-Backward algorithm or Baum-Welch algorithm


Extension to a corpus of utterances

We usually train from a large corpus of R utterances

If x_t^r is the t-th frame of the r-th utterance X^r, then we can compute the probabilities α_j^r(t), β_j^r(t), γ_j^r(t) and ξ_t^r(i, j) as before

The re-estimates are as before, except we must sum over the R utterances, e.g.:

μj = ∑_{r=1}^{R} ∑_{t=1}^{T} γ_j^r(t) x_t^r / ∑_{r=1}^{R} ∑_{t=1}^{T} γ_j^r(t)

In addition, we usually employ “embedded training”, in which fine tuning of the phone labelling with “forced Viterbi alignment”, or forced alignment, is involved. (For details see Section 9.7 in Jurafsky and Martin’s SLP)


Extension to Gaussian mixture model (GMM)

The assumption of a Gaussian distribution at each state is very strong; in practice the acoustic feature vectors associated with a state may be strongly non-Gaussian

In this case an M-component Gaussian mixture model is an appropriate density function:

bj(x) = p(x|q = j) = ∑_{m=1}^{M} cjm N(x; μjm, Σjm)

Given enough components, this family of functions can model any distribution.

Train using the EM algorithm, in which the component occupation probabilities are estimated in the E-step
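A sketch of evaluating the GMM output density for one state, using diagonal covariances for simplicity (all parameter values are hypothetical):

```python
import numpy as np

def gmm_density(x, weights, means, variances):
    """b_j(x) = sum_m c_jm N(x; mu_jm, Sigma_jm), with diagonal covariances."""
    x = np.asarray(x, dtype=float)
    # log of each diagonal-Gaussian component density
    log_comp = -0.5 * (np.log(2 * np.pi * variances)
                       + (x - means) ** 2 / variances).sum(axis=1)
    return float(weights @ np.exp(log_comp))

weights   = np.array([0.6, 0.4])                 # c_jm, sum to 1
means     = np.array([[0.0, 0.0], [1.0, -1.0]])  # mu_jm
variances = np.array([[1.0, 1.0], [0.5, 2.0]])   # diag(Sigma_jm)
print(gmm_density([0.2, -0.3], weights, means, variances))
```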


EM training of HMM/GMM

Rather than estimating the state-time alignment, we estimate the component/state-time alignment, and component–state occupation probabilities γjm(t): the probability of occupying mixture component m of state j at time t. (ξtm(j) in Jurafsky and Martin’s SLP)

We can thus re-estimate the mean of mixture component m of state j as follows:

μjm = ∑_{t=1}^{T} γjm(t) xt / ∑_{t=1}^{T} γjm(t)

And likewise for the covariance matrices (mixture models often use diagonal covariance matrices)

The mixture coefficients are re-estimated in a similar way to transition probabilities:

cjm = ∑_{t=1}^{T} γjm(t) / ∑_{m′=1}^{M} ∑_{t=1}^{T} γjm′(t)
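A sketch of these M-step updates for one state's mixture, assuming the component–state occupation probabilities γjm(t) have already been computed in the E-step (here as a hypothetical (T × M) array gamma_m):

```python
import numpy as np

def update_mixture(X, gamma_m):
    """Re-estimate c_jm and mu_jm for one state j.
    X: (T, D) feature vectors; gamma_m: (T, M) occupation probs gamma_jm(t)."""
    occ = gamma_m.sum(axis=0)                 # sum_t gamma_jm(t), per component
    c   = occ / occ.sum()                     # mixture weights c_jm
    mu  = (gamma_m.T @ X) / occ[:, None]      # weighted means mu_jm
    return c, mu
```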


Doing the computation

The forward, backward and Viterbi recursions result in a long sequence of probabilities being multiplied

This can cause floating point underflow problems

In practice computations are performed in the log domain (in which multiplies become adds)

Working in the log domain also avoids needing to perform the exponentiation when computing Gaussians
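A sketch of the forward recursion carried out in the log domain, where the sum over predecessor states becomes a log-sum-exp (same toy parameters as the earlier forward sketch; scipy's logsumexp is used for numerical stability):

```python
import numpy as np
from scipy.special import logsumexp

def log_forward_likelihood(X, a0, A, aE, B):
    """log p(X|M): multiplies become adds of log-probabilities."""
    with np.errstate(divide="ignore"):          # log(0) = -inf for impossible transitions
        log_a0, log_A, log_aE, log_B = np.log(a0), np.log(A), np.log(aE), np.log(B)
    log_alpha = log_a0 + log_B[:, X[0]]
    for t in range(1, len(X)):
        # log sum_i exp(log alpha_i(t-1) + log a_ij) + log b_j(x_t)
        log_alpha = logsumexp(log_alpha[:, None] + log_A, axis=0) + log_B[:, X[t]]
    return logsumexp(log_alpha + log_aE)

a0 = np.array([1.0, 0.0])
A  = np.array([[0.6, 0.3], [0.0, 0.7]])
aE = np.array([0.1, 0.3])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])
print(log_forward_likelihood([0, 0, 1], a0, A, aE, B))   # log of the forward likelihood
```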


Summary: HMMs

HMMs provide a generative model for statistical speech recognition

Three key problems:

1 Computing the overall likelihood: the Forward algorithm
2 Decoding the most likely state sequence: the Viterbi algorithm
3 Estimating the most likely parameters: the EM (Forward-Backward) algorithm

Solutions to these problems are tractable due to the two key HMM assumptions:

1 Conditional independence of observations given the current state

2 Markov assumption on the states


References: HMMs

Gales and Young (2007). “The Application of Hidden Markov Models in Speech Recognition”, Foundations and Trends in Signal Processing, 1 (3), 195–304: section 2.2.

Jurafsky and Martin (2008). Speech and Language Processing (2nd ed.): sections 6.1–6.5; 9.2; 9.4. (Errata at http://www.cs.colorado.edu/~martin/SLP/Errata/SLP2-PIEV-Errata.html)

Rabiner and Juang (1989). “An introduction to hidden Markov models”, IEEE ASSP Magazine, 3 (1), 4–16.

Renals and Hain (2010). “Speech Recognition”, Computational Linguistics and Natural Language Processing Handbook, Clark, Fox and Lappin (eds.), Blackwells.
