
Page 1: Parameter Estimation For HMM  Lecture #7


Parameter Estimation For HMM Lecture #7

Background Readings: Chapter 3.3 in the textbook, Biological Sequence Analysis, Durbin et al., 2001.

Page 2: Parameter Estimation For HMM  Lecture #7


Reminder: Hidden Markov Model

Markov chain transition probabilities: $p(S_{i+1} = t \mid S_i = s) = a_{st}$

Emission probabilities: $p(X_i = b \mid S_i = s) = e_s(b)$

[Figure: HMM chain S1 → S2 → … → SL-1 → SL, each state Si emitting symbol xi]

$p(s, x) = p(s_1, \dots, s_L;\, x_1, \dots, x_L) = \prod_{i=1}^{L} a_{s_{i-1} s_i}\, e_{s_i}(x_i)$

(here $a_{s_0 s_1}$ denotes the initial-state probability of $s_1$)

For each integer L > 0, this defines a probability space in which the simple events are the pairs (s, x) of state and output sequences of length L.

Note: for a given L, an HMM pair (s, x) of length ℓ < L is a compound event: the set of all sequences of length L which have (s, x) as a prefix.
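As a concrete illustration of the product formula, here is a minimal Python sketch that evaluates p(s, x) for one state path and output sequence. The two-state model (pi, a, e) is a made-up toy example, not parameters from the lecture:

```python
# Toy two-state HMM; all numbers are illustrative assumptions.
pi = {'+': 0.5, '-': 0.5}                    # initial-state probabilities (the a_{s0 s1} factor)
a = {('+', '+'): 0.8, ('+', '-'): 0.2,       # transition probabilities a_kl
     ('-', '+'): 0.1, ('-', '-'): 0.9}
e = {'+': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},   # emission probabilities e_k(b)
     '-': {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}}

def joint_prob(states, symbols):
    """p(s, x) = a_{s0 s1} * prod_i a_{s_{i-1} s_i} * e_{s_i}(x_i)."""
    p = pi[states[0]] * e[states[0]][symbols[0]]
    for prev, cur, sym in zip(states, states[1:], symbols[1:]):
        p *= a[(prev, cur)] * e[cur][sym]
    return p

print(joint_prob('++--', 'CGTA'))   # probability of one simple event (s, x) of length 4
```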
Page 3: Parameter Estimation For HMM  Lecture #7


Hidden Markov Model for CpG Islands

The states: Domain(Si) = {+, −} × {A, C, T, G} (8 values)

In this representation P(xi | si) = 0 or 1, depending on whether xi is consistent with si. E.g., xi = G is consistent with si = (+, G) and with si = (−, G), but not with any other state.

[Figure: interconnected "+" states (A+, C+, G+, T+) and "−" states (A−, C−, G−, T−), each state emitting its own letter]

Page 4: Parameter Estimation For HMM  Lecture #7

Reminder: Most Probable State Path

[Figure: HMM chain S1 → … → SL emitting x1, …, xL]

Given an output sequence x = (x1,…,xL),

a most probable path s* = (s*1,…,s*L) is one which maximizes p(s|x):

$s^* = (s_1^*, \dots, s_L^*) = \operatorname{argmax}_{(s_1, \dots, s_L)}\; p(s_1, \dots, s_L \mid x_1, \dots, x_L)$

Page 5: Parameter Estimation For HMM  Lecture #7

[Figure: an example state path through "+" and "−" states for an output sequence]

Predicting CpG islands via most probable path:

Output symbols: A, C, G, T (4 letters). Markov chain states: 4 "−" states and 4 "+" states, two for each letter (8 states total). The most probable path (found by Viterbi's algorithm) predicts CpG islands.

An experiment (Durbin et al., pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few "false negatives" are found.
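For reference, here is a minimal log-space Viterbi sketch (a generic rendering of the algorithm, not the lecture's own code), reusing the toy pi, a, e dictionaries assumed in the earlier snippet; it assumes strictly positive parameters:

```python
import math

def viterbi(symbols, states, pi, a, e):
    """Most probable state path s* = argmax_s p(s | x), computed in log space."""
    V = [{k: math.log(pi[k]) + math.log(e[k][symbols[0]]) for k in states}]
    back = []
    for sym in symbols[1:]:
        prev, col, ptr = V[-1], {}, {}
        for l in states:
            # best predecessor k for state l at this position
            k_best = max(states, key=lambda k: prev[k] + math.log(a[(k, l)]))
            col[l] = prev[k_best] + math.log(a[(k_best, l)]) + math.log(e[l][sym])
            ptr[l] = k_best
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda k: V[-1][k])   # best final state
    path = [last]
    for ptr in reversed(back):                   # trace back pointers
        path.append(ptr[path[-1]])
    return ''.join(reversed(path))

# e.g. viterbi('CGGT', '+-', pi, a, e) returns a string such as '++--'
```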

Page 6: Parameter Estimation For HMM  Lecture #7

Reminder: Most probable state

Given an output sequence x = (x1,…,xL), si is a most probable state (at location i) if:

$s_i = \operatorname{argmax}_k\; p(S_i = k \mid x)$

$p(S_i = s_i \mid x_1, \dots, x_L) = \frac{p(S_i = s_i,\, x_1, \dots, x_L)}{p(x_1, \dots, x_L)}$

Page 7: Parameter Estimation For HMM  Lecture #7


Finding the probability that a letter is in a CpG island via the algorithm for most probable state:

The probability that the occurrence of G in the i-th location is in a CpG island (+ state) is:

$\sum_{s^+} p(S_i = s^+ \mid x) = \frac{1}{p(x)} \sum_{s^+} F(S_i = s^+)\, B(S_i = s^+)$

where the summation is formally over the 4 "+" states, but actually only the state G+ needs to be considered (why?).

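The F and B values above come from the forward and backward algorithms. Here is a minimal sketch of both, plus the posterior just described, under the same assumed toy model; note the explicit normalization by p(x):

```python
def forward(symbols, states, pi, a, e):
    """F[i][k] = p(x_1, ..., x_i, S_i = k)  (positions 0-based in code)."""
    F = [{k: pi[k] * e[k][symbols[0]] for k in states}]
    for sym in symbols[1:]:
        F.append({l: e[l][sym] * sum(F[-1][k] * a[(k, l)] for k in states)
                  for l in states})
    return F

def backward(symbols, states, a, e):
    """B[i][k] = p(x_{i+1}, ..., x_L | S_i = k)."""
    B = [{k: 1.0 for k in states}]
    for sym in reversed(symbols[1:]):
        B.insert(0, {k: sum(a[(k, l)] * e[l][sym] * B[0][l] for l in states)
                     for k in states})
    return B

def posterior(symbols, states, pi, a, e):
    """p(S_i = k | x) = F[i][k] * B[i][k] / p(x)."""
    F, B = forward(symbols, states, pi, a, e), backward(symbols, states, a, e)
    px = sum(F[-1][k] for k in states)   # p(x) from the last forward column
    return [{k: F[i][k] * B[i][k] / px for k in states}
            for i in range(len(symbols))]
```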

Page 8: Parameter Estimation For HMM  Lecture #7

Parameter Estimation for HMM

An HMM is defined by the parameters akl and ek(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters.

[Figure: HMM chain with a transition k → l labeled akl and an emission of symbol b from state k labeled ek(b)]

Page 9: Parameter Estimation For HMM  Lecture #7

Parameter Estimation for HMM

To determine the values of (the parameters in) θ, use a training set = {x1,...,xn}, where each xj is a sequence which is assumed to fit the model. Given the parameters θ, each sequence xj has an assigned probability p(xj|θ).


Page 10: Parameter Estimation For HMM  Lecture #7

Maximum Likelihood Parameter Estimation for HMM

The elements of the training set {x1,...,xn} are assumed to be independent, so p(x1,..., xn|θ) = ∏j p(xj|θ).

ML parameter estimation looks for θ which maximizes the above.

The exact method for finding or approximating this θ depends on the nature of the training set used.

Page 11: Parameter Estimation For HMM  Lecture #7

Data for HMM

Possible properties of (the sequences in) the training set:
1. For each xj, whether the states sji are known (the symbols xji are usually known).
2. The size (number of sequences) of the training set.


Page 12: Parameter Estimation For HMM  Lecture #7


Case 1: ML when Sequences are fully known

We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate akl and ek(b) for all pairs of states k, l and symbols b.

By the ML method, we look for parameters θ* which maximize the probability of the sample set: $p(x^1, \dots, x^n \mid \theta^*) = \max_\theta\, p(x^1, \dots, x^n \mid \theta)$.


Page 13: Parameter Estimation For HMM  Lecture #7


Case 1: Sequences are fully known


Let mkl = |{i : si−1 = k, si = l}| and mk(b) = |{i : si = k, xi = b}| (both counted in xj).

For each xj we have:

$p(x^j \mid \theta) = \prod_{i=1}^{L} a_{s_{i-1} s_i}\, e_{s_i}(x^j_i)$

then:

$p(x^j \mid \theta) = \prod_{(k,l)} a_{kl}^{\,m_{kl}} \prod_{(k,b)} [e_k(b)]^{m_k(b)}$

Page 14: Parameter Estimation For HMM  Lecture #7


Case 1 (cont)


By the independence of the xj's, p(x1,...,xn|θ) = ∏j p(xj|θ).

Thus, if Akl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set, we have:

$p(x^1, \dots, x^n \mid \theta) = \prod_{(k,l)} a_{kl}^{\,A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$

Page 15: Parameter Estimation For HMM  Lecture #7


Case 1 (cont)

So we need to find akl's and ek(b)'s which maximize:

$\prod_{(k,l)} a_{kl}^{\,A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$

subject to: for all states k, $\sum_l a_{kl} = 1$, $\sum_b e_k(b) = 1$, and all $a_{kl}, e_k(b) \ge 0$.

Page 16: Parameter Estimation For HMM  Lecture #7


Case 1 (cont)

Rewriting, we need to maximize:

$F = \prod_{(k,l)} a_{kl}^{\,A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)} = \prod_k \Big[ \prod_l a_{kl}^{\,A_{kl}} \Big] \cdot \prod_k \Big[ \prod_b [e_k(b)]^{E_k(b)} \Big]$

subject to: for all k, $\sum_l a_{kl} = 1$ and $\sum_b e_k(b) = 1$.

Page 17: Parameter Estimation For HMM  Lecture #7


Case 1 (cont)

If we maximize, for each k, $\prod_l a_{kl}^{\,A_{kl}}$ subject to $\sum_l a_{kl} = 1$, and also $\prod_b [e_k(b)]^{E_k(b)}$ subject to $\sum_b e_k(b) = 1$,

then we will also maximize F. Each of the above is a simpler ML problem, which is similar to ML parameter estimation for a die, treated next.

Page 18: Parameter Estimation For HMM  Lecture #7

ML parameter estimation for a die

Let X be a random variable with 6 values x1,…,x6 denoting the six outcomes of a die. Here the parameters are θ = {θ1, θ2, θ3, θ4, θ5, θ6}, with ∑θi = 1. Assume that the data is one sequence:

Data = (x6, x1, x1, x3, x2, x2, x3, x4, x5, x2, x6)

So we have to maximize:

$P(Data \mid \theta) = \theta_1^2\, \theta_2^3\, \theta_3^2\, \theta_4\, \theta_5\, \theta_6^2$

subject to: θ1 + θ2 + θ3 + θ4 + θ5 + θ6 = 1 [and θi ≥ 0], i.e.,

$P(Data \mid \theta) = \theta_1^2\, \theta_2^3\, \theta_3^2\, \theta_4\, \theta_5\, \Big(1 - \sum_{i=1}^{5} \theta_i\Big)^2$

Page 19: Parameter Estimation For HMM  Lecture #7

Maximum Likelihood Estimate

By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). We will find the parameters by considering the corresponding log-likelihood function:

$\log P(Data \mid \theta) = \sum_{i=1}^{5} N_i \log \theta_i + N_6 \log\Big(1 - \sum_{i=1}^{5} \theta_i\Big)$

(where $N_i$ is the number of times outcome i appears in the data; here $N_1 = 2$, $N_2 = 3$, $N_3 = 2$, $N_4 = 1$, $N_5 = 1$, $N_6 = 2$).

A necessary condition for a (local) maximum is:

$\frac{\partial}{\partial \theta_j} \log P(Data \mid \theta) = \frac{N_j}{\theta_j} - \frac{N_6}{1 - \sum_{i=1}^{5} \theta_i} = 0, \qquad j = 1, \dots, 5$

Page 20: Parameter Estimation For HMM  Lecture #7


Finding the Maximum

Rearranging terms:

$\frac{N_j}{\theta_j} = \frac{N_6}{\theta_6}, \qquad j = 1, \dots, 5$

Divide the jth equation by the ith equation:

$\frac{\theta_j}{\theta_i} = \frac{N_j}{N_i}$

Sum from j = 1 to 6:

$\frac{1}{\theta_i} = \sum_{j=1}^{6} \frac{\theta_j}{\theta_i} = \sum_{j=1}^{6} \frac{N_j}{N_i} = \frac{N}{N_i} \qquad (N = N_1 + \dots + N_6)$

So there is only one local (and hence global) maximum. Hence the MLE is given by:

$\theta_i = \frac{N_i}{N}, \qquad i = 1, \dots, 6$
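A quick numeric check of this result on the slide's Data sequence (N = 11; an illustration, not part of the lecture):

```python
from collections import Counter

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]   # the slide's Data sequence
counts = Counter(data)                      # N_i for each outcome i
theta_mle = {i: counts[i] / len(data) for i in range(1, 7)}
print(theta_mle)  # {1: 2/11, 2: 3/11, 3: 2/11, 4: 1/11, 5: 1/11, 6: 2/11}
```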

Page 21: Parameter Estimation For HMM  Lecture #7

Generalization to a distribution with any number k of outcomes

Let X be a random variable with k values x1,…,xk denoting the k outcomes of independent and identically distributed experiments, with parameters θ = {θ1, θ2, ..., θk} (θi is the probability of xi). Again, the data is one sequence, of length n:

Data = (xi1, xi2, ..., xin)

Then we have to maximize

$P(Data \mid \theta) = \theta_1^{n_1}\, \theta_2^{n_2} \cdots \theta_k^{n_k}, \qquad (n_1 + \dots + n_k = n)$

Subject to: θ1 + θ2 + .... + θk = 1, i.e.,

$P(Data \mid \theta) = \theta_1^{n_1} \cdots \theta_{k-1}^{n_{k-1}} \Big(1 - \sum_{i=1}^{k-1} \theta_i\Big)^{n_k}$

Page 22: Parameter Estimation For HMM  Lecture #7

Generalization for k outcomes (cont)

By treatment identical to the die case, the maximum is obtained when, for all i:

$\frac{\theta_i}{n_i} = \frac{\theta_k}{n_k}$

Hence the MLE is given by the relative frequencies:

$\theta_i = \frac{n_i}{n}, \qquad i = 1, \dots, k$

Page 23: Parameter Estimation For HMM  Lecture #7


Fractional Exponents

Some models allow the ni's to be fractions (e.g., if we are uncertain of a die outcome, we may consider it "6" with 20% confidence and "5" with 80%). We can still write:

$P(Data \mid \theta) = \theta_1^{n_1}\, \theta_2^{n_2} \cdots \theta_k^{n_k}, \qquad (n_1 + \dots + n_k = n)$

and the same analysis yields:

$\theta_i = \frac{n_i}{n}, \qquad i = 1, \dots, k$

Page 24: Parameter Estimation For HMM  Lecture #7


Apply the ML method to HMM

Let Akl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We need to:


Maximize:

$\prod_{(k,l)} a_{kl}^{\,A_{kl}} \prod_{(k,b)} [e_k(b)]^{E_k(b)}$

subject to: for all states k, $\sum_l a_{kl} = 1$, $\sum_b e_k(b) = 1$, and $a_{kl}, e_k(b) \ge 0$.

Page 25: Parameter Estimation For HMM  Lecture #7


Apply to HMM (cont.)

We apply the previous technique to get, for each k, the parameters {akl | l = 1,..,m} and {ek(b) | b ∈ Σ}:


$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$

which gives the optimal ML parameters.
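A minimal sketch of this fully observed ("Case 1") estimation in Python; the labeled training pairs below are hypothetical toy data, not from the lecture:

```python
from collections import defaultdict

def ml_estimate(training):
    """training: list of (state_path, symbols) pairs with fully known states.
    Returns a_kl = A_kl / sum_l' A_kl' and e_k(b) = E_k(b) / sum_b' E_k(b')."""
    A = defaultdict(lambda: defaultdict(float))   # A[k][l]: k -> l transition counts
    E = defaultdict(lambda: defaultdict(float))   # E[k][b]: emissions of b from state k
    for path, symbols in training:
        for k, l in zip(path, path[1:]):
            A[k][l] += 1
        for k, b in zip(path, symbols):
            E[k][b] += 1
    a = {k: {l: c / sum(row.values()) for l, c in row.items()} for k, row in A.items()}
    e = {k: {b: c / sum(row.values()) for b, c in row.items()} for k, row in E.items()}
    return a, e

a_hat, e_hat = ml_estimate([('++--', 'CGTA'), ('-++-', 'ACGT')])  # toy labeled data
```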

Page 26: Parameter Estimation For HMM  Lecture #7


Summary of Case 1: Sequences are fully known

We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate akl and ek(b) for all pairs of states k, l and symbols b.

Summary: when everything is known, we can find the (unique set of) parameters θ* which maximizes $p(x^1, \dots, x^n \mid \theta^*) = \max_\theta\, p(x^1, \dots, x^n \mid \theta)$.


Page 27: Parameter Estimation For HMM  Lecture #7


Adding pseudo counts in HMM

If the sample set is too small, we may get a biased result. In this case we modify the actual counts by our prior knowledge/belief: rkl is our prior belief on transitions from k to l, and rk(b) is our prior belief on emissions of b from state k.


then:

$a_{kl} = \frac{A_{kl} + r_{kl}}{\sum_{l'} (A_{kl'} + r_{kl'})}, \qquad e_k(b) = \frac{E_k(b) + r_k(b)}{\sum_{b'} (E_k(b') + r_k(b'))}$
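In code, pseudocounts are a small change to the counting sketch above: every count starts at r instead of 0. Using a single uniform value r for all rkl and rk(b) is an assumption made here for brevity:

```python
def ml_estimate_pseudo(training, states, alphabet, r=1.0):
    """As ml_estimate, but counts start at the pseudocount r (uniform prior assumption)."""
    A = {k: {l: r for l in states} for k in states}
    E = {k: {b: r for b in alphabet} for k in states}
    for path, symbols in training:
        for k, l in zip(path, path[1:]):
            A[k][l] += 1
        for k, b in zip(path, symbols):
            E[k][b] += 1
    a = {k: {l: c / sum(row.values()) for l, c in row.items()} for k, row in A.items()}
    e = {k: {b: c / sum(row.values()) for b, c in row.items()} for k, row in E.items()}
    return a, e
```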

Page 28: Parameter Estimation For HMM  Lecture #7


Case 2: State paths are unknown

For a given θ we have: p(x1,..., xn|θ) = p(x1|θ) ⋯ p(xn|θ)

(since the xj are independent)


For each sequence x, p(x|θ) = ∑s p(x,s|θ), the sum taken over all state paths s which emit x.

Page 29: Parameter Estimation For HMM  Lecture #7


Case 2: State paths are unknown

Thus, for the n sequences (x1,..., xn) we have:

$p(x^1, \dots, x^n \mid \theta) = \sum_{(s^1, \dots, s^n)} p(x^1, \dots, x^n, s^1, \dots, s^n \mid \theta),$

where the summation is taken over all tuples of n state paths (s1,..., sn) which generate (x1,..., xn). For simplicity, we will assume that n = 1.


Page 30: Parameter Estimation For HMM  Lecture #7


Case 2: State paths are unknown

So we need to maximize p(x|θ) = ∑s p(x,s|θ), where the summation is over all the state paths s which produce the output sequence x. Finding θ* which maximizes ∑s p(x,s|θ) is hard. [Unlike finding θ* which maximizes p(x,s|θ) for a single pair (x,s).]


Page 31: Parameter Estimation For HMM  Lecture #7


ML Parameter Estimation for HMM

The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(x|θ') > p(x|θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.

A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.

Page 32: Parameter Estimation For HMM  Lecture #7


Baum Welch training

We start with some values of akl and ek(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* s.t.

p(x|θ*) > p(x|θ)

This is done by "imitating" the algorithm for Case 1, where all states are known:


Page 33: Parameter Estimation For HMM  Lecture #7


Baum Welch training

In case 1 we computed the optimal values of akl and ek(b), (for the optimal θ) by simply counting the number Akl of transitions from state k to state l, and the number Ek(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.

[Figure: adjacent states Si−1 = k, Si = l emitting xi−1 = b and xi = c]

Page 34: Parameter Estimation For HMM  Lecture #7


Baum Welch training

When the states are unknown, the counting process is replaced by an averaging process: for each edge si−1 → si we compute the average number of "k to l" transitions, for all possible pairs (k, l), over this edge. Then, for each k and l, we take Akl to be the sum over all edges.

[Figure: adjacent states Si−1 = ?, Si = ? emitting xi−1 = b and xi = c]

Page 35: Parameter Estimation For HMM  Lecture #7


Baum Welch training

Similarly, for each emission si → xi and each state k, we compute the average number of times that si = k, which is the expected number of "k emits b" events at this position (where b = xi). Then we take Ek(b) to be the sum over all such positions. These expected values are computed by assuming the current parameters θ:

[Figure: a state Si = ? emitting a symbol b]

Page 36: Parameter Estimation For HMM  Lecture #7


Akl and Ek(b) when states are unknown

Akl and Ek(b) are computed according to the current distribution θ, that is:


$A_{kl} = \sum_s A^s_{kl}\; p(s \mid x, \theta)$, where $A^s_{kl}$ is the number of k to l transitions in the state sequence s.

$E_k(b) = \sum_s E^s_k(b)\; p(s \mid x, \theta)$, where $E^s_k(b)$ is the number of times k emits b in the sequence s with output x.

Page 37: Parameter Estimation For HMM  Lecture #7

Baum Welch: step 1a: count expected number of state transitions

For each i, k,l, compute the state transitions probabilities by the current θ:


P(si−1 = k, si = l | x, θ)

For this, we use the forward and backward algorithms.

Page 38: Parameter Estimation For HMM  Lecture #7


Baum Welch: Step 1a (cont)

Claim: by the probability distribution of the HMM,


$p(s_{i-1} = k,\, s_i = l \mid x, \theta) = \frac{F_k(i-1)\; a_{kl}\, e_l(x_i)\; B_l(i)}{p(x \mid \theta)}$

(akl and el(xi) are the parameters defined by θ, and Fk(i−1), Bl(i) are computed by the forward and backward algorithms)

Page 39: Parameter Estimation For HMM  Lecture #7


Step 1a: Computing P(si-1=k, si=l | x,θ)

$P(x_1, \dots, x_L,\, s_{i-1} = k,\, s_i = l \mid \theta) = P(x_1, \dots, x_{i-1},\, s_{i-1} = k \mid \theta)\; a_{kl}\, e_l(x_i)\; P(x_{i+1}, \dots, x_L \mid s_i = l, \theta) = F_k(i-1)\; a_{kl}\, e_l(x_i)\; B_l(i)$

where the first factor is computed via the forward algorithm and the last via the backward algorithm. Dividing by p(x|θ):

$p(s_{i-1} = k,\, s_i = l \mid x, \theta) = \frac{F_k(i-1)\; a_{kl}\, e_l(x_i)\; B_l(i)}{p(x \mid \theta)}$

Page 40: Parameter Estimation For HMM  Lecture #7


Step 1a (end)

For each pair (k, l), compute the expected number of state transitions from k to l, as the sum of the expected number of k to l transitions over all L edges:

$A_{kl} = \sum_{i=1}^{L} p(s_{i-1} = k,\, s_i = l \mid x, \theta) = \frac{1}{p(x \mid \theta)} \sum_{i=1}^{L} F_k(i-1)\; a_{kl}\, e_l(x_i)\; B_l(i)$
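A sketch of this computation, reusing the forward/backward helpers from the earlier snippet (single sequence; the model dictionaries are the assumed toy ones):

```python
def expected_transitions(symbols, states, pi, a, e):
    """A_kl = (1/p(x)) * sum_i F_k(i-1) * a_kl * e_l(x_i) * B_l(i)   (step 1a)."""
    F = forward(symbols, states, pi, a, e)
    B = backward(symbols, states, a, e)
    px = sum(F[-1][k] for k in states)            # p(x | theta)
    A = {(k, l): 0.0 for k in states for l in states}
    for i in range(1, len(symbols)):              # one term per edge s_{i-1} -> s_i
        for k in states:
            for l in states:
                A[(k, l)] += F[i-1][k] * a[(k, l)] * e[l][symbols[i]] * B[i][l] / px
    return A
```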

Page 41: Parameter Estimation For HMM  Lecture #7


Step 1a for many sequences:

Exercise: Prove that when we have n independent input sequences (x1,..., xn ), then Akl is given by:

$A_{kl} = \sum_{j=1}^{n} \sum_{i=1}^{L} p(s_{i-1} = k,\, s_i = l \mid x^j, \theta) = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i=1}^{L} f^j_k(i-1)\; a_{kl}\, e_l(x^j_i)\; b^j_l(i)$

where $f^j_k$ and $b^j_l$ are the forward and backward values for $x^j$ under θ.

Page 42: Parameter Estimation For HMM  Lecture #7

Baum-Welch: Step 1b: count expected number of symbol emissions

For each state k and each symbol b, and for each i where xi = b, compute the expected number of times that si = k.


$p(s_i = k \mid x_1, \dots, x_L) = \frac{p(s_i = k,\, x_1, \dots, x_L)}{p(x_1, \dots, x_L)} = \frac{F_k(i)\, B_k(i)}{p(x_1, \dots, x_L)}$

Page 43: Parameter Estimation For HMM  Lecture #7


Baum-Welch: Step 1b

For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that si = k, over all i’s for which xi = b.

$E_k(b) = \frac{1}{p(x \mid \theta)} \sum_{i:\, x_i = b} F_k(i)\, B_k(i)$
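And the matching step 1b sketch, under the same assumptions:

```python
def expected_emissions(symbols, states, pi, a, e):
    """E_k(b) = (1/p(x)) * sum over {i : x_i = b} of F_k(i) * B_k(i)   (step 1b)."""
    F = forward(symbols, states, pi, a, e)
    B = backward(symbols, states, a, e)
    px = sum(F[-1][k] for k in states)
    E = {(k, b): 0.0 for k in states for b in set(symbols)}
    for i, b in enumerate(symbols):
        for k in states:
            E[(k, b)] += F[i][k] * B[i][k] / px
    return E
```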

Page 44: Parameter Estimation For HMM  Lecture #7


Step 1b for many sequences

Exercise: when we have n sequences (x1,..., xn ), the expected number of emissions of b from k is given by:

$E_k(b) = \sum_{j=1}^{n} \frac{1}{p(x^j)} \sum_{i:\, x^j_i = b} F^j_k(i)\, B^j_k(i)$

Page 45: Parameter Estimation For HMM  Lecture #7


Summary of Steps 1a and 1b: the E part of the Baum Welch training

These steps compute the expected numbers Akl of k-to-l transitions, for all pairs of states k and l, and the expected numbers Ek(b) of emissions of symbol b from state k, for all states k and symbols b.

The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.

Page 46: Parameter Estimation For HMM  Lecture #7


Baum-Welch: step 2

$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$

Use the Akl’s, Ek(b)’s to compute the new values of akl and ek(b). These values define θ*.

The correctness of the EM algorithm implies that: p(x1,..., xn|θ*) ≥ p(x1,..., xn|θ),

i.e., θ* does not decrease the probability of the data.

This procedure is iterated, until some convergence criterion is met.
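Putting steps 1a, 1b, and 2 together, one full Baum-Welch iteration on a single sequence might look like this minimal sketch (built on the assumed helpers above; the initial distribution pi is kept fixed here for simplicity):

```python
def baum_welch_step(symbols, states, alphabet, pi, a, e):
    """One EM iteration: E step (expected counts), then M step (re-normalize)."""
    A = expected_transitions(symbols, states, pi, a, e)   # step 1a
    E = expected_emissions(symbols, states, pi, a, e)     # step 1b
    # Step 2: a_kl = A_kl / sum_l' A_kl',  e_k(b) = E_k(b) / sum_b' E_k(b')
    a_new = {(k, l): A[(k, l)] / sum(A[(k, m)] for m in states)
             for k in states for l in states}
    e_new = {k: {b: E.get((k, b), 0.0) / sum(E.get((k, c), 0.0) for c in alphabet)
                 for b in alphabet} for k in states}
    return a_new, e_new

# Iterate baum_welch_step, monitoring p(x | theta), until the chosen
# convergence criterion is met.
```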