Parameter Estimation For HMM Lecture #7
Background Readings: Chapter 3.3 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
2
Reminder: Hidden Markov Model
Markov Chain transition probabilities: p(Si+1= t|Si = s) = ast
Emission probabilities: p(Xi = b| Si = s) = es(b)
[Diagram: hidden Markov chain s1 → s2 → … → sL (transition arrows M), each si emitting xi (emission arrows T)]

p(s, x) = p(s1,…,sL; x1,…,xL) = ∏_{i=1}^{L} a_{s_{i-1} s_i} · e_{s_i}(xi)
For each integer L>0, this defines a probability space in which the simple events are HMM sequence pairs (s, x) of length L.
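The product formula above can be sketched in code. This is a minimal illustration with a hypothetical two-state "+"/"-" model; all parameter values below are made up for the example, and an explicit initial distribution plays the role of the first transition.

```python
# Hypothetical CpG-style parameters (illustrative only, not from the lecture).
init = {"+": 0.5, "-": 0.5}                      # p(s1), playing the role of a_{s0 s1}
trans = {"+": {"+": 0.8, "-": 0.2},
         "-": {"+": 0.1, "-": 0.9}}              # transition probabilities a_kl
emit = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}  # emission probabilities e_k(b)

def joint_prob(states, symbols):
    """p(s1..sL, x1..xL) = prod_i a_{s_{i-1} s_i} e_{s_i}(x_i) for one path and sequence."""
    p = init[states[0]] * emit[states[0]][symbols[0]]
    for i in range(1, len(states)):
        p *= trans[states[i-1]][states[i]] * emit[states[i]][symbols[i]]
    return p

print(joint_prob("++--", "CGAT"))  # → 0.0010368
```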
3
Hidden Markov Model for CpG Islands
The states: Domain(Si) = {+, -} × {A, C, T, G} (8 values)
In this representation P(xi| si) = 0 or 1 depending on whether xi is consistent with si . E.g. xi= G is consistent with si=(+,G) and with si=(-,G) but not with any other state of si.
[Diagram: state-transition graph over the "+" and "-" states for A, C, G, T]
4
Reminder: Most Probable state path
Given an output sequence x = (x1,…,xL),
A most probable path s* = (s*1,…,s*L) is one which maximizes p(s|x):

s* = argmax_{s1,…,sL} p(s1,…,sL | x1,…,xL)
6
[Diagram: state path A-, C-, T-, T+ emitting the output A C T T]
Predicting CpG islands via most probable path:
Output symbols: A, C, G, T (4 letters). Markov Chain states: 4 "-" states and 4 "+" states, two for each letter (8 states total). The most probable path (found by Viterbi's algorithm) predicts CpG islands.
Experiments (Durbin et al., pp. 60-61) show that the predicted islands are shorter than the assumed ones. In addition, quite a few "false negatives" are found.
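Viterbi's algorithm itself can be sketched as follows, in log space to avoid underflow. The two-state "+"/"-" parameters are hypothetical toy values, not the 8-state model of the slide.

```python
import math

# Toy two-state model standing in for the +/- CpG machinery (illustrative values).
init = {"+": 0.5, "-": 0.5}
trans = {"+": {"+": 0.8, "-": 0.2}, "-": {"+": 0.1, "-": 0.9}}
emit = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(x):
    """Most probable state path s* = argmax_s p(s, x), computed in log space."""
    states = list(init)
    # v[k] = log-probability of the best path for x[0..i] ending in state k
    v = {k: math.log(init[k]) + math.log(emit[k][x[0]]) for k in states}
    back = []                                    # back-pointers, one dict per position
    for c in x[1:]:
        ptr, nv = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] + math.log(trans[k][l]))
            ptr[l] = best
            nv[l] = v[best] + math.log(trans[best][l]) + math.log(emit[l][c])
        back.append(ptr)
        v = nv
    # trace back from the best final state
    last = max(states, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

print(viterbi("CGCGATAT"))
```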
7
Reminder: Most probable state
Given an output sequence x = (x1,…,xL), si is a most probable state (at location i) if:
si=argmaxk p(Si=k |x).
p(Si = k | x1,…,xL) = p(Si = k, x1,…,xL) / p(x1,…,xL)
9
Finding the probability that a letter is in a CpG island via the algorithm for most probable state:
The probability that the occurrence of G in the i-th location is in a CpG island (+ state) is:
∑s+ p(Si = s+ | x) = ∑s+ F(Si = s+)·B(Si = s+) / p(x)

where the summation is formally over the 4 "+" states, but actually only the state G+ needs to be considered (why?)
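The posterior p(Si = k | x) = Fk(i)·Bk(i)/p(x) can be sketched with the forward and backward algorithms. The two-state parameters below are toy values for illustration.

```python
# Posterior decoding sketch: p(S_i = k | x) = F_k(i) B_k(i) / p(x).
# The two-state model parameters are hypothetical.
init = {"+": 0.5, "-": 0.5}
trans = {"+": {"+": 0.8, "-": 0.2}, "-": {"+": 0.1, "-": 0.9}}
emit = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
states = list(init)

def forward(x):
    """F_k(i) = p(x_1..x_i, S_i = k), as a list of dicts indexed by position."""
    F = [{k: init[k] * emit[k][x[0]] for k in states}]
    for c in x[1:]:
        F.append({l: sum(F[-1][k] * trans[k][l] for k in states) * emit[l][c]
                  for l in states})
    return F

def backward(x):
    """B_k(i) = p(x_{i+1}..x_L | S_i = k); B_k(L) = 1."""
    B = [{k: 1.0 for k in states}]
    for c in reversed(x[1:]):
        B.insert(0, {k: sum(trans[k][l] * emit[l][c] * B[0][l] for l in states)
                     for k in states})
    return B

def posterior(x, i, k):
    """p(S_i = k | x); i is 1-based as in the slides."""
    F, B = forward(x), backward(x)
    px = sum(F[-1][s] for s in states)           # p(x) = sum_k F_k(L)
    return F[i-1][k] * B[i-1][k] / px

x = "CGCGATAT"
print(posterior(x, 2, "+"))                      # prob. that position 2 is in a "+" state
```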
10
Parameter Estimation for HMM
An HMM model is defined by the parameters: akl and ek(b), for all states k,l and all symbols b. Let θ denote the collection of these parameters.
[Diagram: hidden chain s1 → … → sL; each transition k → l carries probability akl, and each state k emits symbol b with probability ek(b)]
11
Parameter Estimation for HMM
To determine the values of (the parameters in) θ, use a training set = {x1,...,xn}, where each xj is a sequence which is assumed to fit the model. Given the parameters θ, each sequence xj has an assigned probability p(xj|θ).
12
Maximum Likelihood Parameter Estimation for HMM
The elements of the training set {x1,...,xn} are assumed to be independent, so p(x1,..., xn|θ) = ∏j p(xj|θ).
ML parameter estimation looks for θ which maximizes the above.
The exact method for finding or approximating this θ depends on the nature of the training set used.
13
Data for HMM
Possible properties of (the sequences in) the training set:
1. For each xj, whether the information on the states sji is available (the symbols xji are usually known).
2. The size (number of sequences) of the training set.
14
Case 1: ML when Sequences are fully known
We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate akl and ek(b) for all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which maximize the probability of the sample set: p(x1,...,xn|θ*) = maxθ p(x1,...,xn|θ).
15
Case 1: Sequences are fully known
Let mkl = |{i : si-1 = k, si = l}| (in xj), and mk(b) = |{i : si = k, xi = b}| (in xj).

For each xj we have:

p(xj|θ) = ∏_{i=1}^{L} a_{s_{i-1} s_i} e_{s_i}(xi) = ∏_{(k,l)} (akl)^{mkl} · ∏_{(k,b)} [ek(b)]^{mk(b)}
16
Case 1 (cont)
By the independence of the xj’s, p(x1,...,xn| θ)=∏jp(xj|θ).
Thus, if Akl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set, we have:
p(x1,..., xn|θ) = ∏_{(k,l)} (akl)^{Akl} · ∏_{(k,b)} [ek(b)]^{Ek(b)}
17
Case 1 (cont)
So we need to find akl's and ek(b)'s which maximize:

∏_{(k,l)} (akl)^{Akl} · ∏_{(k,b)} [ek(b)]^{Ek(b)}

Subject to: for all states k, ∑l akl = 1 and ∑b ek(b) = 1 [akl, ek(b) ≥ 0]
18
Case 1 (cont)
Rewriting, we need to maximize:

F = [∏k ∏l (akl)^{Akl}] · [∏k ∏b [ek(b)]^{Ek(b)}]

Subject to: for all k, ∑l akl = 1, and ∑b ek(b) = 1.
19
Case 1 (cont)
If we maximize for each k: ∏l (akl)^{Akl} s.t. ∑l akl = 1, and also ∏b [ek(b)]^{Ek(b)} s.t. ∑b ek(b) = 1, then we will also maximize F.

Each of the above is a simpler ML problem, which is similar to ML parameter estimation for a die, treated next.
20
ML parameters estimation for a die
Let X be a random variable with 6 values x1,…,x6
denoting the six outcomes of a die. Here the parameters areθ ={1,2,3,4,5, 6} , ∑θi=1Assume that the data is one sequence:
Data = (x6,x1,x1,x3,x2,x2,x3,x4,x5,x2,x6)So we have to maximize
2 3 2 21 2 3 4 5 6( | )P Data
Subject to: θ1+θ2+ θ3+ θ4+ θ5+ θ6=1 [and θi 0 ]
252 3 2
1 2 3 4 51
i.e., ( | ) 1 ii
P Data
23
Maximum Likelihood Estimate

By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). We will find the parameters by considering the corresponding log-likelihood function:

log P(Data|θ) = N1·log θ1 + N2·log θ2 + N3·log θ3 + N4·log θ4 + N5·log θ5 + N6·log(1 − ∑_{i=1}^{5} θi)

where Ni is the number of occurrences of outcome i in the data. A necessary condition for a (local) maximum is:

∂/∂θj log P(Data|θ) = Nj/θj − N6/(1 − ∑_{i=1}^{5} θi) = 0, for j = 1,…,5
24
Finding the Maximum
Rearranging terms:

Nj/θj = N6/(1 − ∑_{i=1}^{5} θi) = N6/θ6, for j = 1,…,5

Divide the jth equation by the ith equation: θj/θi = Nj/Ni

Sum from j=1 to 6: 1/θi = (∑_{j=1}^{6} Nj)/Ni = N/Ni, where N = N1 + … + N6.

So there is only one local – and hence global – maximum. Hence the MLE is given by:

θi = Ni/N, i = 1,…,6
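The result θi = Ni/N can be checked directly on the data sequence from the slide:

```python
from collections import Counter

# MLE for the die: theta_i = N_i / N, the relative frequencies.
# This is exactly the data sequence from the slide.
data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]
counts = Counter(data)
N = len(data)                                    # N = 11
theta = {i: counts[i] / N for i in range(1, 7)}
print(theta)                                     # e.g. theta[2] = 3/11
```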
25
Generalization to a distribution with any number k of outcomes
Let X be a random variable with k values x1,…,xk denoting the k outcomes of Independently and Identically Distributed experiments, with parameters θ = {θ1, θ2, ..., θk} (θi is the probability of xi). Again, the data is one sequence of length n:
Data = (xi1, xi2, ..., xin)
Then we have to maximize

P(Data|θ) = θ1^{n1}·…·θk^{nk}, (n1 + … + nk = n)

Subject to: θ1 + θ2 + .... + θk = 1

i.e., P(Data|θ) = θ1^{n1}·…·θ_{k-1}^{n_{k-1}}·(1 − ∑_{i=1}^{k-1} θi)^{nk}
26
Generalization for k outcomes (cont)

By treatment identical to the die case, the maximum is obtained when for all i: ni/θi = nk/θk.

Hence the MLE is given by the relative frequencies:

θi = ni/n, i = 1,…,k
27
Fractional Exponents
Some models allow the ni's to be fractions (e.g., if we are uncertain of a die outcome, we may consider it "6" with 20% confidence and "5" with 80%). We can still write:

P(Data|θ) = θ1^{n1}·…·θk^{nk}, (n1 + … + nk = n)

And the same analysis yields:

θi = ni/n, i = 1,…,k
28
Apply the ML method to HMM
Let Akl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We need to:
Maximize ∏_{(k,l)} (akl)^{Akl} · ∏_{(k,b)} [ek(b)]^{Ek(b)}

Subject to: for all states k, ∑l akl = 1, ∑b ek(b) = 1, and akl, ek(b) ≥ 0.
29
Apply to HMM (cont.)
We apply the previous technique to get, for each k, the parameters {akl | l = 1,..,m} and {ek(b) | b ∈ Σ}:

akl = Akl / ∑l' Akl' , and ek(b) = Ek(b) / ∑b' Ek(b')

which gives the optimal ML parameters.
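A minimal sketch of this Case-1 estimation: count Akl and Ek(b) from fully labeled sequences, then normalize per state. The tiny labeled training set below is invented for illustration.

```python
from collections import defaultdict

# Hypothetical (state-path, output) training pairs over states {+, -} and symbols ACGT.
training = [("++--", "CGAT"), ("+-", "GA"), ("--++", "TACG")]

A = defaultdict(lambda: defaultdict(int))        # A[k][l] = #(k -> l transitions)
E = defaultdict(lambda: defaultdict(int))        # E[k][b] = #(b emitted from k)
for states, symbols in training:
    for k, l in zip(states, states[1:]):
        A[k][l] += 1
    for k, b in zip(states, symbols):
        E[k][b] += 1

# a_kl = A_kl / sum_l' A_kl' ,  e_k(b) = E_k(b) / sum_b' E_k(b')
a = {k: {l: A[k][l] / sum(A[k].values()) for l in A[k]} for k in A}
e = {k: {b: E[k][b] / sum(E[k].values()) for b in E[k]} for k in E}
print(a)
print(e)
```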
30
Summary of Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate akl and ek(b) for all pairs of states k, l and symbols b.
Summary: when everything is known, we can find the (unique set of) parameters θ* which maximizes
p(x1,...,xn|θ*) = maxθ p(x1,...,xn|θ).
31
Adding pseudo counts in HMM
If the sample set is too small, we may get a biased result. In this case we modify the actual counts by our prior knowledge/belief: rkl is our prior belief on transitions from k to l, and rk(b) is our prior belief on emissions of b from state k.
Then:

akl = (Akl + rkl) / ∑l' (Akl' + rkl') , and ek(b) = (Ek(b) + rk(b)) / ∑b' (Ek(b') + rk(b'))
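A small illustration of the pseudocount correction for transitions; the observed counts and the uniform prior value below are made up for the example.

```python
# Hypothetical observed transition counts; note the zero count for "+ -> -".
A = {"+": {"+": 2, "-": 0}, "-": {"+": 1, "-": 3}}
r = 1                                            # uniform prior pseudocount r_kl

# a_kl = (A_kl + r) / sum_l' (A_kl' + r)
a = {k: {l: (A[k][l] + r) / sum(A[k][l2] + r for l2 in A[k]) for l in A[k]}
     for k in A}
print(a)   # the zero count "+ -> -" now gets nonzero probability
```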
32
Case 2: State paths are unknown
For a given θ we have: p(x1,..., xn|θ) = p(x1|θ) ⋯ p(xn|θ)
(since the xj are independent)
For each sequence x, p(x|θ) = ∑s p(x,s|θ),
where the sum is taken over all state paths s which emit x.
33
Case 2: State paths are unknown
Thus, for the n sequences (x1,..., xn) we have:

p(x1,..., xn|θ) = ∑_{(s1,..., sn)} p(x1,..., xn, s1,..., sn|θ),

where the summation is taken over all tuples of n state paths (s1,..., sn) which generate (x1,..., xn). For simplicity, we will assume that n=1.
34
Case 2: State paths are unknown
So we need to maximize p(x|θ) = ∑s p(x,s|θ), where the summation is over all the state paths s which produce the output sequence x. Finding θ* which maximizes ∑s p(x,s|θ) is hard. [Unlike finding θ* which maximizes p(x,s|θ) for a single pair (x,s).]
35
ML Parameter Estimation for HMM
The general process for finding θ in this case is:
1. Start with an initial value of θ.
2. Find θ' so that p(x|θ') > p(x|θ).
3. Set θ = θ'.
4. Repeat until some convergence criterion is met.
A general algorithm of this type is the Expectation Maximization algorithm, which we will meet later. For the specific case of HMM, it is the Baum-Welch training.
36
Baum Welch training
We start with some values of akl and ek(b), which define prior values of θ. Then we use an iterative algorithm which attempts to replace θ by a θ* s.t.
p(x|θ*) > p(x|θ). This is done by "imitating" the algorithm for Case 1, where all states are known:
37
Baum Welch training
In case 1 we computed the optimal values of akl and ek(b), (for the optimal θ) by simply counting the number Akl of transitions from state k to state l, and the number Ek(b) of emissions of symbol b from state k, in the training set. This was possible since we knew all the states.
[Diagram: adjacent states si-1 = k, si = l emitting xi-1 = b, xi = c]
38
Baum Welch training
When the states are unknown, the counting process is replaced by an averaging process: for each edge si-1 → si we compute the expected number of "k to l" transitions, for all possible pairs (k,l), over this edge. Then, for each k and l, we take Akl to be the sum over all edges.
39
Baum Welch training
Similarly, for each edge si → xi and each state k, we compute the expected number of times that si = k, which is the expected number of "k → b" emissions on this edge (where xi = b). Then we take Ek(b) to be the sum over all such edges. These expected values are computed by assuming the current parameters θ:
40
Akl and Ek(b) when states are unknown
Akl and Ek(b) are computed according to the current distribution θ, that is:
Akl = ∑s Akl^s · p(s|x,θ), where Akl^s is the number of k to l transitions in the state sequence s.

Ek(b) = ∑s Ek^s(b) · p(s|x,θ), where Ek^s(b) is the number of times k emits b in the sequence s with output x.
41
Baum Welch: step 1a: count expected number of state transitions
For each i, k,l, compute the state transitions probabilities by the current θ:
P(si-1=k, si=l | x, θ)

For this, we use the forward and backward algorithms.
43
Baum Welch: Step 1a (cont)
Claim: By the probability distribution of HMM
p(si-1=k, si=l | x, θ) = Fk(i-1)·akl·el(xi)·Bl(i) / p(x|θ)

(akl and el(xi) are the parameters defined by θ, and Fk(i-1), Bl(i) are computed by the forward and backward algorithms)
44
Step 1a: Computing P(si-1=k, si=l | x,θ)
P(x1,…,xL, si-1=k, si=l | θ) = P(x1,…,xi-1, si-1=k | θ) · akl·el(xi) · P(xi+1,…,xL | si=l, θ)
= Fk(i-1) · akl·el(xi) · Bl(i)

(the first factor is computed via the forward algorithm, the last via the backward algorithm)

Hence: p(si-1=k, si=l | x, θ) = Fk(i-1)·akl·el(xi)·Bl(i) / p(x|θ)
45
Step 1a (end)
For each pair (k,l), compute the expected number of state transitions from k to l, as the sum of the expected number of k to l transitions over all L edges:

Akl = ∑_{i=1}^{L} p(si-1=k, si=l | x, θ) = ∑_{i=1}^{L} Fk(i-1)·akl·el(xi)·Bl(i) / p(x|θ)
46
Step 1a for many sequences:
Exercise: Prove that when we have n independent input sequences (x1,..., xn ), then Akl is given by:
Akl = ∑_{j=1}^{n} 1/p(xj|θ) · ∑_{i=1}^{L} p(si-1=k, si=l, xj | θ)
= ∑_{j=1}^{n} 1/p(xj|θ) · ∑_{i=1}^{L} fk^j(i-1)·akl·el(xi^j)·bl^j(i)

where fk^j and bl^j are the forward and backward algorithms for xj under θ.
47
Baum-Welch: Step 1b: count expected number of symbol emissions
For each state k and each symbol b, for each i where xi = b, compute the expected number of times that si = k.
p(si = k | x1,…,xL) = p(x1,…,xL, si = k) / p(x1,…,xL) = Fk(i)·Bk(i) / p(x1,…,xL)
48
Baum-Welch: Step 1b
For each state k and each symbol b, compute the expected number of emissions of b from k as the sum of the expected number of times that si = k, over all i’s for which xi = b.
Ek(b) = ∑_{i: xi=b} Fk(i)·Bk(i) / p(x|θ)
49
Step 1b for many sequences
Exercise: when we have n sequences (x1,..., xn ), the expected number of emissions of b from k is given by:
Ek(b) = ∑_{j=1}^{n} 1/p(xj|θ) · ∑_{i: xi^j=b} Fk^j(i)·Bk^j(i)
50
Summary of Steps 1a and 1b: the E part of the Baum Welch training
These steps compute the expected numbers Akl of k→l transitions, for all pairs of states k and l, and the expected numbers Ek(b) of emissions of symbol b from state k, for all states k and symbols b.
The next step is the M step, which is identical to the computation of optimal ML parameters when all states are known.
51
Baum-Welch: step 2
'' '
( ) , and ( )
( ')kl k
kl kkl kl b
A E ba e b
A E b
Use the Akl’s, Ek(b)’s to compute the new values of akl and ek(b). These values define θ*.
The correctness of the EM algorithm implies that: p(x1,..., xn|θ*) ≥ p(x1,..., xn|θ),
i.e., θ* does not decrease the probability of the data.
This procedure is iterated, until some convergence criterion is met.
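One full Baum-Welch iteration (the E step of 1a and 1b, followed by the M step of step 2) can be sketched as follows. The two-state model and its starting parameters are illustrative only; a real run would repeat baum_welch_step until convergence.

```python
# Compact sketch of one Baum-Welch (EM) iteration for a toy two-state HMM.
init = {"+": 0.5, "-": 0.5}
trans = {"+": {"+": 0.8, "-": 0.2}, "-": {"+": 0.1, "-": 0.9}}
emit = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
states = list(init)

def forward(x):
    """F_k(i) = p(x_1..x_i, S_i = k)."""
    F = [{k: init[k] * emit[k][x[0]] for k in states}]
    for c in x[1:]:
        F.append({l: sum(F[-1][k] * trans[k][l] for k in states) * emit[l][c]
                  for l in states})
    return F

def backward(x):
    """B_k(i) = p(x_{i+1}..x_L | S_i = k)."""
    B = [{k: 1.0 for k in states}]
    for c in reversed(x[1:]):
        B.insert(0, {k: sum(trans[k][l] * emit[l][c] * B[0][l] for l in states)
                     for k in states})
    return B

def baum_welch_step(x):
    """E-step: expected counts A_kl, E_k(b); M-step: renormalize per state."""
    F, B = forward(x), backward(x)
    px = sum(F[-1][k] for k in states)           # p(x | theta)
    # Step 1a: A_kl = sum_i F_k(i-1) a_kl e_l(x_i) B_l(i) / p(x)
    A = {k: {l: sum(F[i-1][k] * trans[k][l] * emit[l][x[i]] * B[i][l]
                    for i in range(1, len(x))) / px
             for l in states} for k in states}
    # Step 1b: E_k(b) = sum over i with x_i = b of F_k(i) B_k(i) / p(x)
    E = {k: {b: sum(F[i][k] * B[i][k] for i in range(len(x)) if x[i] == b) / px
             for b in "ACGT"} for k in states}
    # Step 2 (M-step): a_kl = A_kl / sum_l' A_kl', e_k(b) = E_k(b) / sum_b' E_k(b')
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in "ACGT"} for k in states}
    return a, e

a, e = baum_welch_step("CGCGATAT")
print(a)
```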