algorithms in computational biology

Department of Mathematics & Computer Science Algorithms in Computational Biology 11

Algorithms in Computational Biology

Markov Chains and Hidden Markov Models


Example: CpG Islands

• Dinucleotide CG (written CpG to distinguish it from the C-G base pair)
  • The C within CG is typically methylated
  • Methyl-C is more likely to mutate to T
• CpG dinucleotides are rarer in the genome than would be expected from the independent probabilities of C and G
• The methylation process is suppressed in short stretches of the genome
  • More CpG dinucleotides in promoter regions of genes
• CpG islands are regions with many CpGs
  • Typically a few hundred to a few thousand bases long


Questions about CpG Islands

• Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not?

• Given a long piece of sequence, how would we find the CpG islands in it if there are any?


Markov Chains

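[Figure: Markov chain state diagram over the four nucleotide states A, C, G and T]

Transition probability:

$$a_{st} = P(x_i = t \mid x_{i-1} = s)$$

Since $P(x, y) = P(x \mid y)\,P(y)$, the probability of a sequence of length $L$ is:

$$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1)$$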


Key Property of a Markov Chain

• The probability of each symbol $x_i$ depends only on the value of the preceding symbol $x_{i-1}$

$$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$$


Modeling the Beginning and End of Sequences

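[Figure: the Markov chain over A, C, G and T, extended with a begin state B and an end state E]

$$P(x_1 = s) = a_{Bs}$$

$$P(E \mid x_L = t) = a_{tE}$$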


Using Markov Chains for Discrimination

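CpG island model (+), row = previous base, column = next base:

+   A      C      G      T
A   0.180  0.274  0.426  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.375  0.125
T   0.079  0.355  0.384  0.182

Non-CpG island model (-):

-   A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292

The transition probabilities are estimated from counts:

$$a^+_{st} = \frac{c^+_{st}}{\sum_{t'} c^+_{st'}}$$

where $c^+_{st}$ is the number of times letter $t$ followed letter $s$ in the CpG island training sequences; $a^-_{st}$ is obtained the same way from the non-CpG regions.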


Using Markov Chains for Discrimination (Cont'd)

• For discrimination, the log-odds ratio is calculated:

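$$S(x) = \log_2 \frac{P(x \mid \text{CpG island})}{P(x \mid \text{non-CpG island})} = \sum_{i=1}^{L} \log_2 \frac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i}$$

Log-odds ratios $\beta_{st}$ in bits (row $s$ = previous base, column $t$ = next base):

β   A      C      G      T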

A -0.740 0.419 0.580 -0.803

C -0.913 0.302 1.812 -0.685

G -0.624 0.461 0.331 -0.730

T -1.169 0.573 0.393 -0.679
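The log-odds score can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the names `PLUS`, `MINUS`, `beta` and `log_odds` are hypothetical, the tables are copied from the two transition matrices above, and the first base is assumed to contribute nothing because no begin-state probabilities are given.

```python
import math

BASES = "ACGT"

# Transition probabilities from the slides: key = previous base,
# list entries = next base in the order A, C, G, T.
PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}


def beta(prev, nxt):
    """Log-odds ratio (in bits) of the transition prev -> nxt."""
    j = BASES.index(nxt)
    return math.log2(PLUS[prev][j] / MINUS[prev][j])


def log_odds(seq):
    """Sum of beta over adjacent pairs of seq.

    Assumption: the first base has no predecessor here, so only
    len(seq) - 1 transitions are scored.
    """
    return sum(beta(a, b) for a, b in zip(seq, seq[1:]))


if __name__ == "__main__":
    print(round(beta("C", "G"), 3))
    print(log_odds("CGCG") > 0, log_odds("ATAT") < 0)
```

A positive score favours the CpG island model and a negative score favours the non-CpG model; dividing by the sequence length gives the length-normalized score.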


Histogram of Length-Normalized Scores

[Figure: histogram of length-normalized log-odds scores, with separate distributions for non-CpG islands and CpG islands]


Locating CpG Islands in a DNA Sequence

Input: A long DNA sequence X = (x1, x2, …, xL) ∈ Σ*

Output: CpG islands along X.
• Use Markov chain models
• Calculate the log-odds score for a window of length k (e.g., 100)
• A total of L-k+1 scores will be computed and plotted
• CpG islands will stand out with positive values
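The window scan described above might look as follows. This is a sketch under stated assumptions: `window_scores` is a hypothetical helper, the transition tables are those of the CpG island (+) and non-CpG (-) models from the slides, and a window of k bases is scored by its k-1 internal transitions.

```python
import math

BASES = "ACGT"
# Rows: previous base A, C, G, T; columns: next base A, C, G, T.
PLUS = [[0.180, 0.274, 0.426, 0.120], [0.171, 0.368, 0.274, 0.188],
        [0.161, 0.339, 0.375, 0.125], [0.079, 0.355, 0.384, 0.182]]
MINUS = [[0.300, 0.205, 0.285, 0.210], [0.322, 0.298, 0.078, 0.302],
         [0.248, 0.246, 0.298, 0.208], [0.177, 0.239, 0.292, 0.292]]


def beta(prev, nxt):
    """Log-odds ratio (bits) of one transition."""
    s, t = BASES.index(prev), BASES.index(nxt)
    return math.log2(PLUS[s][t] / MINUS[s][t])


def window_scores(seq, k):
    """Return the L-k+1 log-odds scores of all windows of length k."""
    pair = [beta(a, b) for a, b in zip(seq, seq[1:])]
    # Window starting at i covers bases i .. i+k-1, i.e. k-1 transitions.
    return [sum(pair[i:i + k - 1]) for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    # Toy sequence (invented for illustration): AT-rich, CG-rich, AT-rich.
    seq = "AT" * 20 + "CG" * 20 + "AT" * 20
    scores = window_scores(seq, 10)
    print(len(scores), scores[0] < 0, scores[45] > 0)
```

Re-summing each window keeps the sketch short; a running sum would avoid the repeated work on long sequences.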


Problems with Markov Chain Models in Locating CpG Islands

• CpG islands have sharp boundaries
• CpG islands have variable lengths

A fixed window length k blurs the boundaries and cannot match islands of varying size. These problems can be better addressed by building a single model for the entire sequence that incorporates both Markov chains.


Formal Definition of an HMM

• A hidden Markov model is a triplet M = (Σ, Q, Θ), where
  • Σ is an alphabet of symbols
  • Q is a finite set of states, capable of emitting symbols from the alphabet Σ
  • Θ is a set of probabilities, comprised of
    • State transition probabilities, denoted by $a_{kl}$ for each k, l ∈ Q
    • Emission probabilities, denoted by $e_k(b)$ for each state k ∈ Q and b ∈ Σ


Formal Definition of an HMM (Cont'd)

• Path (state sequence): $\pi = (\pi_1, \pi_2, \ldots, \pi_L)$
  • Follows a simple Markov chain (the probability of a state depends only on the previous state)
• State transition probability
  • $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
• Emission probability
  • Given a sequence $X = (x_1, x_2, \ldots, x_L)$, the emission probability $e_k(b)$ is defined as $e_k(b) = P(x_i = b \mid \pi_i = k)$
• The probability that the sequence X was generated by M along the path $\pi$ is:

$$P(X, \pi) = a_{\pi_0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$

where $\pi_0$ is the begin state and $\pi_{L+1}$ is the end state.


An HMM for Detecting CpG Islands in a Long DNA Sequence

• Alphabet: Σ = {A, C, G, T}
• States: Q = {A+, C+, G+, T+, A-, C-, G-, T-}
• Emissions

State:           A+  C+  G+  T+  A-  C-  G-  T-
Emitted symbol:  A   C   G   T   A   C   G   T

The emission probability of each state x+ and x- is 1 for emitting symbol x and 0 for emitting other symbols (a special feature of this HMM).


Transition Matrix for CpG Island HMM

Here p is the probability of staying in a CpG island, and q is the probability of staying in a non-CpG island.


Occasionally Dishonest Casino Dealer

• In a casino, a dealer uses a fair die most of the time but occasionally switches to a loaded die. The loaded die has probability 0.5 for a six and probability 0.1 for each of the numbers one to five. The dealer switches from the fair to the loaded die with probability 0.05 before each roll, and the probability of switching back is 0.1.

• In each state of the Markov process the outcomes of a roll have different probabilities, so the process can be modeled using an HMM.


HMM for the Occasionally Dishonest Casino Dealer

• Q = {F, L} (fair, loaded), Σ = {1, 2, 3, 4, 5, 6}

What is hidden?


HMMs Generate Sequences

• Generate a sequence via an HMM
  • Choose $\pi_1$ according to the probabilities $a_{0i}$
  • An observation $x_1$ is emitted according to the probabilities $e_{\pi_1}$
  • Choose $\pi_2$ according to the probabilities $a_{\pi_1 i}$
  • An observation $x_2$ is emitted according to the probabilities $e_{\pi_2}$
  • And so forth
• P(x) is the probability that sequence x was generated by the model
• The joint probability of an observed sequence x and a state sequence $\pi$:

$$P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}, \quad \text{where } \pi_{L+1} = 0$$
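The generation procedure and the joint probability can be sketched for the dishonest casino. The transition and emission values come from the casino slide; the start distribution `START` (0.5 / 0.5) is an assumption because the slides do not give $a_{0k}$, and no explicit end state is modeled, so the final factor $a_{\pi_L 0}$ is dropped. All names are hypothetical.

```python
import random

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumed: not given in the slides
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5},
}


def draw(dist, rng):
    """Sample one key of dist according to its probabilities."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(length, rng):
    """Generate (rolls, path) of the given length from the casino HMM."""
    rolls, path = [], []
    state = draw(START, rng)
    for _ in range(length):
        path.append(state)
        rolls.append(draw(EMIT[state], rng))  # emit from current state
        state = draw(TRANS[state], rng)       # then move to next state
    return rolls, path


def joint_probability(rolls, path):
    """P(x, pi) without an end-state factor (assumption, see lead-in)."""
    p = START[path[0]]
    for i, (x, k) in enumerate(zip(rolls, path)):
        p *= EMIT[k][x]
        if i + 1 < len(path):
            p *= TRANS[k][path[i + 1]]
    return p


if __name__ == "__main__":
    rolls, path = generate(20, random.Random(0))
    print(rolls, "".join(path), joint_probability(rolls, path))
```

Only `rolls` would be visible to an observer; `path` is what the model hides.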


Most Probable State Path

• A CpG island example
  • Sequence CGCG can be emitted by several state paths, e.g.:
    (C+, G+, C+, G+), (C-, G-, C-, G-), (C+, G-, C+, G-)
• Which state sequence is most likely for the observation?
• The most probable path is defined as:

$$\pi^* = \arg\max_{\pi} P(x, \pi)$$

• If the probability $v_k(i)$ of the most probable path ending in state k with observation $x_i$ is known for all states k, then $v_l(i+1)$ is given by:

$$v_l(i+1) = e_l(x_{i+1}) \max_k \big( v_k(i)\, a_{kl} \big)$$


Finding Most Probable Path Using Viterbi Algorithm

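Initialization (i = 0): $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$

Recursion (i = 1…L):

$$v_l(i) = e_l(x_i) \max_k \big( v_k(i-1)\, a_{kl} \big)$$

$$\mathrm{ptr}_i(l) = \arg\max_k \big( v_k(i-1)\, a_{kl} \big)$$

Termination:

$$P(x, \pi^*) = \max_k \big( v_k(L)\, a_{k0} \big)$$

$$\pi^*_L = \arg\max_k \big( v_k(L)\, a_{k0} \big)$$

Traceback (i = L…1): $\pi^*_{i-1} = \mathrm{ptr}_i(\pi^*_i)$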
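A compact sketch of the Viterbi recursion for the dishonest casino is given here. It works with log probabilities to avoid underflow, which is a choice of this sketch rather than something the slide prescribes. The start distribution is assumed to be 0.5 / 0.5 and no end state is modeled (so $a_{k0}$ is treated as 1); the names are hypothetical.

```python
import math

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumed: not given in the slides
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5},
}


def viterbi(rolls):
    """Return (most probable state path, its log joint probability)."""
    # Initialization folded into the first column: log(a_0k * e_k(x_1)).
    v = {k: math.log(START[k]) + math.log(EMIT[k][rolls[0]]) for k in STATES}
    pointers = []
    for x in rolls[1:]:
        new_v, back = {}, {}
        for l in STATES:
            # Best predecessor k for state l: argmax_k v_k(i-1) + log a_kl.
            best = max(STATES, key=lambda k: v[k] + math.log(TRANS[k][l]))
            back[l] = best
            new_v[l] = (math.log(EMIT[l][x]) + v[best]
                        + math.log(TRANS[best][l]))
        v = new_v
        pointers.append(back)
    # Termination (no end state) and traceback.
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for back in reversed(pointers):
        path.append(back[path[-1]])
    path.reverse()
    return path, v[last]


if __name__ == "__main__":
    path, logp = viterbi([1, 2, 6, 6, 6, 6, 6, 6, 3, 4])
    print("".join(path), logp)
```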


Viterbi Example

V C G C G

B 1 0 0 0 0

A+ 0 0 0 0 0

C+ 0 0.13 0 0.011 0

G+ 0 0 0.034 0 0.003

T+ 0 0 0 0 0

A- 0 0 0 0 0

C- 0 0.13 0 0.002 0

G- 0 0 0.010 0 0.0002

T- 0 0 0 0 0

Most probable path for sequence CGCG


Sequence of Die Rolls Predicted by Viterbi Algorithm


Finding the Probability of a Sequence for an HMM: the Forward Algorithm

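Definitions:

$$f_k(i) = P(x_1, \ldots, x_i, \pi_i = k)$$

$$f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i)\, a_{kl}$$

Algorithm:

Initialization (i = 0): $f_0(0) = 1$, $f_k(0) = 0$ for $k > 0$

Recursion (i = 1…L):

$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$$

Termination:

$$P(x) = \sum_k f_k(L)\, a_{k0}$$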
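The forward recursion differs from Viterbi only in replacing the maximum by a sum. A minimal sketch for the casino model follows, in plain probability space and therefore suitable only for short sequences. As assumptions, the start distribution is 0.5 / 0.5 and no end state is modeled, so the termination step is simply the sum of $f_k(L)$. The names are hypothetical.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumed: not given in the slides
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5},
}


def forward(rolls):
    """Return (forward table f, P(x)); f[i][k] corresponds to f_k(i+1)."""
    f = [{k: START[k] * EMIT[k][rolls[0]] for k in STATES}]
    for x in rolls[1:]:
        prev = f[-1]
        f.append({
            l: EMIT[l][x] * sum(prev[k] * TRANS[k][l] for k in STATES)
            for l in STATES
        })
    # Termination without an end state: sum over the last column.
    return f, sum(f[-1].values())


if __name__ == "__main__":
    _, px = forward([6, 6, 6, 1, 2])
    print(px)
```

Because there is no end state, the probabilities of all sequences of a fixed length sum to one, which gives a quick sanity check.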


Posterior State Probability

• We want to know the most probable state for an observation $x_i$

• We need to find the probability that observation $x_i$ came from each state k, given the observed sequence

$$P(\pi_i = k \mid x) = \frac{P(x, \pi_i = k)}{P(x)} = \frac{P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_L \mid x_1, \ldots, x_i, \pi_i = k)}{P(x)} = \frac{f_k(i)\, P(x_{i+1}, \ldots, x_L \mid \pi_i = k)}{P(x)} = \frac{f_k(i)\, b_k(i)}{P(x)}$$


Finding bk(i) Using Backward Algorithm

Initialization (i = L): $b_k(L) = a_{k0}$ for all k

Recursion (i = L-1, …, 1):

$$b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$$

Termination:

$$P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$$
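Combining the backward table with the forward table yields the posterior state probabilities $f_k(i)\,b_k(i)/P(x)$. The sketch below does this for the casino model in plain probability space (short sequences only). As assumptions, the start distribution is 0.5 / 0.5 and there is no end state, so $b_k(L) = 1$ replaces $b_k(L) = a_{k0}$. All names are hypothetical.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumed: not given in the slides
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5},
}


def forward(rolls):
    """Forward table and P(x) (no end state)."""
    f = [{k: START[k] * EMIT[k][rolls[0]] for k in STATES}]
    for x in rolls[1:]:
        prev = f[-1]
        f.append({l: EMIT[l][x] * sum(prev[k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return f, sum(f[-1].values())


def backward(rolls):
    """Backward table; b[i][k] corresponds to b_k(i+1)."""
    b = [{k: 1.0 for k in STATES}]  # b_k(L) = 1 because no end state
    # x runs over x_L, x_{L-1}, ..., x_2, filling columns L-1 down to 1.
    for x in reversed(rolls[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(TRANS[k][l] * EMIT[l][x] * nxt[l] for l in STATES)
                     for k in STATES})
    return b


def posterior(rolls):
    """P(pi_i = k | x) for every position i and state k."""
    f, px = forward(rolls)
    b = backward(rolls)
    return [{k: f[i][k] * b[i][k] / px for k in STATES}
            for i in range(len(rolls))]


if __name__ == "__main__":
    for roll, p in zip([1, 3, 6, 6, 6, 6, 2], posterior([1, 3, 6, 6, 6, 6, 2])):
        print(roll, round(p["L"], 3))
```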


Posterior Decoding

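• Approach 1: at each position, choose the state with the highest posterior probability

$$\hat{\pi}_i = \arg\max_k P(\pi_i = k \mid x)$$

• Approach 2: compute the posterior expectation of a function g(k) defined on the states

$$G(i \mid x) = \sum_k P(\pi_i = k \mid x)\, g(k)$$

• E.g., to find the posterior probability according to the model that base i is in a CpG island, we can let
  g(k) = 1 for k ∈ {A+, C+, G+, T+}
  g(k) = 0 for k ∈ {A-, C-, G-, T-}
  Then G(i|x) is precisely the posterior probability that base i is in a CpG island.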


Use of Posterior Decoding

[Figure: posterior probability plot for the die-roll sequence. Shaded areas show when the roll was generated by the loaded die.]


Parameter Estimation for HMMs

• Model specification
  • Structure design
    • What states there are and how they are connected
  • Assignment of parameter values
    • Transition probabilities $a_{kl}$
    • Emission probabilities $e_k(b)$
• Estimation framework
  • Training sequences $x^1, \ldots, x^n$
  • Work in log space

$$l(x^1, \ldots, x^n) = \log P(x^1, \ldots, x^n \mid \theta) = \sum_{j=1}^{n} \log P(x^j \mid \theta)$$


Estimation When the State Sequence is Known

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

• $A_{kl}$ = number of transitions k to l in the training data + $r_{kl}$
• $E_k(b)$ = number of emissions of b from k in the training data + $r_k(b)$
• Both $r_{kl}$ and $r_k(b)$ are predetermined pseudocounts to prevent overfitting
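When the paths are known, estimation reduces to counting. The sketch below is illustrative only: `estimate` is a hypothetical helper, and it uses one uniform pseudocount `pseudo` for every $r_{kl}$ and $r_k(b)$, which is an assumption; the slides only say the pseudocounts are predetermined.

```python
from collections import Counter


def estimate(paths, seqs, states, alphabet, pseudo=1.0):
    """Estimate a_kl and e_k(b) from labeled training data.

    paths[j][i] is the state that emitted seqs[j][i].
    """
    A, E = Counter(), Counter()
    for path, seq in zip(paths, seqs):
        for k, l in zip(path, path[1:]):
            A[k, l] += 1                      # observed transition k -> l
        for k, b in zip(path, seq):
            E[k, b] += 1                      # observed emission of b from k
    a = {k: {l: (A[k, l] + pseudo) / sum(A[k, m] + pseudo for m in states)
             for l in states} for k in states}
    e = {k: {b: (E[k, b] + pseudo) / sum(E[k, c] + pseudo for c in alphabet)
             for b in alphabet} for k in states}
    return a, e


if __name__ == "__main__":
    # Toy labeled data (invented): two fair rolls followed by two loaded rolls.
    a, e = estimate(["FFLL"], ["1266"], "FL", "123456")
    print(a["F"], e["L"]["6"])
```

With the toy data, the F row has one F to F and one F to L transition, so both estimates are (1 + 1) / (2 + 2) = 0.5.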


Estimation When Paths Are Unknown

• Baum (1971)
  • Calculate $A_{kl}$ and $E_k(b)$ as the expected number of times each transition or emission is used, given the training sequences
  • Subject to local maxima
  • The local maximum reached depends on the starting values of the parameters
• The probability that $a_{kl}$ is used at position i in sequence x is:

$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$$


Expected Transition and Emission Counts

• The expected number of times that $a_{kl}$ is used can be obtained by summing over all positions and over all training sequences:

$$A_{kl} = \sum_{j=1}^{n} \frac{1}{P(x^j)} \sum_i f^j_k(i)\, a_{kl}\, e_l(x^j_{i+1})\, b^j_l(i+1)$$

• The expected number of times that letter b appears in state k:

$$E_k(b) = \sum_{j=1}^{n} \frac{1}{P(x^j)} \sum_{\{i \mid x^j_i = b\}} f^j_k(i)\, b^j_k(i)$$


Baum-Welch Training (EM algorithm)

Initialization: Pick arbitrary model parameters

Recurrence:
• Set all the A and E variables to their pseudocount values r (or to zero)
• For each sequence j = 1 … n
  • Calculate $f_k(i)$ for sequence j using the forward algorithm
  • Calculate $b_k(i)$ for sequence j using the backward algorithm
  • Add the contribution of sequence j to A and E
• Calculate the new model parameters
• Calculate the new log likelihood of the model

Termination: Stop if the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded
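The loop above can be sketched in plain Python. This is a simplified illustration under several assumptions: it runs in unscaled probability space (short sequences only), keeps the start distribution fixed instead of re-estimating it, models no end state, and uses a single uniform pseudocount. All function and parameter names are hypothetical.

```python
import math


def forward(seq, start, a, e):
    f = [{k: start[k] * e[k][seq[0]] for k in start}]
    for x in seq[1:]:
        f.append({l: e[l][x] * sum(f[-1][k] * a[k][l] for k in start)
                  for l in start})
    return f


def backward(seq, start, a, e):
    b = [{k: 1.0 for k in start}]  # no end state (assumption)
    for x in reversed(seq[1:]):
        b.insert(0, {k: sum(a[k][l] * e[l][x] * b[0][l] for l in start)
                     for k in start})
    return b


def baum_welch_step(seqs, start, a, e, alphabet, pseudo=0.0):
    """One recurrence step: returns (new a, new e, log likelihood under old a, e)."""
    states = list(start)
    A = {k: {l: pseudo for l in states} for k in states}
    E = {k: {s: pseudo for s in alphabet} for k in states}
    loglik = 0.0
    for seq in seqs:
        f, b = forward(seq, start, a, e), backward(seq, start, a, e)
        px = sum(f[-1].values())
        loglik += math.log(px)
        for i, x in enumerate(seq):
            for k in states:
                E[k][x] += f[i][k] * b[i][k] / px          # expected emissions
                if i + 1 < len(seq):
                    for l in states:                       # expected transitions
                        A[k][l] += (f[i][k] * a[k][l] * e[l][seq[i + 1]]
                                    * b[i + 1][l] / px)
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {s: E[k][s] / sum(E[k].values()) for s in alphabet} for k in states}
    return new_a, new_e, loglik


def baum_welch(seqs, start, a, e, alphabet, tol=1e-6, max_iter=200):
    """Iterate until the log likelihood improves by less than tol."""
    prev = -math.inf
    for _ in range(max_iter):
        a, e, loglik = baum_welch_step(seqs, start, a, e, alphabet)
        if loglik - prev < tol:
            break
        prev = loglik
    return a, e, loglik
```

With `pseudo=0.0` each step cannot decrease the likelihood; with nonzero pseudocounts the update is a smoothed variant, so the reported log likelihood may not be strictly monotone.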


Modeling of Labeled Sequences

• HMMs can be used to predict the labeling of unannotated sequences

• Training for HMMs
  • Separately train the model for CpG islands and the model for non-CpG islands
  • Combine them into a larger HMM
  • Tedious, especially if more than two classes are involved
• It would be preferable to estimate everything at once
  • Training set includes all classes (e.g., CpG islands and non-CpG islands)
  • Each sequence is labeled with the corresponding classes
  • Let $y = y_1, \ldots, y_L$ be the labels on the observation $x = x_1, \ldots, x_L$


Modeling of Labeled Sequences (Cont'd)

• The model can be estimated with a slight modification of the Baum-Welch algorithm
  • Allow only valid paths through the model
  • A valid path is one where the state labels and sequence labels are the same, i.e., $\pi_i$ has label $y_i$
  • During the forward and backward algorithms this corresponds to setting $f_l(i) = 0$ and $b_l(i) = 0$ for all states l with a label different from $y_i$


Discriminative Estimation

• When modeling labeled sequences, the following likelihood is maximized:

$$\theta^{ML} = \arg\max_{\theta} P(x, y \mid \theta)$$

• Since obtaining a good prediction of y is our primary interest, it is preferable to maximize the conditional likelihood instead:

$$\theta^{CML} = \arg\max_{\theta} P(y \mid x, \theta)$$

• $P(y \mid x, \theta)$ can be rewritten as

$$P(y \mid x, \theta) = \frac{P(x, y \mid \theta)}{P(x \mid \theta)}$$

  • Numerator: probability calculated by the forward algorithm for the labeled sequences
  • Denominator: probability calculated by the forward algorithm disregarding all the labels


HMM Model Structure

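• Choice of model topology
  • A fully connected model is prone to local maxima
  • In practice, successful HMMs are constructed by carefully deciding which transitions are allowed in the model, based on knowledge about the problem under investigation
• Duration modeling
  • With a single self-looping state, the probability decays exponentially with length (geometric distribution)
    $P(L) = (1-p)\,p^{L-1}$ (p: self-transition probability; 1-p: probability of leaving the state)
  • Modeling a more complex length distribution
    • Introduce several states with the same distribution over residues and transitions between each other
    • E.g., a negative binomial distribution, obtained from a chain of n states that each have self-transition probability p and move on with probability 1-p

[Figure: chain of four states, each with a self-loop of probability p and a transition of probability 1-p to the next state]

$$P(L) = \binom{L-1}{n-1}\, p^{L-n}\, (1-p)^{n}$$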
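The two length distributions are easy to compare numerically. The helper below is a hypothetical sketch of the negative binomial formula for a chain of n identical states; with n = 1 it reduces to the geometric case of a single self-looping state.

```python
from math import comb


def duration_pmf(length, n, p):
    """P(L = length) for a chain of n states with self-transition p.

    Each state emits at least once, so lengths below n are impossible.
    """
    if length < n:
        return 0.0
    return comb(length - 1, n - 1) * p ** (length - n) * (1 - p) ** n


if __name__ == "__main__":
    p = 0.9
    for n in (1, 4):
        mean = sum(L * duration_pmf(L, n, p) for L in range(1, 2000))
        mode = max(range(1, 2000), key=lambda L: duration_pmf(L, n, p))
        print(f"n={n}: mean length {mean:.1f}, most likely length {mode}")
```

The geometric distribution always has its mode at length 1, whereas the chain shifts the most likely length away from 1, which is the reason for introducing the extra states.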


Numerical Stability of HMM Algorithms

• Probabilities become extremely small when many of them are multiplied in the Viterbi, forward and backward algorithms
• Consequences
  • Underflow error
    • The program may crash
    • The program may keep running and produce arbitrary wrong numbers


Improving Numerical Stability

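• Log transform (with $V_l(i) = \log v_l(i)$, $\tilde{a}_{kl} = \log a_{kl}$ and $\tilde{e}_l(x) = \log e_l(x)$)

$$V_l(i+1) = \tilde{e}_l(x_{i+1}) + \max_k \big( V_k(i) + \tilde{a}_{kl} \big)$$

• Scaling of probabilities
  • For each i define a scaling variable $s_i$

$$\tilde{f}_l(i) = \frac{f_l(i)}{\prod_{j=1}^{i} s_j}$$

$$\tilde{f}_l(i+1) = \frac{1}{s_{i+1}}\, e_l(x_{i+1}) \sum_k \tilde{f}_k(i)\, a_{kl}$$

$$\tilde{b}_k(i) = \frac{1}{s_i} \sum_l a_{kl}\, \tilde{b}_l(i+1)\, e_l(x_{i+1})$$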
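A scaled forward pass for the casino model can be sketched as follows. The choice of $s_i$ so that the scaled forward values sum to one at each position is a common convention and an assumption here, since the slide leaves $s_i$ open; with it, $\log P(x) = \sum_i \log s_i$. The start distribution (0.5 / 0.5) and the absence of an end state are also assumptions, and the names are hypothetical.

```python
import math

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}  # assumed: not given in the slides
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in range(1, 7)},
    "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5},
}


def forward_loglik(rolls):
    """Return log P(x) using the scaled forward recursion."""
    loglik = 0.0
    prev = None
    for x in rolls:
        if prev is None:
            f = {l: START[l] * EMIT[l][x] for l in STATES}
        else:
            f = {l: EMIT[l][x] * sum(prev[k] * TRANS[k][l] for k in STATES)
                 for l in STATES}
        s = sum(f.values())                 # scaling variable s_i
        loglik += math.log(s)               # log P(x) accumulates log s_i
        prev = {l: f[l] / s for l in STATES}  # scaled values sum to 1
    return loglik


if __name__ == "__main__":
    # 6000 rolls: the unscaled product would underflow to 0.0 in floats.
    print(forward_loglik([1, 2, 3, 4, 5, 6] * 1000))
```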
