algorithms in computational biology


Upload: liluye

Post on 24-Jan-2016


Department of Mathematics & Computer Science

Algorithms in Computational Biology

Markov Chains and Hidden Markov Model


Example: CpG Islands

• Dinucleotide CG (CpG to distinguish it from the C-G base pair)
• C within CG is typically methylated
• Methyl-C is more likely to mutate to T
• CpG dinucleotides are rarer in the genome than would be expected from the independent probabilities of C and G
• The methylation process is suppressed in short stretches of the genome
• More CpG dinucleotides in promoter regions of genes
• CpG islands are regions with many CpGs
• Typically a few hundred to a few thousand bases long


Questions about CpG Islands

• Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not?

• Given a long piece of sequence, how would we find the CpG islands in it if there are any?


Markov Chains

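[Figure: state diagram with one state for each nucleotide A, C, G, T]

Transition probability:

$$a_{st} = P(x_i = t \mid x_{i-1} = s)$$

Since $P(x, y) = P(x \mid y)\,P(y)$, the probability of a sequence of length $L$ is

$$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1}, \ldots, x_1)\,P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1)$$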


Key Property of a Markov Chain

• The probability of each symbol $x_i$ depends only on the value of the preceding symbol $x_{i-1}$

$$P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_L \mid x_{L-1})\,P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\,P(x_1) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$$
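The factorization above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `markov_chain_probability` and the two-letter toy chain are hypothetical and were chosen only to keep the arithmetic easy to follow.

```python
from math import prod

def markov_chain_probability(x, initial, transition):
    """P(x) = P(x_1) * prod_{i=2..L} a_{x_{i-1} x_i} for a first-order chain."""
    if not x:
        raise ValueError("sequence must be non-empty")
    steps = (transition[s][t] for s, t in zip(x, x[1:]))
    return initial[x[0]] * prod(steps)

# Hypothetical two-letter chain (assumption, not from the slides).
initial = {"A": 0.5, "B": 0.5}
transition = {
    "A": {"A": 0.9, "B": 0.1},
    "B": {"A": 0.2, "B": 0.8},
}

p = markov_chain_probability("AAB", initial, transition)  # 0.5 * 0.9 * 0.1
```

Because each row of the transition table sums to 1, the probabilities of all sequences of a fixed length also sum to 1, which is a convenient sanity check.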


Modeling the Beginning and End of Sequences

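[Figure: the nucleotide state diagram (A, C, G, T) extended with a begin state B and an end state E]

$$P(x_1 = s) = a_{Bs}$$

$$P(E \mid x_L = t) = a_{tE}$$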


Using Markov Chains for Discrimination

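CpG island model (+); rows are the preceding symbol, columns the following symbol:

+       A      C      G      T
A   0.180  0.274  0.426  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.375  0.125
T   0.079  0.355  0.384  0.182

Non-CpG island model (-):

-       A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292

The transition probabilities are estimated from counts:

$$a^{+}_{st} = \frac{c^{+}_{st}}{\sum_{t'} c^{+}_{st'}}$$

where $c^{+}_{st}$ is the number of times letter t follows letter s in the CpG island training sequences; $a^{-}_{st}$ is obtained in the same way from non-CpG island sequences.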


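Using Markov Chains for Discrimination (cont.)

• For discrimination, the log-odds ratio is calculated:

$$S(x) = \log \frac{P(x \mid \text{CpG island})}{P(x \mid \text{non-CpG island})} = \sum_{i=2}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=2}^{L} \beta_{x_{i-1} x_i}$$

Log-odds ratios $\beta_{st}$ in bits (base-2 logarithms); rows are the preceding symbol s, columns the following symbol t:

β        A       C       G       T
A   -0.740   0.419   0.580  -0.803
C   -0.913   0.302   1.812  -0.685
G   -0.624   0.461   0.331  -0.730
T   -1.169   0.573   0.393  -0.679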
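As a rough check of the log-odds table, the sketch below recomputes each entry as a base-2 logarithm of the ratio of the two transition tables and sums the entries along a sequence. The names `PLUS`, `MINUS`, `beta` and `log_odds_score` are illustrative choices, not from the slides, and the score ignores the first symbol (it sums over transitions only).

```python
from math import log2

BASES = "ACGT"
# Transition probabilities copied from the "+" and "-" tables
# (rows: preceding symbol, columns: following symbol in the order A, C, G, T).
PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}

def beta(s, t):
    """Log-odds ratio (in bits) of the transition s -> t."""
    j = BASES.index(t)
    return log2(PLUS[s][j] / MINUS[s][j])

def log_odds_score(x):
    """Sum of the transition log-odds along x; positive values favor the CpG island model."""
    return sum(beta(s, t) for s, t in zip(x, x[1:]))
```

With these tables, a CG-rich string such as "CGCG" scores positive and an AT-rich string such as "ATAT" scores negative.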


Histogram of Length-Normalized Scores

[Figure: histograms of length-normalized scores for non-CpG islands and CpG islands]


Locating CpG Islands in a DNA Sequence

Input: A long DNA sequence $X = (x_1, x_2, \ldots, x_L)$

Output: CpG islands along X.
• Use Markov chain models
• Calculate the log-odds score for a window of length k (e.g., 100)
• A total of L-k+1 scores will be computed and plotted
• CpG islands will stand out with positive values
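The windowed scan can be sketched as follows. This is an illustrative implementation under stated assumptions: the table `BETA` is the log-odds table from the discrimination slide, the function name `window_scores` is hypothetical, and each window score sums the k-1 transitions inside the window.

```python
BASES = "ACGT"
# Log-odds table in bits (rows: preceding symbol, columns: A, C, G, T).
BETA = {
    "A": [-0.740, 0.419, 0.580, -0.803],
    "C": [-0.913, 0.302, 1.812, -0.685],
    "G": [-0.624, 0.461, 0.331, -0.730],
    "T": [-1.169, 0.573, 0.393, -0.679],
}

def window_scores(x, k):
    """Return the L-k+1 log-odds scores of all windows of length k in x."""
    if k < 2 or k > len(x):
        raise ValueError("need 2 <= k <= len(x)")
    step = [BETA[s][BASES.index(t)] for s, t in zip(x, x[1:])]
    score = sum(step[: k - 1])
    scores = [score]
    for start in range(1, len(x) - k + 1):
        # Slide the window: drop the leftmost transition, add the new rightmost one.
        score += step[start + k - 2] - step[start - 1]
        scores.append(score)
    return scores
```

The running update keeps the scan linear in L instead of recomputing every window from scratch.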


Problems with Markov Chain Models in Locating CpG Islands

• CpG islands have sharp boundaries
• CpG islands have variable lengths

These problems can be better addressed by building a single model for the entire sequence that incorporates both Markov chains


Formal Definition of an HMM

• A hidden Markov model is a triplet M = ($\Sigma$, Q, $\Theta$), where
• $\Sigma$ is an alphabet of symbols
• Q is a finite set of states, capable of emitting symbols from the alphabet $\Sigma$
• $\Theta$ is a set of probabilities, comprised of
• State transition probabilities, denoted by $a_{kl}$ for each $k, l \in Q$
• Emission probabilities, denoted by $e_k(b)$ for each state $k \in Q$ and $b \in \Sigma$


Formal Definition of an HMM (cont.)

• State sequence (path): $\pi = (\pi_1, \pi_2, \ldots, \pi_L)$
• Follows a simple Markov chain (the probability of a state depends only on the previous state)
• State transition probability: $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
• Emission probability: given a sequence $X = (x_1, x_2, \ldots, x_L)$, the emission probability $e_k(b)$ is defined as $e_k(b) = P(x_i = b \mid \pi_i = k)$
• The probability that the model M generates the sequence X along the path $\pi$ is:

$$P(X, \pi) = a_{\pi_0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$

where $\pi_0$ is the begin state and $\pi_{L+1}$ is the end state.


An HMM for Detecting CpG Islands in a Long DNA Sequence

• Alphabet: $\Sigma$ = {A, C, G, T}
• States: Q = {A+, C+, G+, T+, A-, C-, G-, T-}
• Emissions:

State:           A+  C+  G+  T+  A-  C-  G-  T-
Emitted symbol:  A   C   G   T   A   C   G   T

The emission probability of each state x+ and x- is 1 for emitting symbol x and 0 for emitting any other symbol (a special feature of this HMM).


Transition Matrix for CpG Island HMM

[Figure: transition matrix of the CpG island HMM]

Here p is the probability of staying in a CpG island, and q is the probability of staying in a non-CpG island.


Occasionally Dishonest Casino Dealer

• In a casino, a dealer uses a fair die most of the time but occasionally switches to a loaded die. The loaded die has probability 0.5 for a six and probability 0.1 for each of the numbers one to five. The dealer switches from the fair die to the loaded die with probability 0.05 before each roll, and the probability of switching back is 0.1.

• In each state of the Markov process the outcomes of a roll have different probabilities, so the process can be modeled using an HMM


HMM for the Occasionally Dishonest Casino Dealer

• Q = {F, L}, $\Sigma$ = {1, 2, 3, 4, 5, 6}

What is hidden?


HMMs Generate Sequences

• Generate a sequence via an HMM
• Choose $\pi_1$ according to the probabilities $a_{0i}$
• An observation $x_1$ is emitted according to the probabilities $e_{\pi_1}$
• Choose $\pi_2$ according to the probabilities $a_{\pi_1 i}$
• An observation $x_2$ is emitted according to the probabilities $e_{\pi_2}$
• And so forth
• P(x) is the probability that sequence x was generated by the model
• The joint probability of an observed sequence x and a state sequence $\pi$:

$$P(x, \pi) = a_{0 \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}, \quad \text{where } \pi_{L+1} = 0$$
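The generative procedure can be sketched for the occasionally dishonest casino. The transition and emission values come from the casino slide; the uniform start distribution `START`, the fixed sequence length, and the omission of an end transition are assumptions made here, since the slides do not specify them. The helper names `draw`, `generate` and `joint_probability` are illustrative.

```python
import random

STATES = ("F", "L")                       # fair die, loaded die
START = {"F": 0.5, "L": 0.5}              # assumption: the slides do not give a_0k
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def draw(dist, rng):
    """Sample one key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate(length, seed=0):
    """Return (path, rolls): choose a state, emit a roll, move on, and repeat."""
    rng = random.Random(seed)
    path, rolls = [], []
    state = draw(START, rng)
    for _ in range(length):
        path.append(state)
        rolls.append(draw(EMIT[state], rng))
        state = draw(TRANS[state], rng)
    return path, rolls

def joint_probability(rolls, path):
    """P(x, pi) as on the slide, but without a final end transition (assumption)."""
    p = START[path[0]]
    for i, (state, roll) in enumerate(zip(path, rolls)):
        p *= EMIT[state][roll]
        if i + 1 < len(path):
            p *= TRANS[state][path[i + 1]]
    return p
```

For example, two sixes rolled with the loaded die both times have joint probability 0.5 × 0.5 × 0.9 × 0.5 under these assumptions.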


Most Probable State Path

• A CpG island example
• The sequence CGCG can be emitted by state sequences such as (C+, G+, C+, G+), (C-, G-, C-, G-), (C+, G-, C+, G-)
• Which state sequence is more likely for the observation?
• The most probable path is defined as:

$$\pi^{*} = \arg\max_{\pi} P(x, \pi)$$

• If the probability $v_k(i)$ of the most probable path ending in state k with observation $x_i$ is known for all states k, then $v_l(i+1)$ is given by:

$$v_l(i+1) = e_l(x_{i+1}) \max_k \big( v_k(i)\, a_{kl} \big)$$


Finding Most Probable Path Using Viterbi Algorithm

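Initialization (i = 0): $v_0(0) = 1$, $v_k(0) = 0$ for k > 0

Recursion (i = 1…L):

$$v_l(i) = e_l(x_i) \max_k \big( v_k(i-1)\, a_{kl} \big)$$

$$\mathrm{ptr}_i(l) = \arg\max_k \big( v_k(i-1)\, a_{kl} \big)$$

Termination:

$$P(x, \pi^{*}) = \max_k \big( v_k(L)\, a_{k0} \big)$$

$$\pi^{*}_L = \arg\max_k \big( v_k(L)\, a_{k0} \big)$$

Traceback (i = L…1): $\pi^{*}_{i-1} = \mathrm{ptr}_i(\pi^{*}_i)$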
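A minimal sketch of the Viterbi recursion, applied to the occasionally dishonest casino rather than the CpG model. It works with log probabilities, assumes a uniform start distribution, and does not model an end state (equivalent to treating all $a_{k0}$ as equal); these choices and the name `viterbi` are assumptions of this sketch, not part of the slides.

```python
from math import log

STATES = ("F", "L")                       # fair die, loaded die
START = {"F": 0.5, "L": 0.5}              # assumption: the slides do not give a_0k
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def viterbi(rolls):
    """Return the most probable state path for `rolls` as a string of 'F'/'L'."""
    v = [{k: log(START[k]) + log(EMIT[k][rolls[0]]) for k in STATES}]
    ptr = []
    for x in rolls[1:]:
        prev, col, back = v[-1], {}, {}
        for l in STATES:
            # Best predecessor k of state l: argmax_k v_k(i-1) + log a_kl.
            best = max(STATES, key=lambda k: prev[k] + log(TRANS[k][l]))
            back[l] = best
            col[l] = log(EMIT[l][x]) + prev[best] + log(TRANS[best][l])
        v.append(col)
        ptr.append(back)
    # Termination and traceback.
    state = max(STATES, key=lambda k: v[-1][k])
    path = [state]
    for back in reversed(ptr):
        state = back[state]
        path.append(state)
    return "".join(reversed(path))
```

A long run of sixes is decoded as the loaded die, while a run without sixes is decoded as the fair die.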


Viterbi Example

v           C      G      C      G
B    1      0      0      0      0
A+   0      0      0      0      0
C+   0      0.13   0      0.011  0
G+   0      0      0.034  0      0.003
T+   0      0      0      0      0
A-   0      0      0      0      0
C-   0      0.13   0      0.002  0
G-   0      0      0.010  0      0.0002
T-   0      0      0      0      0

Most probable path for sequence CGCG


Sequence of Die Rolls Predicted by Viterbi Algorithm


Finding the Probability of a Sequence for an HMM: the Forward Algorithm

Definitions:

$$f_k(i) = P(x_1, \ldots, x_i, \pi_i = k)$$

$$f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i)\, a_{kl}$$

Algorithm:

Initialization (i = 0): $f_0(0) = 1$, $f_k(0) = 0$ for k > 0

Recursion (i = 1…L):

$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$$

Termination:

$$P(x) = \sum_k f_k(L)\, a_{k0}$$
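The forward recursion can be sketched on the casino model. As assumptions of this sketch, the start distribution is uniform and no end state is modeled, so the termination step reduces to summing the last column (that is, $a_{k0}$ is taken as 1). The function name `forward` and the 0-based column indexing are illustrative.

```python
STATES = ("F", "L")                       # fair die, loaded die
START = {"F": 0.5, "L": 0.5}              # assumption: the slides do not give a_0k
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def forward(rolls):
    """Return (P(x), f), where f[i][k] is the forward variable for roll i (0-based)."""
    f = [{k: START[k] * EMIT[k][rolls[0]] for k in STATES}]
    for x in rolls[1:]:
        prev = f[-1]
        f.append({l: EMIT[l][x] * sum(prev[k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return sum(f[-1].values()), f
```

Without an end state, the probabilities of all roll sequences of a given length sum to 1, which gives a quick consistency check.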


Posterior State Probability

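• We want to know the most probable state for an observation $x_i$
• We need the probability that observation $x_i$ came from each state k, given the observed sequence:

$$\begin{aligned}
P(\pi_i = k \mid x) &= \frac{P(x, \pi_i = k)}{P(x)} \\
&= \frac{P(x_1 \ldots x_i, \pi_i = k)\, P(x_{i+1} \ldots x_L \mid x_1 \ldots x_i, \pi_i = k)}{P(x)} \\
&= \frac{f_k(i)\, P(x_{i+1} \ldots x_L \mid \pi_i = k)}{P(x)} \\
&= \frac{f_k(i)\, b_k(i)}{P(x)}
\end{aligned}$$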


Finding $b_k(i)$ Using the Backward Algorithm

Initialization (i = L):

$$b_k(L) = a_{k0} \quad \text{for all } k$$

Recursion (i = L-1, …, 1):

$$b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$$

Termination:

$$P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$$
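A sketch of the backward recursion on the casino model. As assumptions, the start distribution is uniform and no end state is modeled, so the initialization uses $b_k(L) = 1$ in place of $a_{k0}$. The name `backward` and the 0-based indexing are illustrative.

```python
STATES = ("F", "L")                       # fair die, loaded die
START = {"F": 0.5, "L": 0.5}              # assumption: the slides do not give a_0k
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def backward(rolls):
    """Return (P(x), b), where b[i][k] is the backward variable for roll i (0-based)."""
    L = len(rolls)
    b = [None] * L
    b[L - 1] = {k: 1.0 for k in STATES}   # b_k(L) = a_k0, taken as 1 (no end state)
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][l] * EMIT[l][rolls[i + 1]] * b[i + 1][l]
                       for l in STATES)
                for k in STATES}
    # Termination: P(x) = sum_l a_0l e_l(x_1) b_l(1).
    p_x = sum(START[l] * EMIT[l][rolls[0]] * b[0][l] for l in STATES)
    return p_x, b
```

The value of P(x) returned here should agree with the one obtained from the forward algorithm under the same assumptions.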


Posterior Decoding

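• Approach 1: at each position, choose the state with the highest posterior probability

$$\hat{\pi}_i = \arg\max_k P(\pi_i = k \mid x)$$

• Approach 2: sum the posterior probabilities weighted by a function g(k) defined on the states

$$G(i \mid x) = \sum_k P(\pi_i = k \mid x)\, g(k)$$

• E.g., to find the posterior probability according to the model that base i is in a CpG island, we can let

g(k) = 1 for k ∈ {A+, C+, G+, T+}
g(k) = 0 for k ∈ {A-, C-, G-, T-}

Then G(i|x) is precisely the posterior probability that base i is in a CpG island.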
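Both approaches can be sketched on the casino model, with g(F) = 0 and g(L) = 1 so that G(i|x) is the posterior probability of the loaded die. The uniform start distribution, the absence of an end state, and the names `posterior_loaded` and `decode` are assumptions of this sketch; it uses unscaled probabilities and is therefore only suitable for short sequences.

```python
STATES = ("F", "L")                       # fair die, loaded die
START = {"F": 0.5, "L": 0.5}              # assumption: the slides do not give a_0k
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def posterior_loaded(rolls):
    """G(i|x) with g(F)=0, g(L)=1: P(pi_i = L | x) for every roll i."""
    f = [{k: START[k] * EMIT[k][rolls[0]] for k in STATES}]
    for x in rolls[1:]:
        f.append({l: EMIT[l][x] * sum(f[-1][k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    b = [{k: 1.0 for k in STATES}]        # no end state: b_k(L) = 1
    for x in reversed(rolls[1:]):
        b.insert(0, {k: sum(TRANS[k][l] * EMIT[l][x] * b[0][l] for l in STATES)
                     for k in STATES})
    p_x = sum(f[-1].values())
    return [f[i]["L"] * b[i]["L"] / p_x for i in range(len(rolls))]

def decode(rolls):
    """Approach 1 for two states: pick 'L' wherever its posterior exceeds 0.5."""
    return "".join("L" if g > 0.5 else "F" for g in posterior_loaded(rolls))
```

Unlike the Viterbi path, the decoded string need not be a single most probable path; it is the per-position best guess.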


Use of Posterior Decoding

[Figure: shaded areas show when the roll was generated by the loaded die]


Parameter Estimation for HMMs

• Model specification
• Structure design: what states there are and how they are connected
• Assignment of parameter values: transition probabilities $a_{kl}$ and emission probabilities $e_k(b)$
• Estimation framework
• Training sequences $x^1, \ldots, x^n$
• Work in log space:

$$l(x^1, \ldots, x^n) = \log P(x^1, \ldots, x^n \mid \theta) = \sum_{j=1}^{n} \log P(x^j \mid \theta)$$


Estimation When the State Sequence is Known

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

$A_{kl}$ = number of transitions k to l in training data + $r_{kl}$

$E_k(b)$ = number of emissions of b from k in training data + $r_k(b)$

Both $r_{kl}$ and $r_k(b)$ are predetermined pseudocounts to prevent overfitting
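When the paths are known, estimation is just counting. The sketch below uses a single pseudocount value for every transition and emission, which is a simplifying assumption (the slides allow separate values $r_{kl}$ and $r_k(b)$); the function name `estimate` and its arguments are illustrative.

```python
from collections import Counter

def estimate(paths, sequences, states, alphabet, pseudocount=1.0):
    """Estimate a_kl and e_k(b) from labeled training data by normalized counts."""
    A, E = Counter(), Counter()
    for path, seq in zip(paths, sequences):
        A.update(zip(path, path[1:]))     # transitions k -> l
        E.update(zip(path, seq))          # emissions of symbol b from state k
    a, e = {}, {}
    for k in states:
        row = sum(A[k, l] + pseudocount for l in states)
        a[k] = {l: (A[k, l] + pseudocount) / row for l in states}
        tot = sum(E[k, b] + pseudocount for b in alphabet)
        e[k] = {b: (E[k, b] + pseudocount) / tot for b in alphabet}
    return a, e

# Tiny labeled example: path "FFLL" emitted the rolls 1, 2, 6, 6.
a, e = estimate(["FFLL"], [[1, 2, 6, 6]], "FL", range(1, 7))
```

With pseudocount 1, the loaded state in this toy example gets $e_L(6) = (2+1)/(2+6) = 3/8$, and unseen transitions or emissions keep a small nonzero probability.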


Estimation When Paths Are Unknown

• Baum (1971)
• Calculate $A_{kl}$ and $E_k(b)$ as the expected number of times each transition or emission is used, given the training sequences
• Subject to local maxima
• Depends on the starting values of the parameters
• The probability that $a_{kl}$ is used at position i in sequence x is:

$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$$


Expected Transition and Emission Counts

• The expected number of times that $a_{kl}$ is used can be obtained by summing over all positions and over all training sequences:

$$A_{kl} = \sum_{j=1}^{n} \frac{1}{P(x^j)} \sum_i f^j_k(i)\, a_{kl}\, e_l(x^j_{i+1})\, b^j_l(i+1)$$

• The expected number of times that letter b appears in state k:

$$E_k(b) = \sum_{j=1}^{n} \frac{1}{P(x^j)} \sum_{\{i \,\mid\, x^j_i = b\}} f^j_k(i)\, b^j_k(i)$$


Baum-Welch Training (EM algorithm)

Initialization: Pick arbitrary model parameters

Recurrence:
• Set all the A and E variables to their pseudocount values r (or to zero)
• For each sequence j = 1 … n:
  • Calculate $f_k(i)$ for sequence j using the forward algorithm
  • Calculate $b_k(i)$ for sequence j using the backward algorithm
  • Add the contribution of sequence j to A and E
• Calculate the new model parameters
• Calculate the new log likelihood of the model

Termination: Stop if the change in log likelihood is less than some predefined threshold or the maximum number of iterations is exceeded
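The loop above can be sketched compactly. Several choices here are assumptions of this sketch rather than part of the slides: the start distribution is held fixed, no end state is modeled, unscaled forward and backward variables are used (so only short sequences are safe), and the log likelihood recorded in each iteration is the one under the parameters entering that iteration. All function and argument names are illustrative.

```python
from math import log

def forward(x, start, a, e, S):
    f = [{k: start[k] * e[k][x[0]] for k in S}]
    for sym in x[1:]:
        f.append({l: e[l][sym] * sum(f[-1][k] * a[k][l] for k in S) for l in S})
    return f

def backward(x, a, e, S):
    b = [{k: 1.0 for k in S}]
    for sym in reversed(x[1:]):
        b.insert(0, {k: sum(a[k][l] * e[l][sym] * b[0][l] for l in S) for k in S})
    return b

def baum_welch(seqs, S, alphabet, start, a, e, r=0.01, tol=1e-6, max_iter=200):
    """Return re-estimated (a, e) and the log likelihood seen in each iteration."""
    old, history = None, []
    for _ in range(max_iter):
        # Start every count at the pseudocount r.
        A = {k: {l: r for l in S} for k in S}
        E = {k: {c: r for c in alphabet} for k in S}
        loglik = 0.0
        for x in seqs:
            f, b = forward(x, start, a, e, S), backward(x, a, e, S)
            px = sum(f[-1].values())
            loglik += log(px)
            for i, sym in enumerate(x):
                for k in S:
                    E[k][sym] += f[i][k] * b[i][k] / px
                    if i + 1 < len(x):
                        for l in S:
                            A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l] / px
        # New parameters from the expected counts.
        a = {k: {l: A[k][l] / sum(A[k].values()) for l in S} for k in S}
        e = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in S}
        history.append(loglik)
        if old is not None and abs(loglik - old) < tol:
            break
        old = loglik
    return a, e, history
```

With r = 0 this is plain EM, so the recorded log likelihood should not decrease from one iteration to the next; with positive pseudocounts small decreases are possible.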


Modeling of Labeled Sequences

• HMMs can be used to predict the labeling of unannotated sequences
• Training for HMMs
• Separately train the model for CpG islands and the model for non-CpG islands
• Combine them into a larger HMM
• Tedious, especially if there are more than two classes involved
• It would be nice to estimate everything at once
• Training set includes all classes (e.g., CpG islands and non-CpG islands)
• Each sequence is labeled with the corresponding classes
• Let $y = y_1, \ldots, y_L$ be the labels on the observation $x = x_1, \ldots, x_L$


Modeling of Labeled Sequences (cont.)

• The model can be estimated with a slight modification of the Baum-Welch algorithm
• Allow only valid paths through the model
• A valid path is one where the state labels and sequence labels are the same, i.e., $\pi_i$ has label $y_i$
• During the forward and backward algorithms this corresponds to setting $f_l(i) = 0$ and $b_l(i) = 0$ for all the states l with a label different from $y_i$


Discriminative Estimation

• When modeling labeled sequences, the following likelihood is maximized:

$$\theta^{ML} = \arg\max_{\theta} P(x, y \mid \theta)$$

• When obtaining a good prediction of y is the primary interest, it is preferable to maximize the conditional likelihood instead:

$$\theta^{CML} = \arg\max_{\theta} P(y \mid x, \theta)$$

• $P(y \mid x, \theta)$ can be rewritten as

$$P(y \mid x, \theta) = \frac{P(x, y \mid \theta)}{P(x \mid \theta)}$$

• Numerator: probability calculated by the forward algorithm for the labeled sequences
• Denominator: probability calculated by the forward algorithm disregarding all the labels


HMM Model Structure

• Choice of model topology
• Training a fully connected model is prone to local maxima
• In practice, successful HMMs are constructed by carefully deciding which transitions are allowed in the model based on knowledge about the problem under investigation
• Duration modeling
• With a single self-looping state, the probability decays exponentially with length (geometric distribution): $P(L) = (1-p)\,p^{L-1}$, where p is the self-transition probability and 1-p is the probability of leaving the state
• To model a more complex length distribution, introduce several states with the same distribution over residues and transitions between each other
• E.g., a chain of n such states gives a negative binomial distribution:

[Figure: chain of states, each with self-transition probability p and probability 1-p of moving to the next state]

$$P(L) = \binom{L-1}{n-1}\, p^{L-n} (1-p)^{n}$$
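The two length distributions can be compared numerically. This is a small sketch; the function names are illustrative, and the negative binomial form assumes a chain of n states that all share the same self-transition probability p.

```python
from math import comb

def geometric_length(L, p):
    """P(L) = (1 - p) * p**(L - 1): one state with self-transition probability p."""
    return (1 - p) * p ** (L - 1)

def negative_binomial_length(L, n, p):
    """P(L) = C(L-1, n-1) * p**(L-n) * (1-p)**n: n chained states with self-loop p."""
    if L < n:
        return 0.0
    return comb(L - 1, n - 1) * p ** (L - n) * (1 - p) ** n
```

For n = 1 the second function reduces to the geometric case; for larger n the mode moves away from the shortest length, which is the point of chaining several states.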


Numerical Stability of HMM Algorithms

• Probabilities become too small when many of them are multiplied in the Viterbi, forward and backward algorithms
• Consequences
• Underflow error
• The program would crash
• The program would keep running and produce arbitrary wrong numbers


Improving Numerical Stability

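• Log transform

$$V_l(i+1) = \tilde{e}_l(x_{i+1}) + \max_k \big( V_k(i) + \tilde{a}_{kl} \big)$$

where $V_k(i) = \log v_k(i)$ and, in this equation, a tilde denotes the logarithm of the corresponding probability.

• Scaling of probabilities
• For each i define a scaling variable $s_i$ and the scaled forward and backward variables

$$\tilde{f}_l(i) = \frac{f_l(i)}{\prod_{j=1}^{i} s_j}$$

$$\tilde{f}_l(i+1) = \frac{1}{s_{i+1}}\, e_l(x_{i+1}) \sum_k \tilde{f}_k(i)\, a_{kl}$$

$$\tilde{b}_k(i) = \frac{1}{s_i} \sum_l a_{kl}\, \tilde{b}_l(i+1)\, e_l(x_{i+1})$$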
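The effect of scaling can be seen on the casino model. In this sketch $s_i$ is chosen so that the scaled forward variables sum to 1 at every position, which is one convenient choice and an assumption here (the slides leave $s_i$ open); with that choice $\log P(x) = \sum_i \log s_i$. The uniform start distribution, the absence of an end state, and the function names are also assumptions.

```python
from math import log

STATES = ("F", "L")                       # fair die, loaded die
START = {"F": 0.5, "L": 0.5}              # assumption: the slides do not give a_0k
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in range(1, 7)},
        "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def naive_forward(rolls):
    """Unscaled forward algorithm; underflows to 0.0 on long sequences."""
    f = {k: START[k] * EMIT[k][rolls[0]] for k in STATES}
    for x in rolls[1:]:
        f = {l: EMIT[l][x] * sum(f[k] * TRANS[k][l] for k in STATES) for l in STATES}
    return sum(f.values())

def scaled_forward_loglik(rolls):
    """log P(x) from the scaled recursion, with s_i normalizing each column to sum 1."""
    f = {k: START[k] * EMIT[k][rolls[0]] for k in STATES}
    loglik = 0.0
    for i in range(len(rolls)):
        if i > 0:
            f = {l: EMIT[l][rolls[i]] * sum(f[k] * TRANS[k][l] for k in STATES)
                 for l in STATES}
        s = sum(f.values())               # scaling variable s_i
        f = {k: v / s for k, v in f.items()}
        loglik += log(s)
    return loglik
```

On a few rolls both functions agree (after exponentiating the log likelihood); on several thousand rolls the unscaled product collapses to 0.0 while the scaled version still returns a finite log likelihood.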