Chapter 4.1 HMM

Upload: farrukhsharifzada
Posted 23-Feb-2018

TRANSCRIPT

7/24/2019 Chapter4.1 HMM

Slide 1

Prof. Yechiam Yemini (YY)
Computer Science Department, Columbia University

Chapter 4: Hidden Markov Models
4.1 Introduction to HMM

Slide 2: Overview

- Markov models of sequence structures
- Introduction to Hidden Markov Models (HMM)
- HMM algorithms; the Viterbi decoder

Reading: Durbin, chapters 3-5


Slide 3: The Challenges

- Biological sequences have modular structure
  - Genes → exons, introns
  - Promoter regions → modules, promoters
  - Proteins → domains, folds, structural parts, active parts
- How do we identify informative regions?
  - How do we find & map genes?
  - How do we find & map promoter regions?

Slide 4: Mapping Protein Regions

[Figure: a protein structure annotated with residue labels (garbled in this transcript), marking the ATP binding domain, the activation domain, a binding site, active components, and interfaces.]


Slide 5: Statistical Sequence Analysis

- Example: CpG islands indicate important regions
  - CG (denoted CpG) is typically transformed by methylation into TG
  - Promoter/start regions of genes suppress methylation
  - This leads to higher CpG density
  - How do we find CpG islands?
- Example: active protein regions are statistically similar
  - Evolution conserves structural motifs but varies sequences
- Simple comparison techniques are insufficient
  - Global/local alignment
  - Consensus sequence
- The challenge: analyzing statistical features of regions

Slide 6: Review of Markovian Modeling

- Recall: a Markov chain is described by transition probabilities
  - π(n+1) = π(n)A, where π(i,n) = Prob{S(n)=i} is the state probability
  - A(i,j) = Prob{S(n+1)=j | S(n)=i} is the transition probability
- Markov chains describe statistical evolution
  - In time: evolutionary change depends on the previous state only
  - In space: change depends on neighboring sites only
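The recursion π(n+1) = π(n)A can be sketched in a few lines of Python. This is an illustrative snippet, not from the slides; the 0.6/0.4 and 0.8/0.2 entries are borrowed from the coin model that appears on later slides.

```python
# Transition matrix A[i][j] = Prob{S(n+1)=j | S(n)=i} for a two-state chain.
A = [[0.6, 0.4],
     [0.8, 0.2]]

def step(pi, A):
    """One step of pi(n+1) = pi(n) A (row-vector convention)."""
    n = len(A)
    return [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]

pi = [1.0, 0.0]          # start in state 0 with probability 1
for _ in range(3):
    pi = step(pi, A)
print(pi)                # state probabilities after 3 steps
```

Each step only consults the current distribution, which is exactly the "depends on the previous state only" property.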


Slide 7: From Markov to Hidden Markov Models (HMM)

- Nature uses different statistics for evolving different regions
  - Gene regions: CpG, promoters, introns/exons
  - Protein regions: active, interfaces, hydrophobic/hydrophilic
- How can we tell regions apart?
  - Sample sequences have different statistics
  - Model regions as Markovian states emitting observed sequences
- Example: CpG islands
  - Model: two connected Markov chains, one for CpG and one for normal sequence
  - The Markov chain is hidden; only sample sequences are seen
  - Detect transitions to/from the CpG chain
  - Similar to a dishonest casino: transitions from fair to biased dice

Slide 8: Hidden Markov Models

- HMM basics
  - A Markov chain: states & transition probabilities A=[a(i,j)]
  - Observable symbols for each state O(i)
  - A probability e(i,X) of emitting symbol X at state i

[Figure: two-state chain F ⇄ B with a(F,F)=.6, a(F,B)=.4, a(B,B)=.2, a(B,F)=.8; emissions e(F,H)=0.5, e(F,T)=0.5, e(B,H)=0.9, e(B,T)=0.1.]
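To make the "hidden" part concrete, here is a small sketch (my own code, not the slide's) that samples from the fair/biased coin HMM: the state path is generated but never shown to an observer, who sees only the H/T symbols.

```python
import random

random.seed(0)  # for reproducibility of this illustration

# Fair/biased coin HMM from the slide: transitions a(i,j), emissions e(i,X).
a = {('F', 'F'): 0.6, ('F', 'B'): 0.4,
     ('B', 'B'): 0.2, ('B', 'F'): 0.8}
e = {('F', 'H'): 0.5, ('F', 'T'): 0.5,
     ('B', 'H'): 0.9, ('B', 'T'): 0.1}

def sample(n, start='F'):
    """Generate a hidden state path and the symbols it emits."""
    state, path, symbols = start, [], []
    for _ in range(n):
        path.append(state)
        symbols.append('H' if random.random() < e[(state, 'H')] else 'T')
        state = 'F' if random.random() < a[(state, 'F')] else 'B'
    return ''.join(path), ''.join(symbols)

path, obs = sample(10)
print(path)   # the hidden path, e.g. a mix of F and B runs
print(obs)    # the observed sequence: all the decoder ever sees
```

Recovering `path` from `obs` alone is exactly the decoding problem treated later in the chapter.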


Slide 9: Coin Example

- Two-state Markov chain {F,B}: F = fair coin, B = biased coin
- Emission probabilities
  - Described in state boxes
  - Or through emission boxes
- Example: transmembrane proteins
  - Hydrophilic/hydrophobic regions

[Figure: the F/B chain drawn two ways: with emissions e(F,H)=0.5, e(F,T)=0.5, e(B,H)=0.9, e(B,T)=0.1 written inside the state boxes, and with separate emission boxes H/T below the states (F: .5/.5; B: .9/.1); transitions .6, .4, .2, .8 as before.]

Slide 10: HMM Profile Example (Non-gapped)

Aligned sequences:
TAGTAT  TAATAT  TAATAT  TATAAT  TAATAT
TATTAT  GAATAT  GAATCT  TAGCAT  TATTAT

Counts per site:
     1   2   3   4   5   6
A    0  10   5   1   9   0
C    0   0   0   1   1   0
G    2   0   2   0   0   0
T    8   0   3   8   0  10

[Figure: "a state per site" profile HMM, S → s1 → … → s6 → E with all transition probabilities 1., and per-site emissions: s1: G=.2, T=.8; s2: A=1.; s3: A=.5, G=.2, T=.3; s4: A=.1, C=.1, T=.8; s5: A=.9, C=.1; s6: T=1.]
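Since every transition in this non-gapped profile has probability 1, the probability of a sequence under the profile is just the product of per-site emissions. A minimal sketch (the `profile` list below is read off the slide's per-site emission boxes; the function name is mine):

```python
# Per-site emission probabilities of the non-gapped profile (absent symbols = 0).
profile = [
    {'G': 0.2, 'T': 0.8},
    {'A': 1.0},
    {'A': 0.5, 'G': 0.2, 'T': 0.3},
    {'A': 0.1, 'C': 0.1, 'T': 0.8},
    {'A': 0.9, 'C': 0.1},
    {'T': 1.0},
]

def profile_prob(seq):
    """P(seq | profile): product of per-site emissions; transitions are all 1."""
    p = 1.0
    for site, x in zip(profile, seq):
        p *= site.get(x, 0.0)
    return p

print(profile_prob('TAATAT'))   # 0.8 * 1 * 0.5 * 0.8 * 0.9 * 1 = 0.288
print(profile_prob('TACTAT'))   # 0.0: C never emitted at site 3
```

A symbol never seen in the training alignment gets probability zero, which is why practical profiles add pseudocounts.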


Slide 11: How Do We Model Gaps?

- A gap can result from deletion or insertion
  - Deletion → hidden delete state
  - Insertion → hidden insert state

[Figure: two gapped alignments with their profile HMMs (partly garbled in this transcript). One alignment has '-' at site 5 in most sequences (TAGTAT, TAATCT, TAAT-T, TATT-T, TAAT-T, TATT-T, GAAT-T, GAAT-T, TAGC-T, TATT-T) and is modeled with a hidden delete state emitting '-' with probability 1. The other (TAGTAT, TAATCT, TAATAT, TATTAT, TAATGT, TATT-T, GAATAT, GAAT-T, TAGC-T, TATTCT) is modeled with a hidden insert state, with emissions A=.6, C=.2, G=.2, T=.0 in one version and uniform A=C=G=T=.25 in another. Branch probabilities .7/.3 and .8/.2 select whether the delete/insert state is entered; site emissions are as in the non-gapped profile.]

Slide 12: Profile HMM

ACA - - - ATG
TCA ACT ATC
ACA C - - AGC
AGA - - - ATC
ACC G - - ATC

[Figure: the profile HMM built from this alignment, with match states (output probabilities), an insertion state, and transition probabilities.]

- Profile alignment
  - E.g., what is the most likely path to generate ACATATC?
  - How likely is ACATATC to be generated by this profile?


Slide 13: In General: HMM Sequence Profile

[Figure: the general profile-HMM topology from start S to end E: a chain of match states M, with delete states D above and insert states i below.]

Slide 14: HMM For CpG Islands

[Figure: two four-state chains, A+/C+/G+/T+ (CpG generator) and A-/C-/G-/T- (regular sequence), with transitions between the two models.]

Transition probabilities within the '+' (CpG) model (row = current base, column = next base):

+      A      C      G      T
A   0.180  0.274  0.426  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.375  0.125
T   0.079  0.355  0.384  0.182

Within the '-' (regular) model:

-      A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292
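One simple use of these two tables (a sketch of my own, not on the slide) is a log-odds score: sum log2 of the ratio of '+' to '-' transition probabilities over a window; a positive score suggests the window is CpG-like.

```python
import math

bases = 'ACGT'
# Within-model transition tables from the slide: row = current, col = next.
plus = [[0.180, 0.274, 0.426, 0.120],
        [0.171, 0.368, 0.274, 0.188],
        [0.161, 0.339, 0.375, 0.125],
        [0.079, 0.355, 0.384, 0.182]]
minus = [[0.300, 0.205, 0.285, 0.210],
         [0.322, 0.298, 0.078, 0.302],
         [0.248, 0.246, 0.298, 0.208],
         [0.177, 0.239, 0.292, 0.292]]

def log_odds(seq):
    """log2 P(seq | '+') / P(seq | '-'), summed over transitions."""
    score = 0.0
    for x, y in zip(seq, seq[1:]):
        i, j = bases.index(x), bases.index(y)
        score += math.log2(plus[i][j] / minus[i][j])
    return score

print(log_odds('CGCG'))   # positive: CpG-like
print(log_odds('TATA'))   # negative: background-like
```

This scores a whole window under one model at a time; finding the island boundaries is what the combined hidden-state model and the Viterbi decoder below are for.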


Slide 15: Modeling Gene Structure With HMM

- Genes are organized into sequential functional regions
- Regions have distinct statistical behaviors

[Figure: gene structure diagram; labeled splice site.]

Slide 16: HMM Gene Models

- HMM state → region; Markov transitions between regions
- Emissions {A,C,T,G}; regions have different emission probabilities

[Figure: gene-model HMM; an example region's emissions: A=.29, C=.31, G=.04, T=.36.]


Slide 17: Computing Probabilities on HMM

- Path = a sequence of states
  - E.g., X = FFBBBF
  - Path probability: 0.5 · 0.6 · 0.4 · (0.2)² · 0.8 = 3.84 × 10⁻³
    (start 0.5, then one F→F, one F→B, two B→B, one B→F)
- Probability of a sequence emitted along a path: p(S|X)
  - E.g., p(HHHHHH|FFBBBF) = p(H|F)p(H|F)p(H|B)p(H|B)p(H|B)p(H|F)
    = (0.5)³(0.9)³ ≈ 0.09
- Note: in practice one avoids multiplications and computes with logarithms to minimize error propagation

[Figure: the F/B coin HMM with a Start state entering F or B with probability .5 each; transitions and emissions as before.]
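The two quantities on this slide can be sketched directly (illustrative code; function names are mine):

```python
# Coin HMM from the slide: Start enters F or B with probability 0.5 each.
a = {('F', 'F'): 0.6, ('F', 'B'): 0.4, ('B', 'B'): 0.2, ('B', 'F'): 0.8}
e = {('F', 'H'): 0.5, ('F', 'T'): 0.5, ('B', 'H'): 0.9, ('B', 'T'): 0.1}

def path_prob(X):
    """Probability of the state path X, including the 0.5 start."""
    p = 0.5
    for i, j in zip(X, X[1:]):
        p *= a[(i, j)]
    return p

def emission_prob(S, X):
    """p(S|X): product of per-step emission probabilities along X."""
    p = 1.0
    for i, x in zip(X, S):
        p *= e[(i, x)]
    return p

print(path_prob('FFBBBF'))                 # 0.5*0.6*0.4*0.2*0.2*0.8 = 0.00384
print(emission_prob('HHHHHH', 'FFBBBF'))   # (0.5)**3 * (0.9)**3 = 0.0911...
```

The joint probability of path and sequence is the product of the two, which is what the Viterbi decoder maximizes.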

Slide 18: The Three Computational Problems of HMM

- Decoding: what is the most likely sequence of transitions & emissions that generated a given observed sequence?
- Likelihood: how likely is an observed sequence to have been generated by a given HMM?
- Learning: how should transition and emission probabilities be learned from observed sequences?


Slide 19: The Decoding Problem: Viterbi's Decoder

- Input: an observed sequence S
- Output: a hidden path X maximizing the joint probability P(X,S) (equivalently, the most probable path given S)
- Key idea (Viterbi): map to a dynamic programming problem
  - Describe the problem as optimizing a path over a grid
  - DP search: (a) compute the price of forward paths, (b) backtrack
  - Complexity: O(m²n) (m = number of states, n = sequence size)

[Figure: DP grid with states A1…A6 on the vertical axis and the observed sequence s1 s2 s3 … on the horizontal axis; a state sequence is a path through the grid.]

Slide 20: Viterbi's Decoder

- F(i,k) = probability of the most likely path ending in state i and generating S1…Sk
- Forward recursion:
  F(i,k+1) = e(i,S(k+1)) · max_j { F(j,k) a(j,i) }
- Backtracking: start with the highest F(i,n) and backtrack
- Initialization: F(0,0)=1, F(i,0)=0

[Figure: columns k and k+1 of the grid; the best path to state i at step k+1 extends the best path to some state j at step k via a(j,i), e.g. F(6,k)a(6,i).]
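The recursion plus backtracking fits in a short function. This is a minimal sketch (names and the dictionary parameterization are mine), computed in log space as the computational note later in the chapter recommends:

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Most likely state path for obs, via the forward recursion + backtracking."""
    # f[i] = best log-probability of any path ending in state i so far
    f = {i: math.log(start[i]) + math.log(emit[i][obs[0]]) for i in states}
    backptr = []
    for x in obs[1:]:
        prev, f, ptr = f, {}, {}
        for i in states:
            best = max(states, key=lambda j: prev[j] + math.log(trans[j][i]))
            f[i] = prev[best] + math.log(trans[best][i]) + math.log(emit[i][x])
            ptr[i] = best
        backptr.append(ptr)
    # Backtrack from the best final state
    state = max(states, key=lambda i: f[i])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return ''.join(reversed(path))

start = {'F': 0.5, 'B': 0.5}
trans = {'F': {'F': 0.6, 'B': 0.4}, 'B': {'F': 0.8, 'B': 0.2}}
emit = {'F': {'H': 0.5, 'T': 0.5}, 'B': {'H': 0.9, 'T': 0.1}}
best = viterbi('HHHHHH', 'FB', start, trans, emit)
print(best)
```

Each observation costs a max over all state pairs, giving the O(m²n) complexity stated on the previous slide.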


Slide 21: Example: Dishonest Coin Tossing

- What is the most likely sequence of transitions & emissions to explain the observation S = HHHHHH?

[Figure: the coin HMM (Start → F/B with probability .5 each; transitions .6, .4, .2, .8; emissions e(F,H)=0.5, e(F,T)=0.5, e(B,H)=0.9, e(B,T)=0.1) unrolled into a Viterbi trellis over rows F and B for the six H observations. Per-step transition-times-emission weights: F→F: .6·.5=.3, B→F: .8·.5=.4, F→B: .4·.9=.36, B→B: .2·.9=.18. Trellis values as printed on the slide (rows B, F): k=1: .45, .25; k=2: .09, .18; k=3: .065, .054; k=4: .019, .026; k=5: .006, .008; k=6: .0028, .0023. Highlighted path weight: w = .5·(.9)·(.4)²·(.3)·(.36)².]

Slide 22: Example: CpG Islands

- Given the observed sequence CGCG, what is the likely state sequence generating it?

[Figure: the eight-state CpG HMM (A+, C+, G+, T+ / A-, C-, G-, T-) with a Start state, unrolled into a Viterbi grid: rows T+, C+, G+, A+, T-, C-, G-, A-; columns 1-4 for the observations C G C G.]
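A sketch of this decoding is below. The slide gives the within-model transition tables (slide 14) but not the probability of switching between the '+' and '-' chains or the start distribution, so a small switch probability `eps` and a uniform 1/8 start are assumed here purely for illustration; each state X± emits its own base with probability 1.

```python
import math

bases = 'ACGT'
tables = {  # within-model transitions from slide 14 (row = current, col = next)
    '+': [[0.180, 0.274, 0.426, 0.120],
          [0.171, 0.368, 0.274, 0.188],
          [0.161, 0.339, 0.375, 0.125],
          [0.079, 0.355, 0.384, 0.182]],
    '-': [[0.300, 0.205, 0.285, 0.210],
          [0.322, 0.298, 0.078, 0.302],
          [0.248, 0.246, 0.298, 0.208],
          [0.177, 0.239, 0.292, 0.292]],
}
eps = 0.05  # assumed probability of switching between the + and - chains

states = [b + s for s in '+-' for b in bases]   # A+ C+ G+ T+ A- C- G- T-

def trans(u, v):
    """P(next state v | current state u)."""
    (b, s), (c, t) = u, v
    p = tables[t][bases.index(b)][bases.index(c)]
    return (1 - eps) * p if s == t else eps * p

def decode(seq):
    """Viterbi path over the 8 states; state Xs emits base X with probability 1."""
    f = {u: (math.log(0.125) if u[0] == seq[0] else -math.inf) for u in states}
    back = []
    for x in seq[1:]:
        prev, f, ptr = f, {}, {}
        for v in states:
            if v[0] != x:          # wrong base: emission probability 0
                f[v] = -math.inf
                continue
            u = max(states, key=lambda u: prev[u] + math.log(trans(u, v)))
            f[v] = prev[u] + math.log(trans(u, v))   # emission contributes log 1
            ptr[v] = u
        back.append(ptr)
    v = max(states, key=lambda s: f[s])
    path = [v]
    for ptr in reversed(back):
        v = ptr[v]
        path.append(v)
    return list(reversed(path))

print(decode('CGCG'))
```

With these tables the '+' chain strongly favors C→G transitions (0.274 vs 0.078), so CGCG decodes to the CpG states.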


Slide 23: Computational Note

- Computing probability products propagates errors
- Instead of multiplying probabilities, add log-likelihoods
- Define f(i,k) = log F(i,k); the recursion becomes
  f(i,k+1) = log e(i,S(k+1)) + max_j { f(j,k) + log a(j,i) }
- Or define the weight w(i,j,k) = log e(i,S(k+1)) + log a(j,i) to get the standard DP formulation
  f(i,k+1) = max_j { f(j,k) + w(i,j,k) }
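The numerical motivation is easy to demonstrate (my illustration, not the slide's): a product of a few hundred small probabilities underflows double precision to exactly zero, while the sum of their logs stays accurate.

```python
import math

# Multiplying many small probabilities underflows; summing logs does not.
p = 0.1
prod = 1.0
log_sum = 0.0
for _ in range(400):
    prod *= p            # 0.1**400 = 1e-400, far below the double range
    log_sum += math.log(p)

print(prod)      # 0.0 (underflowed)
print(log_sum)   # 400 * log(0.1), accurate to machine precision
```

Since log is monotone, maximizing the sum of logs selects the same path as maximizing the product, so nothing is lost by the change of scale.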

Slide 24: Example

- What is the most likely sequence of transitions & emissions to explain the observation S = HHHHHH? (using base-2 logs)

Log emissions: lg e(F,H) = -1, lg e(F,T) = -1, lg e(B,H) = -0.15, lg e(B,T) = -3.32
Log transitions: lg a(F,F) = -0.74, lg a(F,B) = -1.32, lg a(B,B) = -2.32, lg a(B,F) = -0.32
Log start: -1 into each of F and B

Per-step weights w(j→i, H) = lg a(j,i) + lg e(i,H):

from\to     F       B
  S       -2.00   -1.15
  F       -1.74   -1.47
  B       -1.32   -2.47

First columns of the log trellis, f(i,k+1) = max_j { f(j,k) + w(i,j,k) }:
k=1: f(F) = -2, f(B) = -1.15
k=2: f(F) = -2.47, f(B) = -3.47

[Figure: the trellis over rows F and B for the observations H H H H H H, with these edge weights, e.g. F→F: -.74-1 = -1.74, F→B: -1.32-.15 = -1.47, B→B: -2.32-.15 = -2.47, B→F: -.32-1 = -1.32.]


Slide 25: Concluding Notes

- Viterbi decoding: the hidden pathway of an observed sequence
- The hidden pathway explains the underlying structure
  - E.g., identify CpG islands
  - E.g., align a sequence against a profile
  - E.g., determine gene structure
  - ...
- This leaves the two other HMM computational problems:
  - How do we extract an HMM model from observed sequences?
  - How do we compute the likelihood of a given sequence?