Master’s course Bioinformatics Data Analysis and Tools Lecture 12: (Hidden) Markov models Centre for Integrative Bioinformatics


Page 1: Master’s course Bioinformatics Data Analysis and Tools

Master’s course

Bioinformatics Data Analysis and Tools

Lecture 12: (Hidden) Markov models

Centre for Integrative Bioinformatics

Page 2: Master’s course Bioinformatics Data Analysis and Tools

Problem in biology

• Data and patterns are often not clear-cut

• When we want to make a method to recognise a pattern (e.g. a sequence motif), we have to learn from the data (e.g. maybe there are other differences between sequences that have the pattern and those that do not)

• This leads to data mining and machine learning

Page 3: Master’s course Bioinformatics Data Analysis and Tools

Contents:

• Markov chain models (1st order, higher order and inhomogeneous models; parameter estimation; classification)

• Interpolated Markov models (and back-off models)

• Hidden Markov models (forward, backward and Baum-Welch algorithms; model topologies; applications to gene finding and protein family modelling)

A widely used machine learning approach: Markov models

Page 4: Master’s course Bioinformatics Data Analysis and Tools
Page 5: Master’s course Bioinformatics Data Analysis and Tools

Markov Chain Models

• A Markov chain model is defined by:

– a set of states
• some states emit symbols
• other states (e.g. the begin state) are silent

– a set of transitions with associated probabilities
• the transitions emanating from a given state define a distribution over the possible next states

Page 6: Master’s course Bioinformatics Data Analysis and Tools

Preamble: Hidden Markov Chain Models

1. x1,…,xn – the sequence of observations

2. Q – states; π1,…,πn – the hidden sequence of states

π = (π1,…,πN)T – the initial probabilities of the states

3. A = (aij) – the transition matrix

4. E = (ei(x)) – the emission probabilities

Page 7: Master’s course Bioinformatics Data Analysis and Tools

A Simple HMM

Page 8: Master’s course Bioinformatics Data Analysis and Tools

Markov Chain Models

• Given some sequence x of length L, we can ask how probable the sequence is given our model

• For any probabilistic model of sequences, we can write this probability as

Pr(x) = Pr(x_L | x_{L-1}, …, x_1) · Pr(x_{L-1} | x_{L-2}, …, x_1) · … · Pr(x_1)

• Key property of a (1st order) Markov chain: the probability of each x_i depends only on x_{i-1}, so this simplifies to

Pr(x) = Pr(x_1) · ∏_{i=2..L} Pr(x_i | x_{i-1})

Page 9: Master’s course Bioinformatics Data Analysis and Tools

Markov Chain Models

Pr(cggt) = Pr(c)Pr(g|c)Pr(g|g)Pr(t|g)
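As a small illustration of this computation (a sketch only: the initial and transition probabilities below are made-up placeholders, not estimates from real data):

init = {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25}   # Pr(x1)
trans = {                                              # Pr(xi | xi-1), illustrative values
    'a': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
    'c': {'a': 0.20, 'c': 0.30, 'g': 0.30, 't': 0.20},
    'g': {'a': 0.20, 'c': 0.20, 'g': 0.30, 't': 0.30},
    't': {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25},
}

def chain_prob(x):
    """Pr(x) under a 1st-order Markov chain: Pr(x1) * prod of Pr(xi | xi-1)."""
    p = init[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= trans[prev][cur]
    return p

print(chain_prob('cggt'))   # Pr(c) * Pr(g|c) * Pr(g|g) * Pr(t|g)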

Page 10: Master’s course Bioinformatics Data Analysis and Tools

Markov Chain Models

Can also have an end state, allowing the model to represent:

• sequences of different lengths

• preferences for sequences ending with particular symbols

Page 11: Master’s course Bioinformatics Data Analysis and Tools

Markov Chain Models

The transition parameters can be denoted by a_{x_{i-1} x_i}, where

a_{x_{i-1} x_i} = Pr(x_i | x_{i-1})

Similarly, we can denote the probability of a sequence x as

Pr(x) = a_{B x_1} · ∏_{i=2..L} a_{x_{i-1} x_i}

where a_{B x_1} represents the transition from the begin state.

Page 12: Master’s course Bioinformatics Data Analysis and Tools

Example Application: CpG islands

– CG dinucleotides are rarer in eukaryotic genomes than expected given the independent probabilities of C and G

– In the human genome, CpG (CG) is the least frequent dinucleotide, because the C in CpG is easily methylated and then tends to mutate into T

– Methylation is suppressed around genes in a genome: CpG appears more frequently within these regions, called CpG islands

– In particular, the regions upstream of genes are richer in CG dinucleotides than elsewhere – CpG islands

– Identifying the CpG islands in a genome is important: they are useful evidence for finding genes

• We could predict CpG islands with two Markov chains:
– one to represent CpG islands
– one to represent the rest of the genome

An example approach uses maximum likelihood and Bayesian statistics on such data and feeds the result into an HMM.

Page 13: Master’s course Bioinformatics Data Analysis and Tools

Estimating the Model Parameters

• Given some data (e.g. a set of sequences from CpG islands), how can we determine the probability parameters of our model?

• One approach: maximum likelihood estimation
– given a set of data D
– set the parameters θ to maximize Pr(D | θ)
– i.e. make the data D look likely under the model

Page 14: Master’s course Bioinformatics Data Analysis and Tools

Maximum Likelihood Estimation

• Suppose we want to estimate the parameters Pr(a), Pr(c), Pr(g), Pr(t)

• And we’re given the sequences:
accgcgctta
gcttagtga
ctagccgttac

• Then the maximum likelihood estimates are:

Pr(a) = 6/30 = 0.2    Pr(g) = 7/30 = 0.233
Pr(c) = 9/30 = 0.3    Pr(t) = 8/30 = 0.267
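These estimates are just symbol counts divided by the total number of symbols; a quick check in Python using the three training sequences above:

from collections import Counter

seqs = ['accgcgctta', 'gcttagtga', 'ctagccgttac']
counts = Counter(''.join(seqs))              # symbol counts over all sequences
total = sum(counts.values())                 # 30 symbols in total
for sym in 'acgt':
    print(sym, counts[sym], '/', total, '=', round(counts[sym] / total, 3))
# a 6 / 30 = 0.2   c 9 / 30 = 0.3   g 7 / 30 = 0.233   t 8 / 30 = 0.267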

Page 15: Master’s course Bioinformatics Data Analysis and Tools
Page 16: Master’s course Bioinformatics Data Analysis and Tools
Page 17: Master’s course Bioinformatics Data Analysis and Tools
Page 18: Master’s course Bioinformatics Data Analysis and Tools
Page 19: Master’s course Bioinformatics Data Analysis and Tools

These dinucleotide frequency data are derived from genome sequences

Page 20: Master’s course Bioinformatics Data Analysis and Tools
Page 21: Master’s course Bioinformatics Data Analysis and Tools
Page 22: Master’s course Bioinformatics Data Analysis and Tools
Page 23: Master’s course Bioinformatics Data Analysis and Tools

Higher Order Markov Chains

• An nth order Markov chain over some alphabet is equivalent to a first order Markov chain over the alphabet of n-tuples

• Example: a 2nd order Markov model for DNA can be treated as a 1st order Markov model over the alphabet AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG and TT (i.e. all possible dinucleotides)
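A minimal sketch of this equivalence: rewrite the sequence as overlapping n-tuples, and a 1st-order chain over those tuples carries the same information as the nth-order chain over the bases (the helper below is illustrative):

from itertools import product

alphabet = 'acgt'
dinucleotides = [''.join(p) for p in product(alphabet, repeat=2)]  # the 16 tuple states

def as_tuples(x, n=2):
    """View sequence x as overlapping n-tuples; a 1st-order chain over
    these tuples is equivalent to an nth-order chain over the bases."""
    return [x[i:i + n] for i in range(len(x) - n + 1)]

print(dinucleotides)        # ['aa', 'ac', ..., 'tt']
print(as_tuples('accgcg'))  # ['ac', 'cc', 'cg', 'gc', 'cg']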

Page 24: Master’s course Bioinformatics Data Analysis and Tools

A Fifth Order Markov Chain

Page 25: Master’s course Bioinformatics Data Analysis and Tools

Inhomogeneous Markov Chains

• In the Markov chain models we have considered so far, the probabilities do not depend on where we are in a given sequence

• In an inhomogeneous Markov model, we can have different distributions at different positions in the sequence

• Consider modeling codons in protein coding regions
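A minimal sketch of what this means for codons, assuming a separate transition table for each of the three codon positions (the tables below are uniform placeholders, not trained values):

# Inhomogeneous chain sketch: the transition table used at position i
# depends on i mod 3 (the codon position). Values are placeholders.
uniform = {a: {b: 0.25 for b in 'acgt'} for a in 'acgt'}
trans_by_pos = {0: uniform, 1: uniform, 2: uniform}   # one table per codon position
init = {a: 0.25 for a in 'acgt'}

def inhomogeneous_prob(x):
    """Pr(x) when the transition distribution depends on codon position."""
    p = init[x[0]]
    for i in range(1, len(x)):
        p *= trans_by_pos[i % 3][x[i - 1]][x[i]]
    return p

print(inhomogeneous_prob('atggcgtaa'))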

Page 26: Master’s course Bioinformatics Data Analysis and Tools

Inhomogeneous Markov Chains

Page 27: Master’s course Bioinformatics Data Analysis and Tools

A Fifth Order Inhomogeneous Markov Chain

Page 28: Master’s course Bioinformatics Data Analysis and Tools

Selecting the Order of a Markov Chain Model

• Higher order models remember more “history”

• Additional history can have predictive value

• Example:
– predict the next word in this sentence fragment “…finish __” (up, it, first, last, …?)
– now predict it given more history: “Fast guys finish __”

Page 29: Master’s course Bioinformatics Data Analysis and Tools

Selecting the Order of a Markov Chain Model

• However, the number of parameters we need to estimate grows exponentially with the order
– for modeling DNA we need 4^(n+1) parameters for an nth order model, with n ≤ 5 normally

• The higher the order, the less reliable we can expect our parameter estimates to be
– estimating the parameters of a 2nd order homogeneous Markov chain from the complete genome of E. coli, we would see each word > 72,000 times on average
– estimating the parameters of an 8th order chain, we would see each word ~ 5 times on average

Page 30: Master’s course Bioinformatics Data Analysis and Tools

Interpolated Markov Models

• The IMM idea: manage this trade-off by interpolating among models of various orders

• Simple linear interpolation, e.g. mixing orders 0 to 2:

Pr(x_i | x_{i-2}, x_{i-1}) = λ0 Pr(x_i) + λ1 Pr(x_i | x_{i-1}) + λ2 Pr(x_i | x_{i-2}, x_{i-1}),  with Σ λ_k = 1

Page 31: Master’s course Bioinformatics Data Analysis and Tools

Interpolated Markov Models

• We can make the weights depend on the history

– for a given order, we may have significantly more data to estimate some words as compared to others

• General linear interpolation: the weight given to the order-n estimate becomes a function λ_n(x_{i-n}, …, x_{i-1}) of the preceding symbols rather than a constant (see the sketch below)
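A minimal sketch of such an interpolation; the component models and weights here are toy placeholders, not GLIMMER’s trained values:

def interpolated_prob(history, sym, models, lambdas):
    """Pr(sym | history) as a weighted mix of order-0..order-n estimates.
    models[k] looks only at the last k symbols of the history;
    the weights in lambdas must sum to 1."""
    p = 0.0
    for k, lam in enumerate(lambdas):
        context = history[-k:] if k > 0 else ''
        p += lam * models[k](context, sym)
    return p

order0 = lambda ctx, s: 0.25                              # uniform background
table1 = {'c': {'a': 0.3, 'c': 0.3, 'g': 0.1, 't': 0.3}}  # toy order-1 table
order1 = lambda ctx, s: table1.get(ctx, {}).get(s, 0.25)
print(interpolated_prob('ac', 'g', [order0, order1], [0.5, 0.5]))  # 0.175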

Page 32: Master’s course Bioinformatics Data Analysis and Tools

Gene Finding: Search by Content

• Encoding a protein affects the statistical properties of a DNA sequence
– some amino acids are used more frequently than others (Leu more popular than Trp)
– different numbers of codons for different amino acids (Leu has 6, Trp has 1)
– for a given amino acid, usually one codon is used more frequently than others

• This is termed codon preference

• Codon preferences vary by species

Page 33: Master’s course Bioinformatics Data Analysis and Tools

Codon Preference in E. coli

AA    codon    /100
--------------------
Gly   GGG      1.89
Gly   GGA      0.44
Gly   GGU     52.99
Gly   GGC     34.55
Glu   GAG     15.68
Glu   GAA     57.20
Asp   GAU     21.63
Asp   GAC     43.26

Page 34: Master’s course Bioinformatics Data Analysis and Tools

Search by Content

• Common way to search by content
– build Markov models of coding & noncoding regions
– apply the models to ORFs (Open Reading Frames) or fixed-sized windows of sequence

• GeneMark [Borodovsky et al.]
– popular system for identifying genes in bacterial genomes
– uses 5th order inhomogeneous Markov chain models

Page 35: Master’s course Bioinformatics Data Analysis and Tools

The GLIMMER System

• Salzberg et al., 1998

• System for identifying genes in bacterial genomes

• Uses 8th order, inhomogeneous, interpolated Markov chain models

Page 36: Master’s course Bioinformatics Data Analysis and Tools

IMMs in GLIMMER

• How does GLIMMER determine the λ values?

• First, let us express the IMM probability calculation recursively:

IMM_n(x_i) = λ_n(x_{i-n}, …, x_{i-1}) · P_n(x_i | x_{i-n}, …, x_{i-1}) + [1 − λ_n(x_{i-n}, …, x_{i-1})] · IMM_{n−1}(x_i)

Page 37: Master’s course Bioinformatics Data Analysis and Tools

IMMs in GLIMMER

• If we haven’t seen xi-1… xi-n more than 400 times, then compare the counts for the following:

• Use a statistical test (χ²) to get a value d indicating our confidence that the distributions represented by the two sets of counts are different [χ² = Σ (O − E)² / E]

..the sequence goes from left to right here…
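As an illustration, the χ² statistic itself is easy to compute from the two sets of counts (the counts below are made up; turning the statistic into GLIMMER’s confidence value d is not shown here):

def chi2_stat(observed, expected):
    """Chi-squared statistic: sum over the four bases of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [30, 10, 45, 15]   # next-base counts after the full n-long context
expected = [25, 25, 25, 25]   # counts predicted by the shorter context
print(chi2_stat(observed, expected))   # 30.0 here; a large value suggests
                                       # the two distributions really differ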

Page 38: Master’s course Bioinformatics Data Analysis and Tools

IMMs in GLIMMER

χ² score when comparing the nth-order with the (n−1)th-order Markov model (preceding slide)

Page 39: Master’s course Bioinformatics Data Analysis and Tools

The GLIMMER method

• 8th order IMM vs. 5th order Markov model

• Trained on 1168 genes (ORFs really)

• Tested on 1717 annotated (more or less known) genes

Page 40: Master’s course Bioinformatics Data Analysis and Tools

PPV (positive predictive value)

Page 41: Master’s course Bioinformatics Data Analysis and Tools

PPV

Page 42: Master’s course Bioinformatics Data Analysis and Tools

Plot of sensitivity against 1 − specificity (an ROC curve)

Page 43: Master’s course Bioinformatics Data Analysis and Tools
Page 44: Master’s course Bioinformatics Data Analysis and Tools
Page 45: Master’s course Bioinformatics Data Analysis and Tools

Hidden Markov models (HMMs)

Given say a T in our input sequence, which state emitted it?

Page 46: Master’s course Bioinformatics Data Analysis and Tools

A Simple HMM

Page 47: Master’s course Bioinformatics Data Analysis and Tools

Hidden Markov Models

1. x1,…,xn – the sequence of observations

2. Q – states; π1,…,πn – the hidden sequence of states

π = (π1,…,πN)T – the initial probabilities of the states

3. A = (aij) – the transition matrix

4. E = (ei(x)) – the emission probabilities

Page 48: Master’s course Bioinformatics Data Analysis and Tools

Hidden Markov models (HMMs)

Hidden State

• We will distinguish between the observed parts of a problem and the hidden parts

• In the Markov models we have considered previously, it is clear which state accounts for each part of the observed sequence

• In the model above (preceding slide), there are multiple states that could account for each part of the observed sequence
– this is the hidden part of the problem
– states are decoupled from sequence symbols

Page 49: Master’s course Bioinformatics Data Analysis and Tools

HMM-based homology searching

HMM for ungapped alignment…

Transition probabilities and Emission probabilities

Gapped HMMs also have insertion and deletion states (next slide)

Page 50: Master’s course Bioinformatics Data Analysis and Tools

Profile HMM: m = match state, i = insert state, d = delete state; go from left to right. The i and m states output amino acids; the d states are “silent”.

[Diagram: delete states d1–d4 on top, insert states I0–I4 in the middle, match states m0–m5 along the bottom, connected from Start to End]

Model for alignment with insertions and deletions

Page 51: Master’s course Bioinformatics Data Analysis and Tools

HMM-based homology searching

• The most widely used HMM-based profile searching tools currently are SAM-T2K (Karplus et al., 1998) and HMMER2 (Eddy, 1998)

• Formal probabilistic basis and consistent theory behind gap and insertion scores

• HMMs are good for profile searches, but still not good for alignment (due to the parametrization of the models)

• HMMs are slow

Page 52: Master’s course Bioinformatics Data Analysis and Tools

Homology-derived Secondary Structure of Proteins (HSSP)

Sander & Schneider, 1991

It’s all about trying to push the “don’t know region” down…

Page 53: Master’s course Bioinformatics Data Analysis and Tools

The Parameters of an HMM

Page 54: Master’s course Bioinformatics Data Analysis and Tools

HMM for Eukaryotic Gene Finding

Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

Page 55: Master’s course Bioinformatics Data Analysis and Tools

A Simple HMM

Page 56: Master’s course Bioinformatics Data Analysis and Tools

Three Important Questions

• How likely is a given sequence?
the Forward algorithm

• What is the most probable “path” for generating a given sequence?
the Viterbi algorithm

• How can we learn the HMM parameters given a set of sequences?
the Forward-Backward (Baum-Welch) algorithm

Page 57: Master’s course Bioinformatics Data Analysis and Tools

Three basic problems of HMMs

Once we have an HMM, there are three problems of interest.

(1) The Evaluation Problem
Given an HMM and a sequence of observations, what is the probability that the observations are generated by the model?

(2) The Decoding Problem
Given a model and a sequence of observations, what is the most likely state sequence in the model that produced the observations?

(3) The Learning Problem
Given a model and a sequence of observations, how should we adjust the model parameters in order to maximize the probability of the observations given the model?

The evaluation problem can be used for isolated (word) recognition. The decoding problem is related to continuous recognition as well as to segmentation. The learning problem must be solved if we want to train an HMM for subsequent use in recognition tasks.

Page 58: Master’s course Bioinformatics Data Analysis and Tools

Three main problems

i) Given A+E, what is the probability of (x1, ..., xn)? — the forward algorithm

ii) Given A+E and (x1, ..., xn), what is (π1, ..., πn)? — the Viterbi algorithm

iii) Given (x1, ..., xn), what are A+E? — the Baum-Welch algorithm

iv) Given A+E and (x1, ..., xn), what is the most probable state at step i? — the forward-backward algorithm

Page 59: Master’s course Bioinformatics Data Analysis and Tools

Three Important Questions

• How likely is a given sequence?

Forward algorithm

• What is the most probable “path” for generating a given sequence?

• How can we learn the HMM parameters given a set of sequences?

Page 60: Master’s course Bioinformatics Data Analysis and Tools

How Likely is a Given Sequence?

• The probability that the path π is taken and the sequence x is generated:

Pr(x_1, …, x_L, π) = a_{0 π_1} · ∏_{i=1..L} e_{π_i}(x_i) · a_{π_i π_{i+1}}

• (assuming begin/end are the only silent states on the path)

Page 61: Master’s course Bioinformatics Data Analysis and Tools

How Likely is a Given Sequence?

Page 62: Master’s course Bioinformatics Data Analysis and Tools

How Likely is a Given Sequence?

The probability over all paths is:

Pr(x) = Σ_π Pr(x, π)

but the number of paths can be exponential in the length of the sequence...

• the Forward algorithm enables us to compute this efficiently

Page 63: Master’s course Bioinformatics Data Analysis and Tools

How Likely is a Given Sequence: The Forward Algorithm

• Define fk(i) to be the probability of being in state k having observed the first i characters of x

• We want to compute fN(L), the probability of being in the end state having observed all of x

• We can define this recursively

Page 64: Master’s course Bioinformatics Data Analysis and Tools

How Likely is a Given Sequence:

Page 65: Master’s course Bioinformatics Data Analysis and Tools

The forward algorithm

• Initialisation:
f0(0) = 1 (start), fk(0) = 0 (other silent states k)
(probability that we’re in the start state and have observed 0 characters from the sequence)

• Recursion:
fl(i) = el(xi) Σk fk(i−1) akl   (emitting states)
fl(i) = Σk fk(i) akl   (silent states)

• Termination:
Pr(x) = Pr(x1…xL) = fN(L) = Σk fk(L) akN
(probability that we are in the end state and have observed the entire sequence)
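A runnable sketch of this recursion for a model with no internal silent states, folding the begin state into an initial distribution pi (the toy island/genome parameters below are invented for illustration):

def forward(x, states, pi, A, E):
    """Pr(x) summed over all state paths (no internal silent states)."""
    f = {k: pi[k] * E[k][x[0]] for k in states}          # initialisation
    for sym in x[1:]:                                    # recursion
        f = {l: E[l][sym] * sum(f[k] * A[k][l] for k in states)
             for l in states}
    return sum(f.values())                               # termination

states = ['I', 'G']                                      # e.g. CpG island / rest of genome
pi = {'I': 0.5, 'G': 0.5}
A = {'I': {'I': 0.9, 'G': 0.1}, 'G': {'I': 0.1, 'G': 0.9}}
E = {'I': {'a': 0.2, 'c': 0.3, 'g': 0.3, 't': 0.2},
     'G': {'a': 0.3, 'c': 0.2, 'g': 0.2, 't': 0.3}}
print(forward('cggt', states, pi, A, E))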

Page 66: Master’s course Bioinformatics Data Analysis and Tools

Forward algorithm example

..at every step, summing over all possible ways of getting to the considered state

Page 67: Master’s course Bioinformatics Data Analysis and Tools

Three Important Questions

• How likely is a given sequence?

• What is the most probable “path” for generating a given sequence?

Viterbi algorithm

• How can we learn the HMM parameters given a set of sequences?

Page 68: Master’s course Bioinformatics Data Analysis and Tools

Finding the Most Probable Path: The Viterbi Algorithm

• Define vk(i) to be the probability of the most probable path accounting for the first i characters of x and ending in state k

• We want to compute vN(L), the probability of the most probable path accounting for all of the sequence and ending in the end state

• Can be defined recursively

• Can use dynamic programming (DP) to find vN(L) efficiently

Page 69: Master’s course Bioinformatics Data Analysis and Tools

Finding the Most Probable Path: The Viterbi Algorithm

• Andrew Viterbi used a Manhattan grid model to solve this Decoding problem.

• Every choice of π = π1 … πn corresponds to a path in the graph.

• The only valid direction in the graph is eastward.

• This graph has N²(n−1) edges, where N is the number of states.

Page 70: Master’s course Bioinformatics Data Analysis and Tools

Finding the Most Probable Path: The Viterbi Algorithm

Manhattan grid model

Page 71: Master’s course Bioinformatics Data Analysis and Tools

Finding the Most Probable Path: The Viterbi Algorithm

Viterbi – recursive step
What is the probability of the path which ends with the transition qA → qB and emission EB?

Page 72: Master’s course Bioinformatics Data Analysis and Tools

Finding the Most Probable Path: The Viterbi Algorithm

Viterbi – recursive step
What is the most probable path to state B in step i?

V(i, q) = eq(xi) · max_s [ V(i−1, s) · a_{s,q} ]

Page 73: Master’s course Bioinformatics Data Analysis and Tools

Finding the Most Probable Path: The Viterbi Algorithm

Viterbi – recursive step

def viterbi(x_seq, Q, fi, A, E):
    """Most probable state path for the observations x_seq.
    Q: states, fi: initial probabilities, A: transitions, E: emissions."""
    # T[q] = (probability of the best path ending in state q, that path)
    T = {}
    for q in Q:
        T[q] = (fi[q] * E[q][x_seq[0]], [q])   # initialise with the first emission
    for x in x_seq[1:]:
        U = {}
        for q_next in Q:
            max_path, max_p = None, 0.0
            for q in Q:
                prob, path = T[q]
                p = prob * A[q][q_next] * E[q_next][x]
                if p > max_p:
                    max_p, max_path = p, path + [q_next]
            U[q_next] = (max_p, max_path)
        T = U
    return max(T.values())                     # (best probability, best path)

Page 74: Master’s course Bioinformatics Data Analysis and Tools

Finding the Most Probable Path: The Viterbi Algorithm

Initialisation:
v0(0) = 1 (start), vk(0) = 0 (non-silent states)

Recursion for emitting states (i = 1…L):
vl(i) = el(xi) maxk [ vk(i−1) akl ]

Recursion for silent states:
vl(i) = maxk [ vk(i) akl ]

Page 75: Master’s course Bioinformatics Data Analysis and Tools

Finding the Most Probable Path: The Viterbi Algorithm

Page 76: Master’s course Bioinformatics Data Analysis and Tools

Three Important Questions

• How likely is a given sequence? (clustering)

• What is the most probable “path” for generating a given sequence? (alignment)

• How can we learn the HMM parameters given a set of sequences?

The Baum-Welch Algorithm

Page 77: Master’s course Bioinformatics Data Analysis and Tools

The Learning Problem

Generally, the learning problem is how to adjust the HMM parameters so that the given set of observations (called the training set) is represented by the model in the best way for the intended application.

Thus the “quantity” we wish to optimize during the learning process can differ from application to application. In other words, there may be several optimization criteria for learning, out of which a suitable one is selected depending on the application.

There are two main optimization criteria found in the literature: Maximum Likelihood (ML) and Maximum Mutual Information (MMI).

Page 78: Master’s course Bioinformatics Data Analysis and Tools

The Learning Task

• Given:
– a model
– a set of sequences (the training set)

• Do:
– find the most likely parameters to explain the training sequences

• The goal is to find a model that generalizes well to sequences we haven’t seen before

Page 79: Master’s course Bioinformatics Data Analysis and Tools

Learning Parameters

• If we know the state path for each training sequence, learning the model parameters is simple
– no hidden state during training
– count how often each parameter is used
– normalize/smooth to get probabilities
– process just like it was for Markov chain models

• If we don’t know the path for each training sequence, how can we determine the counts?
– key insight: estimate the counts by considering every path weighted by its probability

Page 80: Master’s course Bioinformatics Data Analysis and Tools

Learning Parameters: The Baum-Welch Algorithm

• An EM (expectation maximization) approach; a forward-backward algorithm

• Algorithm sketch:
– initialize the parameters of the model
– iterate until convergence:
• calculate the expected number of times each transition or emission is used
• adjust the parameters to maximize the likelihood of these expected values

• An important feature of Baum-Welch is that it always converges

Page 81: Master’s course Bioinformatics Data Analysis and Tools

The Expectation step

Page 82: Master’s course Bioinformatics Data Analysis and Tools

The Expectation step

Page 83: Master’s course Bioinformatics Data Analysis and Tools

The Expectation step

Page 84: Master’s course Bioinformatics Data Analysis and Tools

The Expectation step

Page 85: Master’s course Bioinformatics Data Analysis and Tools

The Expectation step

• First, we need to know the probability of the i-th symbol being produced by state k, given the sequence x:

Pr(πi = k | x) = fk(i) bk(i) / Pr(x)

• Given this, we can compute our expected counts for state transitions and character emissions

Page 86: Master’s course Bioinformatics Data Analysis and Tools

The Expectation step

Page 87: Master’s course Bioinformatics Data Analysis and Tools

The Backward Algorithm
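The backward values bk(i) = Pr(x_{i+1}…x_L | πi = k) are filled in from the right-hand end of the sequence; a minimal sketch under the same conventions as the forward sketch earlier (initial distribution pi, no internal silent states):

def backward(x, states, A, E):
    """bs[i][k] = Pr(x[i+1:] | state k at position i), 0-based positions,
    so the last column is all ones."""
    L = len(x)
    bs = [None] * L
    bs[L - 1] = {k: 1.0 for k in states}                  # initialisation
    for i in range(L - 2, -1, -1):                        # recursion, right to left
        bs[i] = {k: sum(A[k][l] * E[l][x[i + 1]] * bs[i + 1][l]
                        for l in states)
                 for k in states}
    return bs

# Combined with the forward values, this gives the posterior used above:
# Pr(pi_i = k | x) = f_k(i) * b_k(i) / Pr(x)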

Page 88: Master’s course Bioinformatics Data Analysis and Tools

The Expectation step

Page 89: Master’s course Bioinformatics Data Analysis and Tools

The Expectation step

Page 90: Master’s course Bioinformatics Data Analysis and Tools

The Expectation step

Page 91: Master’s course Bioinformatics Data Analysis and Tools

The Maximization step

..make it a probability by dividing by the total number of emissions out of state k

Page 92: Master’s course Bioinformatics Data Analysis and Tools

The Maximization step

..make it a probability by dividing by the total number of transitions out of state k

Page 93: Master’s course Bioinformatics Data Analysis and Tools

The Baum-Welch Algorithm

• Initialize parameters of model

• Iterate until convergence
– calculate the expected number of times each transition or emission is used
– adjust the parameters to maximize the likelihood of these expected values

• This algorithm will converge to a local maximum (in the likelihood of the data given the model)

• Usually in a fairly small number of iterations
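Putting the expectation and maximization steps together, a sketch of one Baum-Welch iteration on a single training sequence (a sketch only: no smoothing, initial probabilities held fixed; the forward/backward helpers repeat the earlier sketches):

def forward_table(x, states, pi, A, E):
    fs = [{k: pi[k] * E[k][x[0]] for k in states}]
    for sym in x[1:]:
        fs.append({l: E[l][sym] * sum(fs[-1][k] * A[k][l] for k in states)
                   for l in states})
    return fs

def backward_table(x, states, A, E):
    bs = [None] * len(x)
    bs[-1] = {k: 1.0 for k in states}
    for i in range(len(x) - 2, -1, -1):
        bs[i] = {k: sum(A[k][l] * E[l][x[i + 1]] * bs[i + 1][l]
                        for l in states) for k in states}
    return bs

def baum_welch_step(x, states, alphabet, pi, A, E):
    """One EM iteration: expected counts (E-step), then renormalise (M-step)."""
    fs = forward_table(x, states, pi, A, E)
    bs = backward_table(x, states, A, E)
    px = sum(fs[-1][k] for k in states)                  # Pr(x)
    trans = {k: {l: 0.0 for l in states} for k in states}
    emit = {k: {s: 0.0 for s in alphabet} for k in states}
    for i, sym in enumerate(x):                          # expected emission counts
        for k in states:
            emit[k][sym] += fs[i][k] * bs[i][k] / px     # Pr(pi_i = k | x)
    for i in range(len(x) - 1):                          # expected transition counts
        for k in states:
            for l in states:
                trans[k][l] += (fs[i][k] * A[k][l] * E[l][x[i + 1]]
                                * bs[i + 1][l] / px)
    new_A = {k: {l: trans[k][l] / sum(trans[k].values()) for l in states}
             for k in states}
    new_E = {k: {s: emit[k][s] / sum(emit[k].values()) for s in alphabet}
             for k in states}
    return new_A, new_E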

Page 94: Master’s course Bioinformatics Data Analysis and Tools

References

• bioalgorithms.info

• Wikipedia

• R. Durbin, S.R. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, ISBN 0-521-62971-3 (paperback)