chap 10. gene prediction. sequence-based gene discovery good for prokaryote eukaryotes no...

Chap 10. Gene Prediction

Sequence-based Gene Discovery

Good for prokaryotes

Eukaryotes

No Shine-Dalgarno sequence to mark the first start codon

The TSS in eukaryotes may not be well defined

Most ORFs have multiple exons

The only information on the exon-intron boundary is the GT..AG consensus

Content-based Gene Prediction

EasyGene: codon frequency is used for gene prediction and for exon/intron boundaries

Search for CpG islands

Not precise

• CG is the least frequent dimer sequence
• The C in CG is easily methylated, and methyl-C tends to mutate to T
• Methylation is often suppressed around genes, in CG islands
• Find CG islands in long DNA sequences
  • Calculate the log-odds ratio over a sliding window of a fixed length
  • Declare a CG island if the score is positive
• Disadvantage of the approach
  • The length of a CG island is not known in advance
  • => use an HMM

CG Islands

• Markov Model (MM)
  • Examine correlation in sequences
  • In a long sequence, suppose AA a is observed $n_a$ times and is followed by b $n_{ab}$ times
  • $r_{ab} = P(x_i = b \mid x_{i-1} = a) = n_{ab} / n_a$
• A first-order Markov model of a sequence is defined by
  • An alphabet
  • A matrix of conditional probabilities $r_{ab}$
  • A set of frequencies for the initial state, $q_a$
• The likelihood of a sequence $x_1, x_2, \ldots, x_N$ under the first-order model is
  • $L = q_{x_1} \prod_{i=2}^{N} r_{x_{i-1} x_i}$
• If there is no correlation, $r_{ab} = q_b$ and $L = \prod_{i=1}^{N} q_{x_i}$ (zero-order MM)
• kth-order MM
  • For k = 2, $r_{abc} = P(x_i = c \mid x_{i-1} = b, x_{i-2} = a) = n_{abc} / n_{ab}$
  • A letter depends on the k preceding letters

Probability-based Models
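The counting estimates on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names and the toy DNA string are hypothetical, and $n_a$ is counted only over positions that have a successor so that each row of $r_{ab}$ sums to 1.

```python
from collections import Counter
from math import log

def fit_first_order(seq):
    """Estimate initial frequencies q_a and transition probabilities r_ab by counting."""
    pair_counts = Counter(zip(seq, seq[1:]))   # n_ab
    n_a = Counter(seq[:-1])                    # n_a over positions that have a successor
    q = {a: c / len(seq) for a, c in Counter(seq).items()}
    r = {(a, b): c / n_a[a] for (a, b), c in pair_counts.items()}
    return q, r

def log_likelihood(seq, q, r):
    """log L = log q_{x_1} + sum over i >= 2 of log r_{x_{i-1} x_i}."""
    total = log(q[seq[0]])
    for a, b in zip(seq, seq[1:]):
        total += log(r[(a, b)])   # raises KeyError for a transition never seen in training
    return total

# Hypothetical toy sequence, used only to exercise the functions.
q, r = fit_first_order("ACGCGCGTACGCGA")
print(log_likelihood("ACGCGCGTACGCGA", q, r))
```

Working with log-likelihoods avoids numerical underflow on long sequences; unseen transitions would need pseudo-counts.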

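• The dealer uses a fair coin but occasionally switches to a biased coin
• Given a sequence of coin tosses, determine when the dealer used the fair or the biased coin
• For n tosses with sequence $x = x_1 x_2 \ldots x_n$, of which k are heads
  • $P(x \mid \text{fair}) = \prod_{i=1}^{n} \tfrac{1}{2} = (1/2)^n$
  • $P(x \mid \text{biased}) = q^k (1-q)^{n-k}$, where q is the biased coin's probability of heads
• Log-odds ratio
  • $R = \log_2 [P(x \mid \text{fair}) / P(x \mid \text{biased})]$
  • For $q = 3/4$, $R = n - k \log_2 3$
  • If $R < 0$, the biased coin is more likely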

Fair Bet Casino Problem
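As a small worked sketch of the log-odds ratio, the function below computes R directly from the two likelihoods. It assumes base-2 logarithms, that tosses are written as a string of '0'/'1' with '1' meaning heads, and that q is the biased coin's probability of heads; the function name is hypothetical.

```python
from math import log2

def log_odds_fair_vs_biased(tosses, q=0.75):
    """R = log2 P(x|fair) - log2 P(x|biased) for a string of '0'/'1' tosses ('1' = heads)."""
    n = len(tosses)
    k = tosses.count("1")
    log_fair = -n                                        # log2 of (1/2)^n
    log_biased = k * log2(q) + (n - k) * log2(1 - q)     # log2 of q^k (1-q)^(n-k)
    return log_fair - log_biased

# A negative R favours the biased coin, a positive R the fair coin.
print(log_odds_fair_vs_biased("1111"))
print(log_odds_fair_vs_biased("0101"))
```

For q = 3/4 the result reduces to n - k log2(3), which is the closed form on the slide.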

• An abstract machine that emits symbols
• At each discrete step, the HMM makes two decisions
  • What is the next state?
  • What symbol to emit?

Hidden Markov Model (HMM)

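[Figure: two-state HMM with states F (fair) and B (biased); the two transitions between F and B each have probability 0.1; each state emits H or T]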

• Given a path P = FFFBBBBBFFF
• And output x = 01011101001
• P(x|P) = ½ ½ ½ ¾ ¾ ¾ ¼ ¾ ½ ½ ½
• P(P) = ½ 9/10 9/10 1/10 9/10 9/10 9/10 ….
• Find the path that maximizes P(x|P) P(P) over all possible paths P

Coin Toss
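The two products on this slide can be checked with exact fractions. The sketch below assumes the model implied by the numbers shown: both coins are equally likely at the start, each state is kept with probability 9/10, and the biased coin emits 1 with probability 3/4. The constant and function names are hypothetical.

```python
from fractions import Fraction

# Assumed parameters, read off the products shown on the slide.
EMIT = {"F": {"0": Fraction(1, 2), "1": Fraction(1, 2)},
        "B": {"0": Fraction(1, 4), "1": Fraction(3, 4)}}
START = Fraction(1, 2)
SWITCH = Fraction(1, 10)

def emission_prob(x, path):
    """P(x|P): product of the emission probabilities along the path."""
    p = Fraction(1)
    for symbol, state in zip(x, path):
        p *= EMIT[state][symbol]
    return p

def path_prob(path):
    """P(P): start probability times one transition probability per step."""
    p = START
    for prev, cur in zip(path, path[1:]):
        p *= SWITCH if prev != cur else 1 - SWITCH
    return p

x, path = "01011101001", "FFFBBBBBFFF"
print(emission_prob(x, path), path_prob(path), emission_prob(x, path) * path_prob(path))
```

The product of the two terms is the joint probability of the output and the path, which is the quantity a decoding algorithm compares across paths.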

• A letter depends on the preceding letters AND on the hidden state
• In the helical/loop problem there are two hidden states: 0 for loop, 1 for helical
• $r_{B0}$, $r_{B1}$: probabilities that the first residue is loop or helical
• $r_{11}$, $r_{10}$: probabilities of remaining in helical or switching to loop
• etc.

HMM of Loop/Helical

• Emission probabilities are set equal to the AA frequencies
  • $e_0(a) = p_a^{l}$, $e_1(a) = p_a^{h}$
• AAs occur independently as long as the model stays in state 0 or state 1 (zero-order)
• Transitions between hidden states are modeled as first-order
• The values of the transition probabilities ($r_{11}$, $r_{10}$, …) control the relative frequency and relative lengths of the regions
• If $r_{01}$ is very small, it is difficult to initiate a new helical region, etc.
• e.g.
  Sequence x_i:  GHMESSLLKQTINSWHLN
  Path π_i:     B001111000011110000E


• Path variables $\pi_i$ describe the hidden states
• Likelihood of the path $\pi$ in the example:
  • $L = [r_{B0} e_0(G)] [r_{00} e_0(H)] [r_{01} e_1(M)] \cdots [r_{00} e_0(N)] r_{0E}$
• The model can be used to determine the most likely positions of helices and loops within the sequence (the decoding problem)
• Two ways of doing this
  • Viterbi
    • Find the most probable path through the model, i.e., the sequence of hidden states with the highest L
    • This gives a straightforward prediction that each site is either helix or loop
  • Forward/Backward
    • Consider all possible paths through the model, weighted according to their likelihood, and calculate the probability that each site is in each hidden state


• The most probable paths depend on the emission and transition probabilities
• Parameters are determined from known structure information; from these we can calculate ML values of the probabilities
• As in the Profile Model, ML values of the emission frequencies are given by the observed frequencies
  • The simplest way of choosing $p_a$ is to use $n_a / n_{tot}$ (which indeed maximizes L)
  • If AA a occurs $n_a^k$ times in regions of state k, and the total number of residues in state k is $n_{tot}^k$,
  • $e_k(a) = n_a^k / n_{tot}^k$
• If state j follows state k $m_{kj}$ times,
  • $r_{kj} = m_{kj} / n_{tot}^k$ (first-order model)
• If some transitions occur very rarely, or not at all, it is best to use prior information in choosing the frequencies (by adding pseudo-counts)

HMM Parameters
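A minimal sketch of supervised estimation with pseudo-counts follows. The function name, the uniform pseudo-count of 1, and the use of a single labeled sequence are all assumptions for illustration. The transition denominator here is the number of transitions leaving state k (Begin and End transitions are ignored), a slight simplification of $n_{tot}^k$.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def estimate(pairs, alphabet, states, pseudo=1.0):
    """Estimate e_k(a) and r_kj from (sequence, path) pairs by counting with pseudo-counts."""
    n = Counter()   # n[(k, a)]: times letter a is emitted in state k
    m = Counter()   # m[(k, j)]: times state j follows state k
    for seq, path in pairs:
        for a, k in zip(seq, path):
            n[(k, a)] += 1
        for k, j in zip(path, path[1:]):
            m[(k, j)] += 1
    emit, trans = {}, {}
    for k in states:
        n_tot = sum(n[(k, a)] for a in alphabet) + pseudo * len(alphabet)
        emit[k] = {a: (n[(k, a)] + pseudo) / n_tot for a in alphabet}
        m_tot = sum(m[(k, j)] for j in states) + pseudo * len(states)
        trans[k] = {j: (m[(k, j)] + pseudo) / m_tot for j in states}
    return emit, trans

# The labeled example from the helix/loop slide (0 = loop, 1 = helical).
emit, trans = estimate([("GHMESSLLKQTINSWHLN", "001111000011110000")], AMINO_ACIDS, "01")
print(trans)
```

With pseudo-counts every emission and transition keeps a nonzero probability, so residues absent from the training data do not force the likelihood of a new sequence to zero.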

• If parameters are chosen from a set of known examples, this is referred to as supervised learning

• Unsupervised learning maximizes the likelihood within the framework of the model, but without being told what parameters to learn
  • We can define a model with two hidden states, defined by two sets of frequencies, without specifying the meaning of the two states
  • The learning process will then determine the best way of partitioning the sequence into two different types of subsequences
• The simplest way of implementing an HMM with unsupervised learning is Viterbi training

HMM Parameters

• Start with an initial guess for the model parameters
• Calculate the hidden states on the most probable path for each sequence in the training set
• From this, calculate $n_a^k$ and $m_{kj}$ along the most probable paths
• Re-estimate the parameters from these counts and repeat until the paths no longer change

Viterbi Algorithm

• Given a sequence $x_1, \ldots, x_N$,
• $v_k(i)$: likelihood of the most probable path for the first i letters in the sequence, given that the i-th letter is in state k
• Initialize $v_k(1) = r_{Bk} e_k(x_1)$
• Then $v_k(i) = \max_h [v_h(i-1) r_{hk}] e_k(x_i)$ for $i = 2, \ldots, N$
• $v_E = \max_h [v_h(N) r_{hE}]$: likelihood of the best total path
• This is a dynamic programming algorithm

Viterbi Algorithm
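The Viterbi recursion can be sketched as follows. This is an illustrative implementation, not the lecture's code: it uses plain probabilities rather than logs (fine for short sequences only), treats the End transition as optional, and is exercised on the coin-toss model with assumed parameters (equal start probabilities, 0.1 switching, biased coin emitting 1 with probability 0.75).

```python
def viterbi(x, states, start, trans, emit, end=None):
    """Return (likelihood of the best path, best path) for the observation string x.

    start[k] = r_Bk, trans[h][k] = r_hk, emit[k][a] = e_k(a), end[k] = r_kE.
    If end is None, every state may end the sequence with probability 1.
    """
    end = end or {k: 1.0 for k in states}
    v = [{k: start[k] * emit[k][x[0]] for k in states}]   # v_k(1)
    back = []                                             # back-pointers per step
    for a in x[1:]:
        prev, col, ptr = v[-1], {}, {}
        for k in states:
            best = max(states, key=lambda h: prev[h] * trans[h][k])
            ptr[k] = best
            col[k] = prev[best] * trans[best][k] * emit[k][a]
        v.append(col)
        back.append(ptr)
    last = max(states, key=lambda k: v[-1][k] * end[k])
    best_l = v[-1][last] * end[last]
    path = [last]
    for ptr in reversed(back):                            # trace back from the last site
        path.append(ptr[path[-1]])
    return best_l, "".join(reversed(path))

# Assumed coin-toss model: F = fair, B = biased.
STATES = "FB"
START = {"F": 0.5, "B": 0.5}
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"0": 0.5, "1": 0.5}, "B": {"0": 0.25, "1": 0.75}}

print(viterbi("01011101001", STATES, START, TRANS, EMIT))
```

For long sequences the products underflow, so practical implementations add log-probabilities instead of multiplying.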

• Expectation-maximization, using forward/backward
• Expectation: calculate the probability $P(\pi_i = k)$ that site i is in state k
  • Then the expected number of times letter a appears in state k, averaged over all possible paths, is $E[n_a^k] = \sum_{i: x_i = a} P(\pi_i = k)$
  • Also, $E[m_{kj}] = \sum_i P(\pi_i = k, \pi_{i+1} = j)$
• Maximization: substitute $E[n_a^k]$ and $E[m_{kj}]$ into $e_k(a) = n_a^k / n_{tot}^k$ and $r_{kj} = m_{kj} / n_{tot}^k$
• Expectation-maximization is repeated until the parameters no longer change
• How do we calculate $P(\pi_i = k)$?

Baum-Welch Algorithm

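• How do we calculate $P(\pi_i = k)$?
• Forward: $f_k(i)$ is the sum of likelihoods of all paths for $x_1, \ldots, x_i$ that end in state k at site i
  • $f_k(1) = r_{Bk} e_k(x_1)$
  • $f_k(i) = e_k(x_i) \sum_h f_h(i-1) r_{hk}$ ($i = 2, \ldots, N$)
  • $L_{tot} = \sum_h f_h(N) r_{hE}$
• Backward: $b_k(i)$ is the sum of likelihoods of all paths for $x_{i+1}, \ldots, x_N$, given state k at site i
  • $b_k(N) = r_{kE}$
  • $b_k(i) = \sum_h r_{kh} e_h(x_{i+1}) b_h(i+1)$ ($i = N-1, \ldots, 1$)
  • $L_{tot} = \sum_h r_{Bh} e_h(x_1) b_h(1)$
• Both values of $L_{tot}$ must be identical
• $P(\pi_i = k) = f_k(i) b_k(i) / L_{tot}$
• $P(\pi_i = k, \pi_{i+1} = j) = f_k(i) r_{kj} e_j(x_{i+1}) b_j(i+1) / L_{tot}$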

Baum-Welch Algorithm
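The forward and backward passes can be sketched as below; this covers only the expectation step's site posteriors, not a full Baum-Welch loop. The coin-toss parameters and the End probabilities of 1 are assumptions made for illustration, and plain probabilities are used, so the sketch suits short sequences only.

```python
def forward_backward(x, states, start, trans, emit, end):
    """Return (L_tot from forward, L_tot from backward, posteriors P(pi_i = k))."""
    n = len(x)
    # Forward: f[i][k] sums over all paths for x[0..i] that are in state k at site i.
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, n):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][h] * trans[h][k] for h in states)
                  for k in states})
    l_fwd = sum(f[-1][h] * end[h] for h in states)
    # Backward: b[i][k] sums over all continuations x[i+1..] given state k at site i.
    b = [None] * n
    b[-1] = {k: end[k] for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(trans[k][h] * emit[h][x[i + 1]] * b[i + 1][h] for h in states)
                for k in states}
    l_bwd = sum(start[h] * emit[h][x[0]] * b[0][h] for h in states)
    post = [{k: f[i][k] * b[i][k] / l_fwd for k in states} for i in range(n)]
    return l_fwd, l_bwd, post

# Assumed coin-toss model: F = fair, B = biased; no explicit End state.
STATES = "FB"
START = {"F": 0.5, "B": 0.5}
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"0": 0.5, "1": 0.5}, "B": {"0": 0.25, "1": 0.75}}
END = {"F": 1.0, "B": 1.0}

l_fwd, l_bwd, post = forward_backward("01011101001", STATES, START, TRANS, EMIT, END)
print(l_fwd, l_bwd)
print([round(p["B"], 3) for p in post])
```

Comparing the two totals is a cheap consistency check on an implementation, and the per-site posteriors are the quantities summed to obtain the expected counts.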

• Acceptor sequence
  • [TC]…[TC][N][CT][A][G][G]
• Donor sequence
  • [CA][A][G][G][T][AG][A][G]
• 175 main states, trained on 1000 samples each

Intron Sites

• Acceptor sequence
  • [TC]…[TC][N][CT][A][G][G]
• Donor sequence
  • [CA][A][G][G][T][AG][A][G]
• Trained with 500 internal exons
• Periodicity of exon distribution
  • A+G in phase
  • C+T in antiphase

Exon

• M1-M0 model
  • First-order transitions between hidden states, zero-order (independent) letters
  • Also, $e_k(a)$ depends on the hidden state k, but not on the previous letter
• Example of an M1-M0 model
  • Occasionally dishonest casino, Durbin et al. (1998)

Helical/Loop Example

• HMM model
  • Krogh et al. (2001)

Helical/Loop Example

Coiled Coil Example

• Coiled coils are associations of two or more α helices that wrap around each other
• Found in many proteins: tropomyosin, hemagglutinin (influenza virus), DNA-binding transcription factors
• About 3.5 residues per turn, leading to a repeating pattern of seven residues (a heptad) in two turns
• The helices attract one another through hydrophobic residues at sites a and d
• Lupas, Van Dyke, and Stock (1991)
  • Developed a profile score system dependent on the relative amino acid frequencies at each site (similar to $L_{profile} = \prod_{i=1}^{N} p_{i x_i}$ and $S = \ln(L_{profile} / L_0) = \sum_{i=1}^{N} \ln(p_{i x_i} / p_{x_i})$ in the profile model)
  • Used a sliding window of 28 residues (four heptads)

Coiled Coil Example

• Delorenzi and Speed (2002)
  • HMM with 9 groups of states plus Begin/End
  • Each of the 9 groups contains seven states representing the seven possible positions in the helix
  • States in one group are linked to the state at the following helix position in the next group

• Profile technique
  • Position-specific scores are used to describe aligned families of protein sequences
  • The drawback is the reliance on ad hoc scoring schemes
• The profile HMM was developed to capture the information in an alignment

Profile HMM

1 2     3
W H . . E n
W H . . Y .
W - . . E .
S H . . E .
T H e . Y .
W H e r E .

• Simulates the human nervous system
  • Neurons and synapses
  • A neuron outputs a real number between 0 and 1
• Feedforward network
  • Typically 10-20 residues are input
  • Usually used in supervised learning

Neural Networks

• Each connection from an input to a neuron has a positive or negative weight $w_{ij}$
• Total input $x_j = \sum_i w_{ij} y_i$
• Output $y_j = g(x_j)$
• A sigmoid function: $g(x_j) = 1 / [1 + \exp(-x_j)]$
• A single output with multiple inputs is called a perceptron

Single Neuron
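A single neuron with a sigmoid activation can be written in a few lines. The function names and the example weights are hypothetical and serve only to illustrate the weighted sum followed by g.

```python
from math import exp

def sigmoid(x):
    """g(x) = 1 / (1 + exp(-x)), mapping any real input to the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

def neuron_output(weights, inputs):
    """y_j = g(x_j) with total input x_j = sum_i w_ij * y_i."""
    total = sum(w * y for w, y in zip(weights, inputs))
    return sigmoid(total)

# Hypothetical weights and inputs.
print(neuron_output([0.8, -0.4], [1.0, 0.5]))
```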

• Two inputs, one output
• Trained by the input/output examples that follow
  • Total input $x_j = w_1 y_1 + w_2 y_2 + w_0$ ($w_0$ is the bias)
  • Assume a step function for $g(x_j)$

Perceptron Example

(y1, y2) → y
(0, ½) → 1
(1, 1) → 1
(1, ½) → 0
(0, 0) → 0

$w_2/2 + w_0 > 0$
$w_1 + w_2 + w_0 > 0$
$w_1 + w_2/2 + w_0 < 0$
$w_0 < 0$

• Visualize
• Can pick the boundary $y_2 = 1/4 + y_1/2$
  • Output is 1 when $-1/4 - y_1/2 + y_2 > 0$
  • $w_0 = -1/4$, $w_1 = -1/2$, $w_2 = 1$

Perceptron Example
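The four training examples can be checked against a step-function perceptron. The weights below are read off the boundary inequality $-1/4 - y_1/2 + y_2 > 0$, that is, bias $w_0 = -1/4$, $w_1 = -1/2$, $w_2 = 1$; the function and variable names are hypothetical.

```python
def perceptron(y1, y2, w0=-0.25, w1=-0.5, w2=1.0):
    """Step-function perceptron: output 1 if w1*y1 + w2*y2 + w0 > 0, else 0."""
    total = w1 * y1 + w2 * y2 + w0
    return 1 if total > 0 else 0

# Training examples from the slide: (y1, y2) -> y.
EXAMPLES = {(0, 0.5): 1, (1, 1): 1, (1, 0.5): 0, (0, 0): 0}

for (y1, y2), target in EXAMPLES.items():
    print((y1, y2), perceptron(y1, y2), target)
```

Any weights satisfying the four inequalities classify the examples correctly; this choice is one of many.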


• Backpropagation
  • Error at the output layer percolates down to the input layer
• Weights are adjusted
  • Based on the gradient descent method

Learning Algorithm

• Protein structure prediction by PROF
• Input layer
  • Sliding 15-residue window
  • Predicts the secondary structure of the central residue
  • Each residue has 20 input nodes
• Hidden layer
  • Connected to ALL input and output nodes

NN Application

• Intron prediction
  • An intron splice site spans 15-60 nt
• Organisms have unique codon usages at donor sites

NN Application

• Suppose you want to search for a pattern in data
  • In general, a pattern is easy to spot or to program for
  • But when a particular region has 55% GC content whereas neighboring regions have only 45%, the differing regions are easy to miss
  • Even when differences are noted, it is difficult to set the boundaries
• Different regions may have biological significance, with one region representing a coding region
• GC content may be of interest, but what about content that may be significant but is not yet recognized?
• A probabilistic method may work
  • In particular, machine learning

Pattern Matching

• Machine learning model
  • A black box with inputs and outputs
  • The black box is adjusted through parameters
  • Simplest form: yes or no output
• Initially, the model parameters are random
  • Outputs will be random as well, but the model can be trained to adjust the parameters to fit the data
  • Typically, the known data set is divided into a training set and a test set
• Models
  • Probabilistic models
  • Hidden Markov Model
  • Neural Network
  • SVM

Machine Learning

• Likelihood ratios
• Example: predict helices and loops in a protein
  • Known information: helices have a high content of hydrophobic residues
  • $p_a^h$ and $p_a^l$: frequencies of AA a in helices and in loops
  • $L_h$ and $L_l$: likelihoods that a sequence of N AAs is in a helix or a loop
  • $L_h = \prod_{i=1}^{N} p_{x_i}^h$, $L_l = \prod_{i=1}^{N} p_{x_i}^l$
• Rather than the likelihoods themselves, their ratio is more informative
  • $L_h / L_l$: is the sequence more or less likely to be a helical or a loop region?
  • $S = \ln(L_h / L_l) = \sum_{i=1}^{N} \ln(p_{x_i}^h / p_{x_i}^l)$: positive for a helical region
• Partition a sequence into N-AA segments (N=300)

Probabilistic Models

• The previous example has two hypotheses (helix or loop)
  • The sequence is described by models 0 and 1
  • Models 0 and 1 are defined by $p^h$ and $p^l$
• Generalize to k hypotheses: models $M_k$ (k = 0, 1, 2, …)
• Given a test dataset D, what is the probability that D is described by each of the models?
  • Known information: prior probabilities $P_{prior}(M_k)$ for each model, from other information sources
  • Compute the likelihood of D according to each of the models: $L(D \mid M_k)$
• Of interest is not the probability of D arising from $M_k$ but the probability of D being described by $M_k$
  • Namely, $P_{post}(M_k \mid D) \propto L(D \mid M_k) P_{prior}(M_k)$: the posterior probability
  • $P_{post}(M_k \mid D) = L(D \mid M_k) P_{prior}(M_k) / \sum_i L(D \mid M_i) P_{prior}(M_i)$
  • => Bayesian probability

Prior and Posterior Probs.

• Basic principles
  • We make inferences using posterior probabilities
  • If the posterior probability of one model is much higher than the others, it can be chosen as the best model with confidence
• Special case: two models
  • Two prior probabilities: $P_{prior}^0$, $P_{prior}^1$
  • $P_{post}^i = L_i P_{prior}^i / (L_0 P_{prior}^0 + L_1 P_{prior}^1)$
  • Log-odds score:
    $S' = \ln(L_1 P_{prior}^1 / L_0 P_{prior}^0) = \ln(L_1 / L_0) + \ln(P_{prior}^1 / P_{prior}^0) = S + \ln(P_{prior}^1 / P_{prior}^0)$
  • The difference between $S'$ and $S$ is simply an additive constant, so the ranking is identical whether we use $S'$ or $S$
  • Warning: if $P_{prior}^1$ is small, $S$ has to be high to make $S'$ positive
  • When $P_{prior}^0 = P_{prior}^1$, $S' = S$
  • $P_{post}^1 = 1 / (1 + L_0 P_{prior}^0 / (L_1 P_{prior}^1)) = 1 / (1 + \exp(-S'))$
  • $S' = 0 \to P_{post}^1 = 1/2$; $S'$ large and positive $\to P_{post}^1 \approx 1$

Bayesian Prob.
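For the two-model case, the posterior follows directly from the log-odds score and the prior. The sketch below assumes exactly two models, so that the prior of model 0 is one minus the prior of model 1; the function name and example numbers are hypothetical.

```python
from math import exp, log

def posterior_model1(score, prior1):
    """Posterior of model 1 from S = ln(L1/L0) and its prior, assuming prior0 = 1 - prior1."""
    s_prime = score + log(prior1 / (1.0 - prior1))   # S' = S + ln(prior1 / prior0)
    return 1.0 / (1.0 + exp(-s_prime))

# With equal priors S' = S; a small prior for model 1 demands a larger S.
print(posterior_model1(2.0, 0.5))
print(posterior_model1(2.0, 0.1))
```

The second call illustrates the warning on the slide: the same score S supports model 1 much less strongly when its prior is small.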

• How to estimate parameters from data when they are continuous?
  • In the helical/loop example, how do we determine $p^h$ and $p^l$?
• In the same example, let AA a be observed $n_a$ times in a sequence of $n_{tot}$ AAs
  • The likelihood of this occurring is $L = \prod_{a=1}^{20} p_a^{n_a}$
  • The simplest way of choosing $p_a$ is to use $n_a / n_{tot}$ (which indeed maximizes L)
• Generalize to a probabilistic model for a profile
  • Want to develop a position-specific score system
  • K aligned sequences of length N (ungapped)
  • ML frequencies at site i are $p_{ia} = n_{ia} / K$
  • Let $x_i$ be the AA at site i in a new sequence; its likelihood is $L_{profile} = \prod_{i=1}^{N} p_{i x_i}$
  • Need a model to compare with: one that is not position-specific, $L_0 = \prod_{i=1}^{N} p_{x_i}$
  • $S = \ln(L_{profile} / L_0) = \sum_{i=1}^{N} \ln(p_{i x_i} / p_{x_i})$

Profile Model
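A position-specific log-odds score can be sketched as follows. Pure ML frequencies would give zero probability to any residue not seen in a column, so this sketch adds a pseudo-count of 1 per residue; that choice, the uniform background, the toy alignment, and the function name are all assumptions for illustration.

```python
from collections import Counter
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def profile_score(alignment, background, new_seq, pseudo=1.0):
    """S = sum_i ln(p_{i,x_i} / p_{x_i}) for new_seq against an ungapped alignment."""
    k = len(alignment)
    score = 0.0
    for i, residue in enumerate(new_seq):
        column = Counter(seq[i] for seq in alignment)                        # n_ia
        p_site = (column[residue] + pseudo) / (k + pseudo * len(background))
        score += log(p_site / background[residue])
    return score

# Hypothetical toy alignment and a uniform background.
ALIGNMENT = ["WHE", "WHY", "WHE", "SHE"]
BACKGROUND = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}

print(profile_score(ALIGNMENT, BACKGROUND, "WHE"))
print(profile_score(ALIGNMENT, BACKGROUND, "AAA"))
```

A positive score means the new sequence fits the profile better than the background model; a negative score means the opposite.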

• Given a model of sequence evolution and a proposed tree structure, compute the likelihood that the known sequences would have evolved on that tree

• ML chooses the tree that maximizes this likelihood

• Three kinds of parameters
  • Tree topology
  • Branch lengths
  • Values of the parameters in the rate matrix

Maximum Likelihood (ML) Phylogeny

• Given a model of sequence evolution at a site
• Likelihood of ancestor X: $L(X) = P_{XA}(t_1) P_{XG}(t_2)$
• $L(Y) = P_{YG}(t_4) \sum_X L(X) P_{YX}(t_3)$
• $L(W) = \sum_Y \sum_Z L(Y) P_{WY}(t_5) L(Z) P_{WZ}(t_6)$
• Total likelihood for the site:
  • $L = \sum_W p_W L(W)$
  • $p_W$: equilibrium probability
• Is equal to posterior prob. of different clades

What is Likelihood in ML Tree ?

[Figure: tree with root W, internal nodes X, Y, and Z, tips A, G, G, T, T, and branch lengths t1 to t6]
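The recursive likelihood calculation can be sketched for a single site. The lecture does not specify a substitution model, so this sketch assumes the Jukes-Cantor model with equal equilibrium frequencies; the tree encoding, all branch lengths, and the placement of the two T tips under Z are hypothetical.

```python
from math import exp

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a changing to base b along a branch of length t."""
    e = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def conditional(node):
    """Return {base: L(base)} for a node.

    A tip is a one-letter string; an internal node is a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        sub = conditional(child)
        for b in BASES:
            result[b] *= sum(jc_prob(b, c, t) * sub[c] for c in BASES)
    return result

def site_likelihood(root):
    """L = sum_W p_W L(W), with p_W = 1/4 under Jukes-Cantor."""
    return sum(0.25 * l for l in conditional(root).values())

# Tree shaped like the one in the figure, with hypothetical branch lengths of 0.1.
X = (("A", 0.1), ("G", 0.1))
Y = ((X, 0.1), ("G", 0.1))
Z = (("T", 0.1), ("T", 0.1))
W = ((Y, 0.1), (Z, 0.1))
print(site_likelihood(W))
```

The full likelihood of an alignment is the product of such site likelihoods, usually accumulated as a sum of logs.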

• We need to generate a large sample of trees in which the probability of finding a tree is proportional to its likelihood × prior probability
• MCMC is used to generate such trees
• Metropolis Algorithm
  • Start with a trial tree and compute its likelihood, $L_1$
  • Make a slight change (change a branch length, move a vertex, ...)
  • Compute $L_2$ for the modified tree
  • If $L_2 > L_1$, the new tree is accepted
  • Otherwise, the new tree is accepted with probability $L_2 / L_1$
  • If rejected, continue from the tree with $L_1$
• Hill-climbing, but downhill moves are also allowed
• Generates trees with percentages

Markov Chain Monte Carlo (MCMC) Tree
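The accept/reject rule of the Metropolis algorithm can be sketched generically. Real tree proposals are more involved, so this sketch replaces the tree with an arbitrary state; the function name, the toy likelihood over a single branch length, and the proposal step are hypothetical, and the current likelihood is assumed to be strictly positive.

```python
import random
from math import exp

def metropolis(initial, likelihood, propose, steps, seed=0):
    """Return the chain of states visited, including the initial one."""
    rng = random.Random(seed)
    current, l1 = initial, likelihood(initial)
    sample = [current]
    for _ in range(steps):
        candidate = propose(current, rng)
        l2 = likelihood(candidate)
        # Accept uphill moves always, downhill moves with probability L2 / L1.
        if l2 > l1 or rng.random() < l2 / l1:
            current, l1 = candidate, l2
        sample.append(current)   # on rejection the current state is recorded again
    return sample

# Toy stand-in for a tree: one branch length t with a made-up likelihood peaked at t = 1.
chain = metropolis(
    initial=0.2,
    likelihood=lambda t: exp(-(t - 1.0) ** 2),
    propose=lambda t, rng: t + rng.uniform(-0.1, 0.1),
    steps=1000,
)
print(sum(chain) / len(chain))
```

Recording the current state again after a rejection is what makes the visit frequencies proportional to the target probability.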
