Sequence Analysis '18, lecture 20

Upload: others

Post on 11-Oct-2020


• MEME
• Gibbs sampling
• K-means
• ROC
• modular HMMs
• Gene finding

Why are there motifs?

Selective pressure for:

• structure -- protein motifs: folding units, fibrous proteins, coiled coils, transmembrane helices
• function -- protein motifs: active site, binding motifs, signal sequences
• expression -- DNA motifs: transcription regulation, chromatin binding

Example: selection for structure

Zinc finger motif

[Figure: zinc finger. Two Cys and two His residues coordinate a Zn ion; the remaining positions are x (any residue).]

C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H

• two cysteines separated by 2 to 4 residues
• loop must be length 12; the 4th position in the loop must be hydrophobic
• two histidines separated by 3 to 5 residues

Example: selection for function

• ER targeting sequence: [KRHQSA]-[DENQ]-E-L
• N-glycosylation: N-{P}-[ST]-{P}
• Tyrosine phosphorylation: [RK]-x(2)-[DE]-x(3)-Y or [RK]-x(3)-[DE]-x(2)-Y
• C-terminal prenylation: C-{DENQ}-[LIVM]-x
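The patterns on these slides are written in PROSITE syntax, which maps almost directly onto regular expressions. The sketch below is a minimal converter, not a full PROSITE parser: the function name `prosite_to_regex` is made up for illustration, and it assumes only the elements seen above (single residues, `x`, `[...]` classes, `{...}` exclusions, and `(n)` or `(n,m)` repeats), ignoring anchors such as `<` and `>`.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into a Python regex string."""
    pieces = []
    for element in pattern.split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            piece = "."                        # any residue
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"    # any residue except these
        else:
            piece = core                       # single residue or [ABC] class
        if lo:
            piece += "{" + lo + ("," + hi if hi else "") + "}"
        pieces.append(piece)
    return "".join(pieces)

zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
glycosylation = prosite_to_regex("N-{P}-[ST]-{P}")
```

A converted pattern can then be used with `re.search` or `re.finditer` on a protein sequence string.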


What if you don’t know the pattern?

...and you also don't know where it is?

MEME: Multiple EM for Motif Elicitation (motif elucidation by maximizing expectation)

• T. L. Bailey & C. Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", ISMB, 2:28--36, 1994

1. Calculate the motif model given the locations of the motif.

2. Calculate the locations of the motif given the motif model.

3. Repeat until converged.

Initial guess of motif location ...and therefore of the motif

AGCTAGCTTCTCGTGA
TCTCGAGTGGCGCATG
TATTGCTCTCCGCAGC

(initial guesses are underlined on the slide)

From the motif locations, you make a profile model.

Motif model: L=4, consensus T C C G at positions 1 2 3 4

P1 = 2/3 T, 1/3 G

Calculate the probability score for each position

AGCTAGCTTCTCGTGA

From the profile model and the sequence, get probability scores. Slide the model along the sequence to get the next score.

Window at 1 (AGCT): P = P1(A)P2(G)P3(C)P4(T) = (0)(.33)(.67)(0) = 0
Window at 2 (GCTA): P = P1(G)P2(C)P3(T)P4(A) = (.33)(.67)(.33)(0) = 0
Window at 3 (CTAG): P = P1(C)P2(T)P3(A)P4(G) = (0)(0)(0)(.67) = 0
Window at 9 (TCTC): P = P1(T)P2(C)P3(T)P4(C) = (.67)(.67)(.33)(.33) = 0.05
Window at 13 (GTGA): P = P1(G)P2(T)P3(G)P4(A) = (.33)(0)(0)(0) = 0
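The profile-building and sliding-window steps above can be sketched in a few lines of Python. The underlining on the slide did not survive, so the initial guesses used here (TCTC, GGCG, TCCG) are an assumption; they are chosen because they reproduce P1 = 2/3 T, 1/3 G and the column probabilities used in the scores. The function names are illustrative only.

```python
from collections import Counter

SEQS = ["AGCTAGCTTCTCGTGA", "TCTCGAGTGGCGCATG", "TATTGCTCTCCGCAGC"]
GUESSES = ["TCTC", "GGCG", "TCCG"]  # assumed initial guesses, one per sequence

def build_profile(sites):
    """Column-wise base frequencies of equally weighted motif sites."""
    profile = []
    for k in range(len(sites[0])):
        counts = Counter(site[k] for site in sites)
        profile.append({b: counts[b] / len(sites) for b in "ACGT"})
    return profile

def window_scores(seq, profile):
    """Probability of each length-L window of seq under the profile."""
    L = len(profile)
    scores = []
    for start in range(len(seq) - L + 1):
        p = 1.0
        for k in range(L):
            p *= profile[k][seq[start + k]]
        scores.append(p)
    return scores

profile = build_profile(GUESSES)
scores = window_scores(SEQS[0], profile)  # 13 windows for a 16-mer with L=4
```

Without pseudocounts, any window containing a base never seen in that column scores exactly zero, which is why only the TCTC window of the first sequence gets a nonzero score.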

Calculate the probability score for each position

Do every sequence.

AGCTAGCTTCTCGTGA
TCTCGAGTGGCGCATG
TATTGCTCTCCGCAGC

Re-Calculate the motif model

AGCTAGCTTCTCGTGA: 1.0 TCTC
TCTCGAGTGGCGCATG: 0.5 TCTC, 0.5 GGCG
TATTGCTCTCCGCAGC: 0.1 GCTC, 0.3 TCTC, 0.6 TCCG

Probabilities are normalized to sum to one for each sequence, since we expect exactly one motif per sequence.

The new model is the profile built from the hits.

Recalculating the profile from the hits

P1(T) = the probability of T in the first position = the sum of the scores for sequences with T in the first position, normalized.

P1(T) = (1.0 + 0.5 + 0.3 + 0.6) / (1.0 + 0.5 + 0.5 + 0.1 + 0.3 + 0.6) = 2.4 / 3.0 = 0.8

So P1(T) = 0.8 and P1(G) = 0.2.
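The weighted re-estimation can be checked with a short sketch. The hit weights are the ones listed on the slides; the function name `weighted_profile` is illustrative.

```python
# (weight, site) pairs taken from the slides; weights sum to 1 per sequence.
HITS = [
    (1.0, "TCTC"),
    (0.5, "TCTC"), (0.5, "GGCG"),
    (0.1, "GCTC"), (0.3, "TCTC"), (0.6, "TCCG"),
]

def weighted_profile(hits):
    """Each hit contributes its weight to the base it shows in each column."""
    total = sum(w for w, _ in hits)
    profile = [{b: 0.0 for b in "ACGT"} for _ in range(len(hits[0][1]))]
    for w, site in hits:
        for k, base in enumerate(site):
            profile[k][base] += w / total
    return profile

new_profile = weighted_profile(HITS)
p1_t = new_profile[0]["T"]  # 2.4 / 3.0
p1_g = new_profile[0]["G"]  # 0.6 / 3.0
```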

Do it again: Re-Calculate the probability scores using the refined model

AGCTAGCTTCTCGTGA: 1.0 TCTC
TCTCGAGTGGCGCATG: 0.9 TCTC, 0.1 GGCG
TATTGCTCTCCGCAGC: 0.1 GCTC, 0.6 TCTC, 0.3 TCCG

The new model is the profile built from the hits.

...and again, until converged.

AGCTAGCTTCTCGTGA: 1.0 TCTC
TCTCGAGTGGCGCATG: 1.0 TCTC, 0.0 GGCG
TATTGCTCTCCGCAGC: 0.0 GCTC, 0.95 TCTC, 0.05 TCCG

A summary of the exercise:

AGCTAGCTTCTCGTGA
TCTCGAGTGGCGCATG
TATTGCTCTCCGCAGC

EM converges on the conserved pattern if the initial guess was not too far off.

If the true motif was not one of the initial guesses, or some combination of the initial guesses, then EM would never find the true motif.

Pseudocounts, just in case

Hits: 1.0 TCTC, 0.5 TCTC, 0.5 GGCG, 0.1 GCTC, 0.3 TCTC, 0.6 TCCG

No A is observed in the first position, but if we set P1(A) = 0, then we “rule out” a motif with A in the first position. Instead, P1(A) = a small pseudocount value / sum of the weights:

P1(A) = ε / (1.0 + 0.5 + 0.5 + 0.1 + 0.3 + 0.6)

This is especially important in the initial guesses, so that the true motif is not missed.

Pseudocounts may be decreased or removed (ε=0) in later stages.
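A minimal sketch of the pseudocount idea follows. It differs slightly from the slide's formula: ε is added to every base in every column and the column is renormalized by (sum of weights + 4ε), so that each column still sums to one. That normalization and the name `weighted_profile_pseudo` are assumptions made for this example.

```python
HITS = [
    (1.0, "TCTC"),
    (0.5, "TCTC"), (0.5, "GGCG"),
    (0.1, "GCTC"), (0.3, "TCTC"), (0.6, "TCCG"),
]

def weighted_profile_pseudo(hits, eps=0.01):
    """Weighted profile in which every base starts with a pseudocount eps."""
    total = sum(w for w, _ in hits) + 4 * eps
    profile = [{b: eps for b in "ACGT"} for _ in range(len(hits[0][1]))]
    for w, site in hits:
        for k, base in enumerate(site):
            profile[k][base] += w
    return [{b: c / total for b, c in col.items()} for col in profile]

smoothed = weighted_profile_pseudo(HITS, eps=0.01)
unsmoothed = weighted_profile_pseudo(HITS, eps=0.0)  # reduces to the plain profile
```

With `eps > 0`, a window with A in the first position gets a small but nonzero score instead of being ruled out.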

Gibbs Sampling

Stochastic version of MEME.

Radius of convergence is wider than MEME. Doesn’t need to start with one correct guess.

Gibbs sampling

1. Calculate the motif model given the locations of the motif.

2. Calculate the locations of the motif given the motif model.

3. Repeat until converged.

Expectation step

Start from a random alignment. Select window size and position. Slide one sequence through the window. Calculate scores.

AGCTAGCTTCTCGTGA
TCTCGAGTGGCGCATG
TATTGCTCTCCGCAGC

Slide the first sequence through the motif window and calculate the score (keep the scoring window fixed, move the sequence).

[Figure: score vs. aligned position]


Example

[Figure: score vs. aligned position for each sequence in turn]

Select an aligned position at random from the score distribution.

Do next sequence, and so on, cycling through the sequences many times.

Convergence is when there are no more changes.

[Figure: the three sequences aligned at the converged motif position]

Exactly one segment is aligned to the motif region at each step.

Gibbs Sampling

Stochastic version of MEME.

(1) Choose initial (or random) guesses of motif locations.

(2) Calculate the motif model (w/ or w/o pseudocounts/noise) from the current motif positions.

(3) Slide one sequence against the model. Calculate probability score at each position.

(4) Randomly choose a motif position from the probability distribution.

(5) Repeat until converged.

Radius of convergence is wider than MEME. Doesn’t need to start with a correct guess.
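The five steps above can be sketched as follows. This is a toy version under stated assumptions: the model is built from all current sites as on the slide (many published samplers instead leave out the sequence being resampled), pseudocounts are always on so that every window has nonzero weight, and a fixed number of sweeps stands in for a real convergence test. All names (`gibbs_motif`, `sweeps`, `eps`) are illustrative.

```python
import random

def build_profile(sites, eps=0.1):
    """Column frequencies with a pseudocount eps for every base."""
    total = len(sites) + 4 * eps
    return [
        {b: (sum(s[k] == b for s in sites) + eps) / total for b in "ACGT"}
        for k in range(len(sites[0]))
    ]

def window_scores(seq, profile):
    """Probability score of each window of seq under the profile."""
    L = len(profile)
    scores = []
    for start in range(len(seq) - L + 1):
        p = 1.0
        for k in range(L):
            p *= profile[k][seq[start + k]]
        scores.append(p)
    return scores

def gibbs_motif(seqs, L, sweeps=200, seed=0):
    rng = random.Random(seed)
    # (1) random initial motif locations
    pos = [rng.randrange(len(s) - L + 1) for s in seqs]
    for _ in range(sweeps):                      # (5) repeat
        for i, seq in enumerate(seqs):
            # (2) motif model from the current positions
            sites = [s[p:p + L] for s, p in zip(seqs, pos)]
            profile = build_profile(sites)
            # (3) score every position of one sequence
            scores = window_scores(seq, profile)
            # (4) sample a new position in proportion to its score
            pos[i] = rng.choices(range(len(scores)), weights=scores)[0]
    return pos

SEQS = ["AGCTAGCTTCTCGTGA", "TCTCGAGTGGCGCATG", "TATTGCTCTCCGCAGC"]
positions = gibbs_motif(SEQS, L=4)
sites = [s[p:p + 4] for s, p in zip(SEQS, positions)]
```

Because step (4) samples rather than taking the best window, different seeds can give different trajectories; running several seeds and comparing the resulting sites is a common sanity check.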

Transcription factor binding sites

Example: We know that upstream of every gene that is expressed in the presence of lysine, there must be a signal for a transcription factor (TF). But we don’t know which TF and we don’t know where it binds.

Example: selection for expression

Transcription factor binding site

[Figure: a gene and the region upstream of the gene. The upstream site is what we want to find: a pattern, motif, signature, footprint.]

NOTE: It’s palindromic in this case. Palindromy in TF footprints (binding sites) is due to the symmetry of the TFs, which are almost invariably dimeric.

What is Expectation/Maximization?

EM is any method that iterates between an “expectation” step and a “maximization” step, starting with a statistical model and a set of data.

• Expectation: Calculate the expected values for the parameters of the model, using the current model and the data.

• Maximization: Replace the parameters of the model with their expected values.

MEME is an EM algorithm. Gibbs sampling is not.

Finding recurrent sequences by K-means clustering

Non-homologous proteins sharing a recurrent sequence:

HDFPIEGGDSPMQTIFFWSNANAKLSHGY
CPYDNIWMQTIFFNQSAAVYSVLHLIFLT
IDMNPQGSIEMQTIFFGYAESA
ELSPVVNFLEEMQTIFFISGFTQTANSD
INWGSMQTIFFEEWQLMNVMDKIPS
IFNESKKKGIAMQTIFFILSGR
PPPMQTIFFVIVNYNESKHALWCSVD
PWMWNLMQTIFFISQQVIEIPS
MQTIFFVFSHDEQMKLKGLKGA

Short, recurrent sequence patterns may exist in different proteins because they are required to initiate folding.

Is it a recurrent structure?

Bystroff C & Baker D. (1998). Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281, 565-77.

Step 1: Compile a database of sequence profiles from MSAs

Each dot represents a segment of a profile from a MSA from a BLAST search.

Step 2: Cluster protein sequence profiles

Each circle represents a cluster of profiles (same length).

K-means clustering

(1) Choose K.

(2) Randomly select K centers in the metric space.

(3) Get the distance from each center to each data point.

(4) Assign each data point to the nearest center.

(5) Calculate the new centers using the center-of-mass of the data points.

(6) Repeat from Step 3 until converged.

Final positions of the centers define K clusters of data points.
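The six steps above can be sketched in plain Python. Two choices here are assumptions rather than part of the slide: the initial centers are drawn from the data points themselves (one common way to pick "random centers in the metric space"), and Euclidean distance is used. The function name `kmeans` and the example points are made up for illustration.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # (2) random initial centers
    assign = None
    for _ in range(max_iter):
        # (3)+(4) assign each point to its nearest center
        new_assign = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assign == assign:              # (6) no change: converged
            break
        assign = new_assign
        # (5) move each center to the center of mass of its members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assign

# Hypothetical 2-D data with two visibly separate groups.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assign = kmeans(POINTS, k=2)
```

Because the result depends on the starting centers, it is common to rerun with several seeds and keep the clustering with the smallest total within-cluster distance.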

Example: K=2

[Figure: successive iterations of K-means with K=2, ending when the assignments no longer change]

No change. Converged. Final cluster centers.


K-means clustering

(1) Requires knowledge of K, the number of classes

(2) Requires the objects to exist in a “metric space”.

(3) Stochastic. Depends on the starting points.

Distance/similarity metrics for profiles. We tried them all.

In all four, j runs over positions and i runs over amino acids.

(1) Manhattan, or City-Block metric (a distance metric)

$$D(p,q) = \sum_{j}^{\text{positions}} \sum_{i}^{\text{amino acids}} \left| P(p_{ij}) - P(q_{ij}) \right|$$

(2) Entropy (a similarity metric) NOTE: not symmetrical!

$$S(p,q) = \sum_{j}^{\text{positions}} \sum_{i}^{\text{amino acids}} p_{ij} \log(q_{ij})$$

(3) Correlation (a similarity metric)

$$S(p,q) = \frac{\sum_{j}\sum_{i} (p_{ij}-\bar{p})(q_{ij}-\bar{q})}{\sqrt{\sum_{j}\sum_{i} (p_{ij}-\bar{p})^{2} \; \sum_{j}\sum_{i} (q_{ij}-\bar{q})^{2}}} = \frac{\sum_{j}\sum_{i} (p_{ij}-\bar{p})(q_{ij}-\bar{q})}{\sigma_{p}\,\sigma_{q}}$$

(4) Dpq (similarity metric)

$$D(p,q) = \sum_{j}^{\text{positions}} \sum_{i}^{\text{amino acids}} \mathrm{LLR}(p_{ij})\,\mathrm{LLR}(q_{ij})$$
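The first three metrics are easy to try on toy profiles. In the sketch below a profile is a list of columns, each column a list of probabilities; the example uses a 4-letter alphabet purely to keep it short. The small constant `tiny` guarding `log(0)` is an assumption of this example, and the LLR metric is omitted because it needs background frequencies that are not given here.

```python
import math

def manhattan(p, q):
    """City-block distance summed over positions and residues."""
    return sum(abs(a - b) for cp, cq in zip(p, q) for a, b in zip(cp, cq))

def entropy_similarity(p, q, tiny=1e-9):
    """Sum of p_ij * log(q_ij); note the two arguments are not interchangeable."""
    return sum(a * math.log(b + tiny) for cp, cq in zip(p, q) for a, b in zip(cp, cq))

def correlation(p, q):
    """Pearson correlation over all (position, residue) entries."""
    fp = [a for col in p for a in col]
    fq = [b for col in q for b in col]
    mp, mq = sum(fp) / len(fp), sum(fq) / len(fq)
    num = sum((a - mp) * (b - mq) for a, b in zip(fp, fq))
    den = math.sqrt(sum((a - mp) ** 2 for a in fp) * sum((b - mq) ** 2 for b in fq))
    return num / den

# Two hypothetical 2-position profiles over a 4-letter alphabet.
P = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
Q = [[0.1, 0.7, 0.1, 0.1], [0.4, 0.2, 0.2, 0.2]]

d_pq = manhattan(P, Q)
s_pq, s_qp = entropy_similarity(P, Q), entropy_similarity(Q, P)
r_pp = correlation(P, P)
```

The asymmetry of the entropy measure shows up directly: swapping the arguments changes the value, whereas the Manhattan distance and the correlation are symmetric.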

Step 3: Refine using supervised learning

• Start from the sequence profile of a cluster (the training set).
• Search the database for the 400 nearest neighbors.
• Remove all cluster members that do not conform with the paradigm.

We want this profile to predict... as long as it is consistent with this structure.

Supervised learning finds predictive correlations between two spaces (sequence space and structure space).

Results: I-sites library

• diverging type-2 turn
• Serine hairpin
• Proline helix C-cap
• alpha-alpha corner
• glycine helix N-cap
• Frayed helix
• Type-I hairpin

Amino acids arranged from non-polar to polar.

Backbone angles: ψ=green, φ=red

How do you compare two predictive models?

Accuracy = percent of the predictions that are correct, of the ones that were made.

Coverage = fraction of the possible predictions that were actually predicted.

Confidence = a score to sort the predictions. A more confident prediction should be a more accurate one.

Outcomes: T+ (true positive), F+ (false positive), T- (true negative), F- (false negative).

Selectivity = Accuracy = T+/(T+ + F+). OK. But accuracy depends on coverage.

Sensitivity = Coverage = T+/(T+ + F-). OK. But coverage depends on accuracy.

Does the confidence of a prediction match its accuracy?
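The trade-off between the two numbers can be seen by moving a confidence cutoff. The sketch below uses a made-up list of (score, is_true) predictions; the function name `selectivity_sensitivity` and the data are hypothetical.

```python
def selectivity_sensitivity(predictions, cutoff):
    """predictions: list of (confidence, is_true); predict '+' when confidence >= cutoff."""
    tp = sum(1 for s, t in predictions if s >= cutoff and t)       # T+
    fp = sum(1 for s, t in predictions if s >= cutoff and not t)   # F+
    fn = sum(1 for s, t in predictions if s < cutoff and t)        # F-
    selectivity = tp / (tp + fp)   # accuracy of the predictions made
    sensitivity = tp / (tp + fn)   # coverage of the true cases
    return selectivity, sensitivity

# Hypothetical data, for illustration only.
PRED = [(0.9, True), (0.8, False), (0.7, True), (0.2, True)]

loose = selectivity_sensitivity(PRED, cutoff=0.5)    # more coverage, less accuracy
strict = selectivity_sensitivity(PRED, cutoff=0.85)  # more accuracy, less coverage
```

Because each number can be improved at the expense of the other by moving the cutoff, a single pair of values is not enough to compare two methods.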

Better way to compare two predictive models

Receiver Operating Characteristic (ROC)

• A way to describe the whole set of scores with a single number.

• Each score has a T or F.

• Sort the scores.

• Starting from the highest scoring, draw a vector up for a true, to the right for a false.

• Calculate ROC = the normalized area under this curve.

• If all of the true scores are greater than the greatest false score, then ROC = 1.0.

• 0 ≤ ROC ≤ 1

ROC score

0.990 T
0.978 F
0.972 T
0.966 T
0.951 T
0.902 F
0.880 F
0.811 F
0.803 F
0.792 T
0.766 F
0.751 F
0.723 F
0.696 F
0.688 T
0.666 F
0.651 F
0.623 F
0.596 F
0.488 T

Sort the scores; for each score move up one if it is true, right one if it is false.

The area under the curve, divided by the total, is the ROC score. 0 ≤ ROC ≤ 1.

[Figure: ROC curve; x axis: number of falses, y axis: number of trues]
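The up-for-true, right-for-false construction translates into a short loop: every step to the right adds a column whose height is the number of trues seen so far. The sketch below applies it to the scores listed on this slide; the function name `roc_score` is illustrative, and tied scores are not treated specially.

```python
def roc_score(scored):
    """scored: list of (score, is_true). Returns the normalized area under the ROC curve."""
    height = 0   # number of trues passed so far (moves up)
    area = 0     # accumulated area (each false moves right by one)
    for _, is_true in sorted(scored, key=lambda x: x[0], reverse=True):
        if is_true:
            height += 1
        else:
            area += height
    n_true = height
    n_false = len(scored) - n_true
    return area / (n_true * n_false)

# Scores and labels as listed on the slide.
SLIDE = [
    (0.990, True), (0.978, False), (0.972, True), (0.966, True), (0.951, True),
    (0.902, False), (0.880, False), (0.811, False), (0.803, False), (0.792, True),
    (0.766, False), (0.751, False), (0.723, False), (0.696, False), (0.688, True),
    (0.666, False), (0.651, False), (0.623, False), (0.596, False), (0.488, True),
]

roc = roc_score(SLIDE)  # area 61 out of 7 * 13 = 91
```

Since only the ordering of the scores matters, the same function works for methods whose scores are on different scales.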

In class exercise: calculate ROC score

Method A: 0.811 T, 0.972 T, 0.766 T, 0.990 F, 0.966 T, 0.951 F, 0.803 F, 0.792 F, 0.503 F, 0.978 T, 0.478 F

Method B: 4 T, 39 F, 44 T, 44 T, 40 T, 1 F, 39 F, 29 F, 10 F, 44 F, 45 T

Which method is better?

Motifs can be words in a word-HMM

I-sites word HMM = HMMSTR! Hidden Markov Model for local protein STRucture

HMM of linked I-sites motifs. Each node is one amino acid.

Size of HMMSTR: 282 nodes, 317 transitions

(Bystroff et al., JMB 2000)

Gene finding: Splicing

HMMs can be modular

[Figure: two HMM modules, each with emitting states A, C, G, T, joined by a non-emitting connector state]


Gene Prediction: Computational Challenge

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg

Gene!

Thanks to Pavel Pevzner (UCSD) and Chris Burge (MIT) for slides and images.

Two Approaches to Gene Prediction

• Statistical: coding segments (ORFs or exons) have typical sequences on either end and use different subwords than non-coding segments (introns or intergenic regions).

• Similarity-based: many human genes are similar to genes in mice, chicken, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes.

Gene Prediction and Motifs

• Upstream regions of genes often contain motifs that can be used for gene prediction.

• Reading frame should be “long enough”.

[Figure: promoter layout along an axis marked -35, -10, 0, 10. Labels: TTCCAA; TATACT (Pribnow Box); transcription start site; GGAGG (ribosomal binding site); ATG; STOP]

Promoter Structure in Prokaryotes (E. coli)

Transcription starts at offset 0.

• Pribnow Box (-10)

• Gilbert Box (-30)

• Ribosomal Binding Site (+10)


Ribosomal Binding Site

An Introduction to Bioinformatics Algorithms www.bioalgorithms.info

Central Dogma of biology

DNA: CCTGAGCCAACTATTGATGAA
  (transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE

Discovery of Split Genes

• In 1977, Phillip Sharp and Richard Roberts experimented with mRNA of hexon, a viral protein.

– Map hexon mRNA in viral genome by hybridization to adenovirus DNA and electron microscopy

– mRNA-DNA hybrids formed three curious loop structures instead of contiguous duplex segments

Expanded Central Dogma

[Figure: DNA (replication) with exon1, intron1, exon2, intron2, exon3 → transcription → splicing → translation → folding]

exon = coding, intron = non-coding

Splicing mechanism

(http://genes.mit.edu/chris/)

pre-mRNA: exon1 | GU ... A ... AG | exon2

• GU: 5’ splice site (“the donor”)
• A: “branchpoint”
• AG: 3’ splice site (“the acceptor”)

Spliceosome: a ribonucleoprotein complex, containing RNA and small nuclear ribonucleoproteins (snRNPs), that is assembled during the splicing of the messenger RNA primary transcript to excise an intron.

Splicing mechanism (shown sans spliceosome)

(1) pre-mRNA: exon1 | GU ... A ... AG | exon2. Spliceosome (not shown) forms.

(2) Lariat loop forms.

(3) Exon 1 cleaved from lariat, leaving a 3’-OH.

(4) Exons 1 and 2 positioned to ligate.

(5) Ligated exons + liberated lariat. Spliceosome (not shown) disassociates. The ligated exons go to the ribosome; the “lariat” gets degraded.


Splicing mechanism on the web

http://neuromuscular.wustl.edu/pathol/diagrams/splicefunct.html

http://neuromuscular.wustl.edu/pathol/diagrams/splicemech.html

Much thanks to T. Wilson, UCSC!

google: wustl neuromuscular splicefunct

google: wustl neuromuscular splicemech

Evolutionary models for introns

Penny, David, Marc P. Hoeppner, Anthony M. Poole, and Daniel C. Jeffares. "An overview of the introns-first theory." Journal of molecular evolution 69, no. 5 (2009): 527-540.


How to find splice points, using BLAST.

(1) Translate the DNA in all 6 frames.

(2) Search the database of protein sequences using the translations (blastx).

(3) Using the complete protein sequence, align it to the translation and find the regions of (near) perfect identity. These will abruptly end at the intron start site.

(4) Find the 5’-GT or 3’-AG signal at the point where the identity matches abruptly end.

(5) If your translation has an insertion with nearly perfect matches on either side, you have found an alternative splicing event.
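Step (1), translating the DNA in all six frames, is what blastx does internally before searching. A minimal sketch using the standard genetic code is shown below; the function names are illustrative, and incomplete trailing codons are simply dropped.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def reverse_complement(dna):
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frame_translations(dna):
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translations("CCTGAGCCAACTATTGATGAA")
```

For the central-dogma example sequence from earlier in the lecture, frame +1 gives PEPTIDE.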

In-class exercise: find the alternatively spliced variants

Go to NCBI, search nucleotides for AKAP9 (you should get the sequence with accession number NM_005751.4, GI:197245395)

Select “BLAST sequence”

Select blastx (not tblastx)

Select the nr/nt database. Organism: homo sapiens

Submit.

While waiting, do exercise on the next page....

A sure sign of alternative splicing in blastx output:

Score = 160 bits (404), Expect = 8e-37
Identities = 85/116 (73%), Positives = 86/116 (74%)
Frame = +2

Query: 76820 RSHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMAGAFSFIHSRVGSPWXXXXXXXXX 76999
             +SHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMA
Sbjct: 778   KSHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMA----------------------- 814

Query: 77000 XXXXRHTGVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ 77167
                    GVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ
Sbjct: 815   -------GVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ 863

Identical up to the insertion. Identical after the insertion. These must be the same gene.

HMM for pre-mRNA

States:

• 5'UTR
• AUG
• exon
• 5’ splice site (GU...)
• intron (0|1|2)
• 3’ splice site (...AG)
• stop (UAA|UAG|UGA)
• 3'UTR

XXXXXXATG...XXXGUX...XXAGXX...XXGUX...XXAGXX...XXTAAXXX

exon intron exon intron ...

Page 71: MEME • Gibbs sampling • ROC • modular HMMs • Gene findingStochastic version of MEME. (1) Choose initial (or random) guesses of motif locations. (2) Calculate the motif model

Intron model for mammals

[Figure: intron model. Labels: donor motif (contains GU), branch site, poly-pyrimidine region, acceptor motif (contains AG).]

from: Blencowe, BJ. “Exonic splicing enhancers: mechanism of action, diversity and role in human genetic diseases.” TIBS 25:106 (2000)


Introns and exons have length distributions

From Chris Burge's thesis, Stanford 1997


Exons/ORFs have 3-base repeats (Tiwari, 1997)

Performed a Fourier transformation on DNA sequences; found periodicity of 3 in ORFs and exons, but no periodicity in intergenic regions and introns.

Tiwari, S., Ramachandran, S., Bhattacharya, A., Bhattacharya, S., & Ramaswamy, R. (1997). Prediction of probable genes by Fourier analysis of genomic sequences. Bioinformatics, 13(3), 263-270.

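Sequence: CATGAGCATGATTACGACCATCATGTCCATGATCGATGACGATTAACTACTAGGATCA
U_A =     0100100100100100100100100000100100010010010011001001001001

$$S(f) = \sum_{\alpha \in \{A,C,G,T\}} \left| \sum_{j=1}^{N} U_\alpha(j)\, e^{2\pi i f j} \right|^2$$

Fourier transform of the sequence in binary form, summed over all 4 binary sequences. U_α(j) is 1 if base α occurs at position j and 0 otherwise.

[Figure panels: coding, not coding]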
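
A minimal sketch of the spectrum calculation, written to make the binary-sequence idea concrete. It assumes the squared magnitude of each base's transform is summed over the four bases; the 1/N² normalization and the function names are choices made here for illustration, not taken from the slide.

```python
import cmath


def indicator(seq, base):
    """Binary sequence U_base: 1 where seq has this base, else 0."""
    return [1 if c == base else 0 for c in seq.upper()]


def power_spectrum(seq, f):
    """Sum over A, C, G, T of |sum_j U(j) * exp(2*pi*i*f*j)|^2.

    Divided by N^2 so values are comparable across sequence lengths
    (normalization is an assumption of this sketch).
    """
    total = 0.0
    for base in "ACGT":
        u = indicator(seq, base)
        amp = sum(x * cmath.exp(2j * cmath.pi * f * j)
                  for j, x in enumerate(u, start=1))
        total += abs(amp) ** 2
    return total / len(seq) ** 2


if __name__ == "__main__":
    seq = "CATGAGCATGATTACGACCATCATGTCCATGATCGATGACGATTAACTACTAGGATCA"
    for f in (0.25, 1 / 3, 0.4):
        print(f, power_spectrum(seq, f))
```

A peak at f = 1/3 relative to neighbouring frequencies is the period-3 signal the slide describes for coding sequence.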


3-base repeats are an intrinsic property of the genetic code!

(Tiwari, 1997)

Fourier transformation of (a) a eukaryotic gene showing periodicity of 3, (b) the same gene after accounting for codon bias, (c) the same protein sequence, using randomly selected codons.

Periodicity of 3 persists. So it is a property of the genetic code!


HMM module for introns is made to reproduce length distributions

Stanke M, Steinkamp R, Waack S, Morgenstern B. “AUGUSTUS: a web server for gene finding in eukaryotes.” Nucleic Acids Res. 2004 Jul 1;32:W309-12.

[Figure: intron HMM module. States: DSS (donor site sequence model), ASS (acceptor site sequence model), Ifixed + Igeo (fixed length + variable length intron sub-model), Ishort (short variable length intron sub-model, including branch point). Transition probabilities shown: p, 1-p, q, 1-q, 1, 1.]
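
To see how such a module reproduces a length distribution, here is a rough sketch of the intron length probabilities it implies. The topology is an assumption read off the figure labels: DSS goes to Ishort with probability p or to Ifixed with probability 1-p; Ifixed emits a fixed number of bases and moves to Igeo; Igeo loops on itself with probability q and exits to ASS with probability 1-q. The short-intron distribution `short_pmf` is a made-up placeholder.

```python
def intron_length_prob(length, p, q, fixed_len, short_pmf):
    """P(intron body has this length) under the assumed topology.

    short_pmf: dict mapping length -> probability for the Ishort branch
               (hypothetical values; must sum to 1).
    fixed_len: bases emitted by Ifixed before entering Igeo.
    Igeo is assumed to emit at least one base, so the long branch
    has length fixed_len + k with k >= 1, geometric in k.
    """
    short_part = p * short_pmf.get(length, 0.0)
    extra = length - fixed_len
    long_part = 0.0
    if extra >= 1:
        long_part = (1 - p) * q ** (extra - 1) * (1 - q)
    return short_part + long_part


if __name__ == "__main__":
    short = {50: 0.5, 60: 0.5}  # placeholder short-intron lengths
    for n in (50, 100, 101, 500):
        print(n, intron_length_prob(n, p=0.3, q=0.99, fixed_len=100,
                                    short_pmf=short))
```

The fixed-length stretch keeps long introns from being shorter than a minimum, and the geometric tail is what a single self-looping state can model cheaply.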


A gene-finding HMM: GENSCAN

initial exon model

intron models

internal exon model

terminal exon model

single exon model

Intergenic regions

Mirrored models for reverse complement strand

middle exon model


GENSCAN -- forward strand part

internal exon 0-frame exon 1-frame exon 2-frame

donor splice site

acceptor splice site

non-emitting state

non-emitting state

initial exon 0-frame

terminal exon

intron (short)

intron (long)


Summary of gene finding HMMs of the GENSCAN type.

» Periodicity is seen in exons/ORFs
» Length distributions are seen in introns, exons.
» Various motifs are found in introns
» Upstream (promoter) and downstream (poly-A signal) signals bracket genes.
» ALL OF THIS DATA CAN BE CODED INTO AN HMM.


TwinScan

• Aligns two sequences and marks each base as gap ( - ), mismatch (:), match (|), resulting in a new alphabet of 12 letters: Σ = {A-, A:, A|, C-, C:, C|, G-, G:, G|, T-, T:, T|}.

• Run Viterbi algorithm using emissions e_k(b) where b ∊ {A-, A:, A|, …, T|}.

http://www.standford.edu/class/cs262/Spring2003/Notes/ln10.pdf

Using statistics and homology at the same time


TwinScan

more gaps, mismatches

fewer gaps, mismatches

States: E, I

b_E = {A-, A:, A|, C-, C:, C|, G-, G:, G|, T-, T:, T|}

b_I = {A-, A:, A|, C-, C:, C|, G-, G:, G|, T-, T:, T|}

The exon (E) state mostly emits matches (N|); the intron (I) state mostly emits mismatches (N:) and gaps (N-). Gamma values (the posterior state probabilities from the forward-backward algorithm) predict introns and exons. May be combined with motif models for introns and with codon models.

[Figure: example state path along the alignment, with segments labeled exon (E) and intron (I).]
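
The two-state idea can be sketched with a small Viterbi decoder. This is a simplified illustration, not TwinScan itself: it assumes emissions depend only on the conservation mark (|, :, -) and not on the base, and every probability in the usage example is an invented placeholder.

```python
import math

STATES = ("E", "I")


def viterbi(marks, start, trans, emit):
    """Most probable E/I state path for a string of marks ('|', ':', '-').

    start[s], trans[r][s], emit[s][mark] are probabilities; the
    recursion runs in log space to avoid underflow.
    """
    v = [{s: math.log(start[s]) + math.log(emit[s][marks[0]]) for s in STATES}]
    back = []
    for m in marks[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[-1][r] + math.log(trans[r][s]))
            row[s] = v[-1][best] + math.log(trans[best][s]) + math.log(emit[s][m])
            ptr[s] = best
        v.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: v[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    start = {"E": 0.5, "I": 0.5}
    trans = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
    emit = {"E": {"|": 0.90, ":": 0.05, "-": 0.05},
            "I": {"|": 0.30, ":": 0.35, "-": 0.35}}
    print(viterbi("||||||||::-:-::-||||||||", start, trans, emit))
```

Extending the emission tables to the full 12-letter alphabet only changes the keys of `emit`; the recursion stays the same.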


Splicing facts

Exons average 145 nucleotides in length
  Contain regulatory elements:
    ESEs: Exonic splicing enhancers
    ESSs: Exonic splicing silencers

Introns average more than 10x longer than exons
  Contain regulatory elements (bind regulatory complexes)
    ISEs: Intronic splicing enhancers
    ISSs: Intronic splicing silencers

Splice sites
  5' splice site
    Sequence: AGguragu (r = purine)
    U1 snRNP: Binds to 5' splice site
  3' splice site
    Sequence: yyyyyyy nagG (y = pyrimidine)
  Branch site
    Sequence: ynyuray (r = purine)
    U2 snRNP: Binds to branch site via RNA:RNA interactions between snRNA and pre-mRNA
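
The consensus strings above use ambiguity codes (r, y, n), which map directly onto regular expressions. The sketch below is a toy scanner under that assumption; the helper names are invented here, and requiring an exact consensus match is stricter than real splice-site recognition, where weight matrices are normally used.

```python
import re

# Ambiguity codes used on the slide, expanded for RNA.
CODES = {"a": "A", "c": "C", "g": "G", "u": "U",
         "r": "[AG]", "y": "[CU]", "n": "[ACGU]"}


def consensus_to_regex(consensus):
    """Turn a consensus such as 'AGguragu' into a compiled regex.

    Case (exon upper, intron lower) and spaces are ignored.
    """
    body = "".join(CODES[ch] for ch in consensus.lower() if ch != " ")
    return re.compile(body)


def find_sites(pattern, rna):
    """Start positions (0-based) of all matches, overlaps included."""
    lookahead = re.compile("(?=(" + pattern.pattern + "))")
    return [m.start() for m in lookahead.finditer(rna.upper())]


FIVE_PRIME_SS = consensus_to_regex("AGguragu")
BRANCH_SITE = consensus_to_regex("ynyuray")

if __name__ == "__main__":
    print(find_sites(FIVE_PRIME_SS, "CCAGGUAAGUCC"))
    print(find_sites(BRANCH_SITE, "GGUACUAACGG"))
```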


Alternative splicing facts

Alternative splicing
  Definition: Joining of different 5' and 3' splice sites
  ~80% of alternative splicing results in changes in the encoded protein
  Up to 59% of human genes express more than one mRNA by alternative splicing

Functional effects:
  Generates several forms of mRNA from single gene
  Allows functionally diverse protein isoforms to be expressed according to different regulatory programs

Structural effects:
  Insert or remove amino acids
  Shift reading frame
  Introduce termination codon

Gene expression effects
  Removes or inserts regulatory elements controlling translation, mRNA stability, or localization

Regulation
  Splicing pathways modulated according to:
    Cell type
    Developmental stage
    Gender
    External stimuli


Review

1. What information is used to find genes in prokaryotes?
2. How were introns discovered?
3. What is the better way to find genes: by homology, or by statistical models?
4. Where is the ribosome binding site relative to the transcription start site?
5. Where is the Pribnow Box? Gilbert box?
6. How can HMMs be linked together to make a larger model?
7. What happens to the intron once it is removed?
8. What is alternative splicing?
9. What do DNA alignments tell you about introns/exons?
10. What is a splicing enhancer? silencer?


Review

Why are there motifs in proteins?

What does MEME do?

How is Gibbs sampling different from MEME?

What kind of data can be clustered using K-means?

What is supervised learning? How is it different from machine learning?

What kind of data goes into a ROC score?


How to find motifs, signatures, footprints

MEME -- deterministic EM algorithm for motif finding, needs initial guess.

Gibbs sampling -- stochastic EM algorithm for motif finding, doesn’t need initial guess.

K-means -- unsupervised learning of recurrent patterns, requires a metric space (distance or similarity).

Supervised learning -- EM. Expectation in one "space", maximization in the other.

Method for comparing prediction methods
  ROC

Aligning (or not) low complexity sequences
  masking
  null models for repeats
  word HMMs

Review