Doug Raiford Lesson 3

Upload: edwina-james

Post on 13-Jan-2016

Gene Prediction (10/13/2015)

Doug Raiford, Lesson 3

Have a fully sequenced genome

How do we identify the genes?

What do we know so far?

Remember
• The start codon codes for methionine
• Stop codons do not code for an amino acid

Does every ATG mark the beginning of a gene?

Does every TAG, TAA, or TGA mark the end?

Start codon: ATG
Stop codons: TAG, TAA, or TGA

The start and stop codons must be “in frame”
• A set of codons must fit between them
• Length evenly divisible by three

Open reading frame (ORF)
• Series of codons bracketed by start and stop codons (in frame)
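
The in-frame rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_orfs` and the `min_codons` parameter are made up for this example, only the forward strand is scanned, and each ORF begins at the first ATG seen in its frame.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end) index pairs of ORFs on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon,
    so end - start is always divisible by three.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3): # step one codon at a time
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOPS:
                # codons between start and stop, counting ATG but not the stop
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next start codon
    return orfs

print(find_orfs("CCATGAAATAGCC"))
```

In `CCATGAAATAGCC` the ATG and the TAG are in the same frame, so one ORF is reported. In `ATGAATAG` the TAG is out of frame with the ATG, so nothing is reported.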

In real genes, the distance between start and stop codons tends to be longer than expected by chance

How long would we expect that distance to be?

There are 64 different codons
• A given codon should show up randomly around once every 64 codons, or 192 nts (64 × 3)

3 stop codons
• Expect 3 in every 64 codons, or one every 21 1/3 codons (21 1/3 × 3 = 64 nts)
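
The same arithmetic can be checked in Python. The sketch assumes, as the slide implicitly does, that all 64 codons are equally likely and independent. The name `prob_no_stop` is made up for this example.

```python
from fractions import Fraction

N_CODONS = 4 ** 3          # 64 possible codons
N_STOPS = 3                # TAG, TAA, TGA

p_stop = Fraction(N_STOPS, N_CODONS)     # chance that a random codon is a stop
expected_codons = 1 / p_stop             # average spacing between stops, in codons
expected_nts = 3 * expected_codons       # the same spacing in nucleotides

def prob_no_stop(codons):
    """Probability that `codons` random codons in a row contain no stop codon."""
    return float((1 - p_stop) ** codons)

print(expected_codons, expected_nts)     # 64/3 codons, 64 nts
print(prob_no_stop(100))                 # chance of a 100-codon stretch with no stop
```

Under this model, a run of 100 codons with no stop happens by chance less than 1% of the time. That is why unusually long open reading frames are a useful first hint.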

Number of genes in E. coli (Escherichia coli) is 4356
• Min 44 nts, max 8621
• 8 are < 64 nts
• 143 are < 128 nts (3%)

Good start, but there must be more
• Approximately 77,000 ORFs longer than 2× the expected length on each strand

To “find” a gene, we would look for nucleotide sequences that look like the parts of a gene

[Figure: parts of a gene. Labels: promoter region, coding region, terminator region, RNA polymerase. Start codon: ‘ATG’ = methionine. Stop codon (non-coding): ‘TAA’, ‘TAG’, or ‘TGA’.]

Promoter region

Attract polymerase
• Specific sequences

Gene regulation
• Each promoter has a unique pattern
• Motifs

[Figure: promoter layout upstream of the coding region. Labels: polymerase binding at -35 and -10, transcription start site, ribosomal binding site, start codon, coding region. Consensus for the -35 sequence: T T G A C A. Consensus for the -10 sequence: T A T A A T.]

Slightly different -35 and -10 motifs attract different sigma factors

Genes with similar upstream regions tend to be related: they are expressed similarly

Terminator region
• Hairpin
• Followed by a U-run (A-run in the DNA)

Weak uracil bonds, coupled with the hairpin binding to the NusA protein bound to the polymerase, end transcription

[Figure: polymerase at the terminator. The A-run in the DNA (AAAAAAAA) is paired with the U-run in the mRNA (UUUUUUU).]

How do we find these regions? Difficult: the motifs are fuzzy, not carved in stone
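
One simple way to see the fuzziness is to scan for windows that are close to a consensus rather than identical to it. The sketch below counts mismatches against the -10 consensus TATAAT. The function names and the `max_mismatches` threshold are assumptions chosen for illustration. This is a naive baseline, not the statistical model the lesson goes on to describe.

```python
def mismatches(window, consensus):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(window, consensus))

def scan(dna, consensus="TATAAT", max_mismatches=1):
    """Return (position, window, mismatch count) for every near-match."""
    dna = dna.upper()
    k = len(consensus)
    hits = []
    for i in range(len(dna) - k + 1):       # slide a window along the sequence
        window = dna[i:i + k]
        d = mismatches(window, consensus)
        if d <= max_mismatches:
            hits.append((i, window, d))
    return hits

print(scan("GGTATAATGG"))    # exact copy of the consensus
print(scan("CCTATGATCC"))    # one position differs from the consensus
```

A fixed mismatch threshold treats every position as equally important. A probabilistic model can instead weight positions by how strongly they are conserved.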

Hidden Markov Models (HMMs) are often used
• All about the statistics
• Markov chain: a series of events along with probabilities

[Figure: state machine that looks for T A T A A T. Labels: Start; A; G or C; or T or A; “Yay! I found one”.]

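The previous slide was a “state machine” representation

An HMM should have states and observations
• The states are “hidden”

[Figure: HMM for the TATAAT motif with states 0 to 8. Two states emit A, C, G, or T with probability .25 each; the other states each emit a single T or A with probability 1. The arrows between states are labelled with transition probabilities.]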

Each state has a probability of “emitting” any given observation

Each state has a probability of “transitioning” to any given next state

Transition probability matrix (TRANS)
• Rows represent the current state
• Columns represent the state to which a transition will occur
• Entry is the probability associated with that transition

Emission probability matrix (EMIS)
• Rows represent states
• Columns represent which observation is emitted
• Entry is the probability associated with that emission

[Figure: TRANS matrix (rows: from state; columns: to state; entries: probability) and EMIS matrix (rows: state; columns: observations; entries: probability).]
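
As a rough illustration of the two matrices, the sketch below writes them as nested Python lists, using the 6-state TATA example whose values appear on the MATLAB slide later in this deck. The helper names (`rows_are_distributions`, `generate`) and the choice to start in state 0 are assumptions made for this example.

```python
import random

ALPHABET = "ACGT"   # column order of EMIS

# TRANS[i][j]: probability of moving from state i to state j
TRANS = [
    [16/17, 1/17, 0, 0, 0, 0],   # state 0: background, occasionally enters the motif
    [0, 0, 1, 0, 0, 0],          # states 1-4 step through the motif
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],          # state 5: background after the motif
]

# EMIS[i][k]: probability that state i emits ALPHABET[k]
EMIS = [
    [.25, .25, .25, .25],        # state 0: any nucleotide
    [0, 0, 0, 1],                # state 1: T
    [1, 0, 0, 0],                # state 2: A
    [0, 0, 0, 1],                # state 3: T
    [1, 0, 0, 0],                # state 4: A
    [.25, .25, .25, .25],        # state 5: any nucleotide
]

def rows_are_distributions(matrix):
    """Every row of a probability matrix must sum to 1."""
    return all(abs(sum(row) - 1.0) < 1e-9 for row in matrix)

def generate(length, rng, start=0):
    """Sample a state path and the letters it emits."""
    state, states, letters = start, [], []
    for _ in range(length):
        state = rng.choices(range(len(TRANS)), weights=TRANS[state])[0]
        letters.append(rng.choices(ALPHABET, weights=EMIS[state])[0])
        states.append(state)
    return states, "".join(letters)

print(rows_are_distributions(TRANS), rows_are_distributions(EMIS))
print(generate(30, random.Random(0)))
```

Sampling from the model is a quick sanity check. Whenever the path passes through states 1 to 4, the emitted letters spell TATA.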

Requires a subject matter expert to build a model

Often start with a state for each position in a possible match

Example: looking for something similar to TATAAT
• Might not have both A’s
• Might have an extra one in the first slot
• Never have G’s or C’s

Also need a state for non-participating regions

First guess as to probabilities
• Maybe from the state associated with the first T to A: 100%
• Then 50%/50% whether A or T
• Then 50%/50% whether A or T
• Then 100% T

Baum-Welch or Viterbi algorithm
• Pass the algorithm a sequence of observations and a first guess as to the probabilities
• It refines the probability matrices

• Assumes that the sequence adheres to the underlying probabilities
• Traverses states, keeping track of the actual frequency of emissions and transitions
• Adjusts matrices accordingly
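
The "keep track of frequencies, then adjust" step can be sketched as plain counting. The function below is a simplified illustration, not the full Baum-Welch procedure. It assumes a single state path is already known, as in Viterbi-style training, whereas Baum-Welch works with expected (fractional) counts over all paths. The name `reestimate`, the start state 0, and the hand-labelled path are assumptions made for this example. The sequence is the 24-letter example used on the MATLAB slide later in this deck.

```python
def reestimate(obs, path, n_states, n_symbols, start=0):
    """Re-estimate TRANS and EMIS from one observation sequence and one state path."""
    trans = [[0.0] * n_states for _ in range(n_states)]
    emis = [[0.0] * n_symbols for _ in range(n_states)]
    prev = start                                 # assumed: the model begins in state 0
    for state, symbol in zip(path, obs):
        trans[prev][state] += 1                  # count the transition actually taken
        emis[state][symbol] += 1                 # count the emission actually seen
        prev = state

    def normalize(rows):
        out = []
        for row in rows:
            total = sum(row)
            # rows for states that were never visited are left as all zeros
            out.append([c / total for c in row] if total else row[:])
        return out

    return normalize(trans), normalize(emis)

SEQ = "agcgatacgcgatcgatatagtgc"
obs = ["acgt".index(ch) for ch in SEQ]           # A=0, C=1, G=2, T=3
path = [0] * 16 + [1, 2, 3, 4] + [5] * 4         # hand-labelled: 'tata' at positions 16-19

trans, emis = reestimate(obs, path, n_states=6, n_symbols=4)
print(trans[0][:2])                              # state 0: stay vs. enter the motif
```

With this labelling, state 0 is left once out of 17 opportunities, which reproduces the 16/17 and 1/17 values that appear in the simplified model later in the slides.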

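Called checking the posterior probabilities
• Given a sequence, check all possible paths through the model
• Multiply the associated probabilities
• The path with the highest probability is likely the path through the hidden states
• Dynamic programming cuts down the number of paths to check: the Viterbi algorithm finds the single best path, and the “forward algorithm” sums over all paths
• A location in the sequence where the most probable states are “TATAAT” is a match

[Figure: simplified HMM with states 0 to 5. State 0 emits A, C, G, or T with probability .25 each, stays in state 0 with probability 16/17, and moves to state 1 with probability 1/17. States 1 to 4 emit T, A, T, A with probability 1 and each move to the next state with probability 1. State 5 emits A, C, G, or T with probability .25 each.]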
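
To make the "most probable path" idea concrete, here is a small Viterbi decoder in Python applied to the 6-state TATA model and the 24-letter example sequence from these slides. It is a sketch under assumptions: the model is taken to start in state 0, log probabilities are used to avoid underflow, and the function names are made up for this example.

```python
import math

NEG_INF = float("-inf")

def log(p):
    """Log that maps probability 0 to -infinity instead of raising an error."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(obs, trans, emis, start=0):
    """Most probable state path for `obs` (a list of symbol indices)."""
    n = len(trans)
    # score[s]: best log-probability of any path that ends in state s
    score = [log(trans[start][s]) + log(emis[s][obs[0]]) for s in range(n)]
    back = []                                   # back-pointers, one list per step
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + log(trans[r][s]))
            ptr.append(best)
            score.append(prev[best] + log(trans[best][s]) + log(emis[s][o]))
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):                  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return path[::-1]

TRANS = [[16/17, 1/17, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0],
         [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1]]
EMIS = [[.25, .25, .25, .25], [0, 0, 0, 1], [1, 0, 0, 0],
        [0, 0, 0, 1], [1, 0, 0, 0], [.25, .25, .25, .25]]

SEQ = "agcgatacgcgatcgatatagtgc"
obs = ["acgt".index(ch) for ch in SEQ]          # A=0, C=1, G=2, T=3
path = viterbi(obs, TRANS, EMIS)
print(path)
print([i for i, s in enumerate(path) if s == 1])   # where the motif begins
```

The decoded path stays in state 0, passes through states 1 to 4 at the `tata` near the end of the sequence, and finishes in state 5. The position where state 1 appears marks the match.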

MATLAB is very useful for matrix operations

% Observed sequence, first as letters and then as indices (A=1, C=2, G=3, T=4)
seq = ['a','g','c','g','a','t','a','c','g','c','g','a','t','c','g','a','t','a','t','a','g','t','g','c'];
seq = [1,3,2,3,1,4,1,2,3,2,3,1,4,2,3,1,4,1,4,1,3,4,3,2];

% Emission matrix: one row per state, columns are A C G T
EMIS = [.25,.25,.25,.25;
        0,0,0,1;
        1,0,0,0;
        0,0,0,1;
        1,0,0,0;
        .25,.25,.25,.25];

% Transition matrix: rows are the current state, columns are the next state
TRANS = [16/17,1/17,0,0,0,0;
         0,0,1,0,0,0;
         0,0,0,1,0,0;
         0,0,0,0,1,0;
         0,0,0,0,0,1;
         0,0,0,0,0,1];

• GeneMark (Georgia Institute of Technology): http://exon.biology.gatech.edu/
• Genscan: http://genes.mit.edu/GENSCAN.html
• Genie (Berkeley): http://www.fruitfly.org/seq_tools/genie.html
• Glimmer (University of Maryland): http://www.cbcb.umd.edu/software/GlimmerHMM/

Can include all regions in the model
• States for each position in each region
• Coding region could be a simple set of three regions

[Figure: gene layout with polymerase binding at the -35 sequence (T T G A C A) and the -10 sequence (T A T A A T), transcription start site, ribosomal binding site, start codon, coding region, and termination region.]

Classic example: the states are rainy or sunny
• If we know whether someone is walking, shopping, or cleaning, we can predict the state

[Figure: hidden states and the observations they emit.]
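
A short sketch of this weather example follows. The slide gives no numbers, so every probability below is a hypothetical value chosen only for illustration, and the function name `filter_states` is likewise made up. The code runs the forward algorithm with normalization, giving the probability of each hidden state after each observation.

```python
STATES = ("rainy", "sunny")

# All numbers below are hypothetical; the slide does not specify them.
START = {"rainy": 0.6, "sunny": 0.4}
TRANS = {"rainy": {"rainy": 0.7, "sunny": 0.3},
         "sunny": {"rainy": 0.4, "sunny": 0.6}}
EMIS = {"rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def filter_states(observations):
    """Forward algorithm: P(state at time t | observations up to t), for each t."""
    beliefs = []
    prior = START
    for obs in observations:
        # weight each state by how well it explains the observation
        unnorm = {s: prior[s] * EMIS[s][obs] for s in STATES}
        total = sum(unnorm.values())
        belief = {s: unnorm[s] / total for s in STATES}
        beliefs.append(belief)
        # push the belief one step forward through the transition probabilities
        prior = {s: sum(belief[r] * TRANS[r][s] for r in STATES) for s in STATES}
    return beliefs

for obs, belief in zip(["walk", "shop", "clean"], filter_states(["walk", "shop", "clean"])):
    print(obs, max(belief, key=belief.get), belief)
```

With these made-up numbers, seeing "walk" points to sunny and seeing "clean" points to rainy, even though the weather itself is never observed.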

If something observable depends on an underlying state, we can use an HMM

With motifs, the sequence is visible; whether or not a region is a promoter site is not

Each state has a probability of emitting any given observation

Each state has a probability of transitioning to any given next state

[Figure: probabilistic parameters of a hidden Markov model (example)]
• x: states
• y: possible observations
• a: state transition probabilities
• b: output probabilities