Doug Raiford
Lesson 3
Have a fully sequenced genome
How do we identify the genes?
What do we know so far?
04/21/23 2Gene Prediction
Remember:
Start codon codes for methionine
Stop codons do not code for an amino acid
Does every ATG mark the beginning of a gene?
Does every TAG, TAA, or TGA mark the end?
Start codon: ATG
Stop codons: TAG, TAA, or TGA
The start and stop codons must be “in frame”
A set of codons must fit between them
Length evenly divisible by three
Open reading frame (ORF)
Series of codons bracketed by start and stop codons (in frame)
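The in-frame rule above can be sketched in a few lines of Python. This is a minimal illustration, not a gene finder: it scans only the given strand (no reverse complement), and the names `find_orfs` and `min_codons` are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=0):
    """Return (start, end) pairs for ORFs on the given strand.

    start is the index of the A in ATG; end is one past the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                    # look for the next ATG
    return sorted(orfs)

# The ATG at index 2 has an in-frame TGA; the ATG at index 7 has no in-frame stop.
print(find_orfs("CCATGAAATGATAGCC"))  # [(2, 11)]
```

Note that the second ATG in the example is followed by a TAG, but in a different frame, so it does not close an ORF.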
In real genes, the distance between start and stop codons tends to be longer than expected by chance
How long would we expect that distance to be?
There are 64 different codons
A given codon should show up randomly around once every 64 codons, or 192 nts (64 × 3)
3 stop codons
Expect 3 in every 64 codons, or one every 21 1/3 codons (21 1/3 × 3 = 64 nts)
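The arithmetic above can be checked with a short Python sketch. It assumes all 64 codons are equally likely and independent, which real genomes only approximate; the simulation function name is made up for this example.

```python
import random

CODONS = 64
STOPS = 3

p_stop = STOPS / CODONS              # chance that a random codon is a stop
expected_codons = CODONS / STOPS     # mean wait for a stop: 21 1/3 codons
expected_nts = 3 * CODONS / STOPS    # the same distance in nucleotides: 64

def mean_codons_until_stop(trials, rng):
    """Draw random codons until a stop appears; average the counts."""
    total = 0
    for _ in range(trials):
        count = 1
        while rng.random() >= p_stop:   # this codon was not a stop
            count += 1
        total += count
    return total / trials

print(expected_codons, expected_nts)
print(mean_codons_until_stop(100_000, random.Random(0)))  # close to 21.3
```

The simulated mean should land near 21 1/3, so an ORF much longer than 64 nts is unlikely to be a chance event.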
Escherichia coli
Number of genes in E. coli is 4356
Min 44 nts, max 8621
8 are < 64 nts
143 are < 128 nts (3%)
Good start, but there must be more
Approximately 77,000 ORFs > 2× the expected length on each strand
To “find” a gene, we would look for nucleotide sequences that look like the parts of a gene
Promoter region
Coding region
Terminator region
RNA polymerase
Start codon: ‘ATG’ = methionine
Stop codon (non-coding): ‘TAA’, ‘TAG’, or ‘TGA’
Promoter region
Attracts polymerase: specific sequences
Gene regulation: each promoter has a unique pattern
Motifs
[Figure: layout upstream of the coding region: polymerase binding at the -35 and -10 sites, transcription start site, ribosomal binding site, start codon]
Consensus for -35 sequence: TTGACA
Consensus for -10 sequence: TATAAT
Slightly different -35 and -10 motifs attract different sigma factors
Genes with similar upstream regions tend to be related: they express similarly
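Because real promoters only resemble the consensus, an exact string search misses most of them. A minimal Python sketch of a tolerant scan follows. It counts mismatches against the consensus (Hamming distance); the function names and the one-mismatch threshold are assumptions for illustration, not part of the lesson.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def scan_motif(dna, motif, max_mismatches=1):
    """Return (position, window, mismatches) for every near match."""
    dna, motif = dna.upper(), motif.upper()
    hits = []
    for i in range(len(dna) - len(motif) + 1):
        window = dna[i:i + len(motif)]
        d = hamming(window, motif)
        if d <= max_mismatches:
            hits.append((i, window, d))
    return hits

dna = "GCTTGACAGGCTATACTGC"           # made-up test sequence
print(scan_motif(dna, "TTGACA"))      # [(2, 'TTGACA', 0)]
print(scan_motif(dna, "TATAAT"))      # [(11, 'TATACT', 1)]
```

A fixed mismatch threshold treats every position as equally important, which is one reason probabilistic models are preferred for this job.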
Terminator: hairpin followed by a U-run (A-run in the DNA)
Weak uracil bonds, coupled with the hairpin binding to the NusA protein bound to the polymerase, end transcription
[Figure: polymerase at the A-run in the DNA (AAAAAAAA), with the matching U-run (UUUUUUU) in the mRNA]
How do we find these regions?
Difficult: the signals are fuzzy, not carved in stone
Hidden Markov Models (HMMs) are often used
All about the statistics
Markov chain: series of events along with probabilities
[Figure: state machine for matching TATAAT. Labels: Start; T A T A A T; “G or C”; “or T or A”; “Yay! I found one”]
The previous diagram was a “state machine” representation
An HMM should have states and observations
The states are “hidden”
[Figure: HMM state diagram for a TATAAT-like motif, states 0 to 8. Background states emit A, C, G, and T with probability .25 each; motif states emit T or A; arrows are labeled with transition probabilities]
Each state has a probability of “emitting” any given observation
Each state has a probability of “transitioning” to any given next state
Transition probability matrix
Rows represent the current state
Columns represent the state to which a transition will occur
Entry is the probability associated with that transition
Emission probability matrix
Rows represent states
Columns represent which observation is emitted
Entry is the probability associated with that emission
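The two matrices can be written down and run as a generator. The Python sketch below uses the six-state TATA model whose numbers appear in the deck's MATLAB snippet (states renumbered 0 to 5, columns in A C G T order). The `generate` helper and the choice to begin in state 0 are assumptions for illustration.

```python
import random

ALPHABET = "ACGT"

# TRANS[i][j]: probability of moving from state i to state j.
TRANS = [
    [16/17, 1/17, 0, 0, 0, 0],   # state 0: background, occasionally enters the motif
    [0, 0, 1, 0, 0, 0],          # state 1 -> 2
    [0, 0, 0, 1, 0, 0],          # state 2 -> 3
    [0, 0, 0, 0, 1, 0],          # state 3 -> 4
    [0, 0, 0, 0, 0, 1],          # state 4 -> 5
    [0, 0, 0, 0, 0, 1],          # state 5: background after the motif
]

# EMIS[i][k]: probability that state i emits ALPHABET[k].
EMIS = [
    [.25, .25, .25, .25],
    [0, 0, 0, 1],                # T
    [1, 0, 0, 0],                # A
    [0, 0, 0, 1],                # T
    [1, 0, 0, 0],                # A
    [.25, .25, .25, .25],
]

def generate(length, rng, start_state=0):
    """Walk the chain, emitting one letter per visited state."""
    states, letters = [], []
    state = start_state
    for _ in range(length):
        states.append(state)
        letters.append(rng.choices(ALPHABET, weights=EMIS[state])[0])
        state = rng.choices(range(len(TRANS)), weights=TRANS[state])[0]
    return states, "".join(letters)

states, letters = generate(40, random.Random(1))
print(letters)
print(states)
```

Every row of both matrices sums to 1, since each state must emit something and must go somewhere. Someone who sees only `letters` has to infer `states`, which is the sense in which the states are hidden.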
[Figure: TRANS matrix (rows: from state, columns: to state, entries: probability) and EMIS matrix (rows: state, columns: observations, entries: probability)]
Requires a subject matter expert to build a model
Often start with a state for each position in a possible match
Example: looking for something similar to TATAAT
Might not have both A’s
Might have an extra one in the first slot
Never have G’s or C’s
Also need a state for non-participating regions
First guess as to the probabilities
Maybe from the state associated with the first T to A: 100%
Then 50%/50% whether A or T
Then 50%/50% whether A or T
Then 100% T
Baum-Welch or Viterbi algorithm
Pass the algorithm a sequence of observations and a first guess as to the probabilities
It refines the probability matrices
•Assumes that the sequence adheres to the underlying probabilities.
•Traverses states keeping track of actual frequency of emissions and transitions
•Adjusts matrices accordingly
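The counting step in the bullets above can be sketched in Python. This is a simplified, hard-assignment version: it takes one assumed state path and counts along it, whereas Baum-Welch proper weights the counts by the probability of every path. The function name, the example path, and the convention that the chain sits in state 0 before the first observation are all assumptions for illustration.

```python
def reestimate(obs, path, n_states, n_symbols, start_state=0):
    """Count transitions and emissions along a path, then normalise each row."""
    trans_counts = [[0] * n_states for _ in range(n_states)]
    emis_counts = [[0] * n_symbols for _ in range(n_states)]
    prev = start_state
    for state, symbol in zip(path, obs):
        trans_counts[prev][state] += 1      # how often prev moved to state
        emis_counts[state][symbol] += 1     # how often state emitted symbol
        prev = state

    def normalise(rows):
        result = []
        for row in rows:
            total = sum(row)
            result.append([c / total if total else 0.0 for c in row])
        return result

    return normalise(trans_counts), normalise(emis_counts)

# The deck's 24-letter example sequence, encoded A=0, C=1, G=2, T=3.
seq = [0, 2, 1, 2, 0, 3, 0, 1, 2, 1, 2, 0, 3, 1, 2, 0, 3, 0, 3, 0, 2, 3, 2, 1]
# Assumed labelling: 16 background letters, then T A T A, then 4 more background.
path = [0] * 16 + [1, 2, 3, 4] + [5] * 4

trans, emis = reestimate(seq, path, n_states=6, n_symbols=4)
print(trans[0])   # state 0 stays 16 times out of 17, leaves once
print(emis[0])    # observed background frequencies of A, C, G, T
```

With this labelling, the first row of the refined transition matrix comes out as 16/17 and 1/17. In practice the count-and-adjust cycle is repeated until the matrices stop changing.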
Called checking the posterior probabilities
Given a sequence, check all possible paths through the model
Multiply the associated probabilities
Path with the highest probability is likely the path through the hidden states
Can use the “forward algorithm” to cut down the number of paths (dynamic programming)
Location in the sequence where the most probable states are “TATAAT” is a match
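The "check all possible paths" idea can be spelled out literally for a very short sequence. The Python sketch below enumerates every state path through the six-state TATA model (numbers from the deck's MATLAB snippet, states renumbered 0 to 5), multiplies the probabilities along each, and keeps the best. The helper names and the start-in-state-0 convention are assumptions. The approach is exponential in sequence length, which is why dynamic programming is used in practice.

```python
from itertools import product

TRANS = [
    [16/17, 1/17, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
EMIS = [                      # columns: A, C, G, T
    [.25, .25, .25, .25],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [.25, .25, .25, .25],
]

def path_probability(obs, path, start_state=0):
    """Product of emission and transition probabilities along one path."""
    if path[0] != start_state:
        return 0.0
    prob = EMIS[path[0]][obs[0]]
    for prev, state, symbol in zip(path, path[1:], obs[1:]):
        prob *= TRANS[prev][state] * EMIS[state][symbol]
    return prob

def best_path_brute_force(obs):
    """Try every possible state path; return (probability, path) of the best."""
    candidates = product(range(len(TRANS)), repeat=len(obs))
    return max((path_probability(obs, p), p) for p in candidates)

obs = [2, 3, 0, 3, 0, 1]      # "GTATAC" encoded A=0, C=1, G=2, T=3
prob, path = best_path_brute_force(obs)
print(path, prob)             # (0, 1, 2, 3, 4, 5), probability 1/272
```

Six letters already mean 6^6 = 46,656 candidate paths, almost all with probability zero.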
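[Figure: six-state HMM, states 0 to 5. State 0 loops on itself with probability 16/17 and moves to state 1 with probability 1/17; states 1 to 4 emit T, A, T, A and each moves on with probability 1; states 0 and 5 emit A, C, G, T with probability .25 each]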
Matlab is very useful for matrix operations
% Observed sequence, first as letters, then as indices (A=1, C=2, G=3, T=4)
seq = ['a','g','c','g','a','t','a','c','g','c','g','a','t','c','g','a','t','a','t','a','g','t','g','c'];
seq = [1,3,2,3,1,4,1,2,3,2,3,1,4,2,3,1,4,1,4,1,3,4,3,2];

% EMIS: one row per state, columns are A C G T
EMIS = [.25,.25,.25,.25;
        0,0,0,1;
        1,0,0,0;
        0,0,0,1;
        1,0,0,0;
        .25,.25,.25,.25];

% TRANS: rows are the current state, columns are the next state
TRANS = [16/17,1/17,0,0,0,0;
         0,0,1,0,0,0;
         0,0,0,1,0,0;
         0,0,0,0,1,0;
         0,0,0,0,0,1;
         0,0,0,0,0,1];
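For readers without Matlab, here is a minimal Python sketch of the dynamic-programming search for the most probable state path (the Viterbi algorithm) applied to the same `seq`, `EMIS`, and `TRANS` values. It is an illustrative stand-in, not the code used in the lesson. Indices are 0-based here (A=0, C=1, G=2, T=3), and the `START` vector, which forces the model to begin in state 0, is an assumption.

```python
import math

def safe_log(p):
    """Log of a probability, with log(0) mapped to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, start, trans, emis):
    """Return (most probable state path, its log probability)."""
    n = len(trans)
    score = [safe_log(start[s]) + safe_log(emis[s][obs[0]]) for s in range(n)]
    back = []                                  # back[t][s]: best predecessor of s
    for symbol in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + safe_log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + safe_log(trans[best][s])
                         + safe_log(emis[s][symbol]))
        back.append(pointers)
    state = max(range(n), key=lambda s: score[s])
    best_log = score[state]
    path = [state]
    for pointers in reversed(back):            # trace the pointers backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_log

SEQ = [0, 2, 1, 2, 0, 3, 0, 1, 2, 1, 2, 0, 3, 1, 2, 0, 3, 0, 3, 0, 2, 3, 2, 1]
START = [1, 0, 0, 0, 0, 0]                     # assumption: begin in state 0
TRANS = [
    [16/17, 1/17, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
EMIS = [
    [.25, .25, .25, .25],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [.25, .25, .25, .25],
]

path, log_prob = viterbi(SEQ, START, TRANS, EMIS)
print(path)
print("motif starts at index", path.index(1))  # 16 (0-based)
```

The path stays in state 0 for the first 16 letters, passes through states 1 to 4 on the `t a t a` at positions 17 to 20 (1-based), and finishes in state 5. Logs are used so that long sequences do not underflow.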
GeneMark (Georgia Institute of Technology): http://exon.biology.gatech.edu/
GENSCAN: http://genes.mit.edu/GENSCAN.html
Genie (Berkeley): http://www.fruitfly.org/seq_tools/genie.html
Glimmer (University of Maryland): http://www.cbcb.umd.edu/software/GlimmerHMM/
Can include all regions in the model
States for each position in each region
Coding region could be a simple set of three regions
[Figure: full gene model: polymerase binding at the -35 (TTGACA) and -10 (TATAAT) sites, transcription start site, ribosomal binding site, start codon, coding region, termination region]
Classic example: states are rainy or sunny
If we know whether someone is walking, shopping, or cleaning, we can predict the state
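The weather example can be run with the forward algorithm. The probabilities below are illustrative values often used with this textbook example; they are not given in the lesson, so treat every number as an assumption. The sketch computes how likely the observed activities are overall, then the probability of each weather state on the last day.

```python
STATES = ["Rainy", "Sunny"]

# All numbers below are assumed, illustrative values.
START = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIS = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

def forward(observations):
    """alpha[s]: probability of the observations so far, ending in state s."""
    alpha = {s: START[s] * EMIS[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        # Sum over every way of arriving in s, then emit the observation.
        alpha = {
            s: EMIS[s][obs] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return alpha

alpha = forward(["walk", "shop", "clean"])
likelihood = sum(alpha.values())
posterior = {s: a / likelihood for s, a in alpha.items()}
print(likelihood)   # about 0.0336
print(posterior)    # Rainy about 0.864 on the last day
```

The weather is never observed directly, yet seeing someone clean on the third day makes rain the far more likely hidden state under these assumed numbers.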
[Figure: hidden states and their emissions (observations)]
If something observable depends on an underlying state, we can use an HMM
In motifs, the sequence is visible; whether or not a region is a promoter site is not
Each state has a probability of emitting any given observation
Each state has a probability of transitioning to any given next state
Probabilistic parameters of a hidden Markov model (example)
x: states
y: possible observations
a: state transition probabilities
b: output probabilities