
SERGEI L KOSAKOVSKY POND, CSE/BIMM/BENG 181, MAY 24, 2011

COMPUTATIONAL GENE PREDICTION




DEFINITIONS

A gene: a nucleotide sequence that codes for a protein

Gene prediction: given a genome, locate the beginning and ending position of every gene.

aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgcttaagctgcggctatgctaatgcatgcggctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctcatgcgg




CENTRAL DOGMA OF MOLECULAR BIOLOGY

HTTP://UPLOAD.WIKIMEDIA.ORG/WIKIPEDIA/EN/6/68/CENTRAL_DOGMA_OF_MOLECULAR_BIOCHEMISTRY_WITH_ENZYMES.JPG

CCTGAGCCAACTATTGATGAA

PEPTIDE

CCUGAGCCAACUAUUGAUGAA



http://en.wikipedia.org/wiki/File:Extended_Central_Dogma_with_Enzymes.jpg



BRIEF HISTORY

“The central dogma of molecular biology deals with the detailed residue-by-residue transfer of sequential information. It states that such information cannot be transferred from protein to either protein or nucleic acid.” Francis Crick, Nature, 1970

Originally stated in 1958, but questioned in the 1960s due to evidence of viral RNA to DNA transfer (shown by H. Temin and others)




CODONS

In 1961 Sydney Brenner and Francis Crick discovered frameshifting mutations

Systematically deleted nucleotides from DNA

Single and double deletions dramatically altered protein product

Effects of triple deletions were minor

Conclusion: every triplet of nucleotides – a codon – maps to exactly one amino acid in a protein




GENETIC CODE

Amino acid      Codons          Redundancy
Alanine         GC*             4
Cysteine        TGC,TGT         2
Aspartic Acid   GAC,GAT         2
Glutamic Acid   GAA,GAG         2
Phenylalanine   TTC,TTT         2
Glycine         GG*             4
Histidine       CAC,CAT         2
Isoleucine      ATA,ATC,ATT     3
Lysine          AAA,AAG         2
Leucine         CT*,TTA,TTG     6
Methionine      ATG             1
Asparagine      AAC,AAT         2
Proline         CC*             4
Glutamine       CAA,CAG         2
Arginine        AGA,AGG,CG*     6
Serine          AGC,AGT,TC*     6
Threonine       AC*             4
Valine          GT*             4
Tryptophan      TGG             1
Tyrosine        TAC,TAT         2
Stop            TAA,TAG,TGA     3

64 codons are mapped to 20 (+stop) amino-acid characters via a genetic code

Genetic codes may differ slightly between organisms and genomes (e.g. nuclear vs mitochondrial)

Multiple and differing redundancies in the genetic code

Synonymous and non-synonymous substitutions are fundamentally different




SIX READING FRAMES

HIV-1 protease

DNA: CCAATAAGTC CTATTGAAAC TGTACCAGTA ACAAAGCCAG GAATGGATGG CCCAAAGGTT AAACAATGGC CATTAACAGA AGAGAAAAAA GC

Protein translation:

In frame: PISPIETVPVTKPGMDGPKVKQWPLTEEKK

+1: QXVLLKLYQXQSQEWMAQRLNNGHXQKRKK

+2: NKSYXNCTSNKARNGWPKGXTMAINRREKS

X marks a stop codon, which signals the ribosome to stop protein synthesis.

Reverse complements are complementary DNA strands (opposite direction and complementary bases)

They define 3 other reading frames
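The six-frame translation of the HIV-1 protease fragment can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard genetic code; the function names and the frame labels (`+0` for "in frame", `-0` to `-2` for the reverse complement) are choices made for this example, not notation from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement every base and reverse the strand direction."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, stop="X"):
    """Translate complete codons only; stop codons are shown as `stop`."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c].replace("*", stop) for c in codons)


def six_frames(dna):
    """Return translations of the three forward and three reverse frames."""
    dna = dna.upper().replace(" ", "")
    rc = reverse_complement(dna)
    frames = {}
    for k in range(3):
        frames[f"+{k}"] = translate(dna[k:])
        frames[f"-{k}"] = translate(rc[k:])
    return frames


protease = ("CCAATAAGTC CTATTGAAAC TGTACCAGTA ACAAAGCCAG GAATGGATGG "
            "CCCAAAGGTT AAACAATGGC CATTAACAGA AGAGAAAAAA GC")
for name, protein in six_frames(protease).items():
    print(name, protein)
```

The three forward frames reproduce the translations listed on the slide; the three reverse frames come from the reverse complement.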




CONTIGUOUS VS SPLICED GENES

Based on bacterial experimentation, the sequences of DNA, RNA and protein were collinear; evidence suggested that eukaryotes followed the same pattern.

In 1977, Phillip Sharp and Richard Roberts experimented with mRNA of hexon, a viral protein.

Mapped adenovirus hexon mRNA in the viral genome by hybridization to adenovirus DNA and electron microscopy

mRNA-DNA hybrids formed three curious loop structures instead of a contiguous duplex segment

HTTP://NOBELPRIZE.ORG/NOBEL_PRIZES/MEDICINE/LAUREATES/1993/SHARP-LECTURE.PDF



http://nobelprize.org/nobel_prizes/medicine/laureates/1993/sharp-lecture.pdf



EXONS AND INTRONS

In eukaryotes, a gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns)

This makes computational gene prediction in eukaryotes even more difficult

Prokaryotes (e.g. bacteria) don’t have introns - their genes are contiguous.




EUKARYOTIC GENES

To appear in: Proceedings of the Workshop on Knowledge Discovery and Emergent Complexity in Bioinformatics (KDECB – Benelearn'06), Lecture Notes in Bioinformatics, Springer-Verlag, 2007

Fig. 1. The gene-parsing problem. A complete mRNA consists of one or more exons (rectangles). Portions of these exons may be coding (gray) or noncoding (hatched), with only the former giving rise to amino acids during translation. The coding segment extends from a start codon (ATG) to a stop codon (TGA, TAG, or TAA), with one or more introns (GT to AG) in between. Introns are spliced out prior to translation into a protein. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.

A gene parse thus consists of a syntactically valid series of signals from the set {ATG, GT, AG, TGA, TAA, TAG} which have been identified in the input sequence. The necessary syntactic constraints on the parse of a genomic sequence are:

ATG → TAG
ATG → GT
GT → AG
AG → GT
AG → TAG
TAG → ATG

where the rule X → Y indicates that signal X may be followed by signal Y in a syntactically valid parse (rules for genes on the opposite DNA strand are easily obtained from these). The set of all valid parses for a given input sequence may be represented using a parse graph (Fig. 2) in which vertices represent putative signals and edges represent possible exons, introns, and intergenic regions.

Fig. 2. An example parse graph. Vertices are shown as dinucleotide or trinucleotide motifs at the bottom. Edges denote exons, introns, or intergenic regions. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.

FROM “ADVANCING THE STATE OF THE ART IN COMPUTATIONAL GENE PREDICTION”, BY WILLIAM H. MAJOROS, UWE OHLER
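The syntactic constraints in the Majoros and Ohler excerpt (ATG → TAG, ATG → GT, GT → AG, AG → GT, AG → TAG, TAG → ATG, with TAG standing for any stop codon) amount to a small transition table. The sketch below checks whether a series of signals obeys them; the function name `is_valid_parse` and the dictionary layout are illustrative assumptions, and the check covers the forward strand only.

```python
# Allowed successor signals, taken from the rule list in the excerpt.
# "TAG" stands in for any stop codon (TGA, TAA, TAG).
NEXT_SIGNAL = {
    "ATG": {"GT", "TAG"},   # start codon: first intron or end of a single-exon gene
    "GT": {"AG"},           # donor site must be followed by an acceptor site
    "AG": {"GT", "TAG"},    # acceptor: another intron or the stop codon
    "TAG": {"ATG"},         # stop codon: intergenic region, then the next gene
}
STOP_CODONS = {"TGA", "TAA", "TAG"}


def is_valid_parse(signals):
    """True if every consecutive pair of signals obeys the rules above."""
    norm = ["TAG" if s in STOP_CODONS else s for s in signals]
    if any(s not in NEXT_SIGNAL for s in norm):
        return False
    return all(b in NEXT_SIGNAL[a] for a, b in zip(norm, norm[1:]))


print(is_valid_parse(["ATG", "GT", "AG", "GT", "AG", "TGA"]))  # two introns
print(is_valid_parse(["ATG", "AG", "TAA"]))                    # acceptor without donor
```

A parse graph in the sense of the excerpt would have one vertex per putative signal and an edge wherever this table allows the transition.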




© 2002 Nature Publishing Group. Nature Reviews Genetics, Volume 3, September 2002, p. 699

many good reviews on this topic, and useful benchmarks in the research (for example, REFS 1–8), a truly fair comparison of the prediction programs is impossible as their performance depends crucially on the specific TRAINING DATA that are used to develop them.

Gene structure and exon classification

The main characteristic of a eukaryotic gene is the organization of its structure into exons and introns (FIG. 1). Generally, all exons can be separated into four classes: 5′ exons, internal exons, 3′ exons and intronless exons (or, simply, intronless genes) (FIG. 2). They can be further subdivided into 12 mutually exclusive subclasses, according to their coding content (FIG. 2a), and it has been shown that these subclasses have different statistical properties (REF. 9). Because a vertebrate gene typically has many exons, internal coding exons (itexons, or internal translated exons) compose the main subclass that has been the focus of all gene-prediction programs. However, the definition of the term ‘exon’ has become confused, either unintentionally (due to lack of knowledge) or intentionally (for convenience). This confusion has led to the term ‘exon’ being used interchangeably with the term ‘coding sequence’ (CDS), which fails to take into account untranslated regions (UTRs). Almost all gene-prediction papers refer to four types of ‘exon’, as shown in FIG. 2b; however, these are just the coding regions of the exons. To avoid the misuse of these terms, I refer to subclasses of exons in this article as 5′ CDS, itexon, 3′ CDS and intronless CDS.

Finding internal coding exons

To determine exon–intron organization, an attempt can be made to detect either the introns or the exons. In early studies of pre-mRNA splicing, short splicing signals were identified in introns (FIG. 3): the donor site (5′ splice site or 5′ ss), which is characterized by the consensus AG|GURAGU; the acceptor site (3′ ss), which is characterized by the consensus YYYYYYYYYYNCAG|G; and the less-conserved branch site, which is characterized by CURAY (REF. 10). These genetic elements direct the assembly of the SPLICEOSOME by base pairing with the RNA components of the splicing apparatus, which carries out the splicing reaction (FIG. 3). Where short introns, which are mostly found in lower eukaryotes (such as yeast), occur, the intron seems to be recognized molecularly by the interaction of the splicing factors, which bind to both ends of it. Such intron-based gene-structure prediction has also been used in some computer algorithms (for example, POMBE in REF. 11). Recently, however, Lim and

TRAINING DATA SET

The known examples of an object (for example, an exon) that are used to train prediction algorithms, so that they learn the rules for predicting an object. They can be positive training sets (consisting of true objects, such as exons) or negative training sets (consisting of false objects, such as pseudoexons).

SPLICEOSOME

A ribonucleoprotein complex that is involved in splicing nuclear pre-mRNA. It is composed of five small nuclear ribonucleoproteins (snRNPs) and more than 50 non-snRNPs, which recognize and assemble on exon–intron boundaries to catalyse intron processing of the pre-mRNA.

[Figure: in the nucleus, genomic DNA (promoter, TSS, exons 1 to 5 with ATG and stop, poly(A) site, TTS) is transcribed to pre-mRNA, which RNA processing (capping, splicing, polyadenylation) turns into mRNA (cap, 5′ UTR, CDS from AUG to stop, 3′ UTR, poly(A)); RNA transport and translation by the ribosome in the cytoplasm then yield the polypeptide. Legend: coding sequence (CDS), untranslated (UTR) sequence, ribosome, polypeptide.]

Figure 1 | The central dogma of gene expression. In the typical process of eukaryotic gene expression, a gene is transcribed from DNA to pre-mRNA. mRNA is then produced from pre-mRNA by RNA processing, which includes the capping, splicing and polyadenylation of the transcript. It is then transported from the nucleus to the cytoplasm for translation. TSS, transcription start site; TTS, transcription termination site.

FROM “COMPUTATIONAL PREDICTION OF EUKARYOTIC PROTEIN-CODING GENES ”, BY MICHAEL Q ZHANG. NATURE REVIEWS GENETICS 3, 698-709




GENE FINDING APPROACHES

Direct

Close matches to ESTs, cDNA or protein sequences from the same or closely related organism

Computational

Something that matches an already known gene (homology)

Something that matches statistical patterns common to all genes (ab initio)

Hybrid




STATISTICAL APPROACH: METAPHOR IN UNKNOWN LANGUAGE

Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and numerals, could you distinguish between a story and a stock report in a foreign newspaper?




WHAT CAN WE MEASURE ABOUT GENES?

ORF (Open Reading Frame): a sequence started by ATG and terminated by a stop codon (e.g. TAA, TAG, TGA)

Codon Usage: the preference for using specific synonymous codons, most frequently measured by CAI (Codon Adaptation Index)

Features and motifs

Promoters, splice sites, enhancers, untranslated regions (UTRs)




OPEN READING FRAMES

Detect potential coding regions by looking at ORFs

A genome of length n comprises about n/3 codons in each reading frame

Stop codons break genome into segments

The subsegments of these that start from the Start codon (ATG) are ORFs

Some ORFs can overlap and code for different genes!

[Figure: an open reading frame within a genomic sequence, running from ATG to TGA]
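The ORF definition on this slide translates directly into a scan over each reading frame: remember the first ATG seen since the last stop codon, and close the ORF at the next in-frame stop. The sketch below covers the forward strand only and assumes the three standard stop codons; `find_orfs` and `min_codons` are names invented for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) index pairs, end exclusive, of forward-strand ORFs.

    Each ORF starts at the first ATG after the previous in-frame stop codon
    and ends with the next in-frame stop codon (included in the interval).
    `min_codons` counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


sequence = "CCATGAAATGATTT"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

Running the same function on the reverse complement would cover the other three frames; raising `min_codons` to about 100 gives the length filter discussed for prokaryotic genes.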




PROKARYOTES OR INTRON-LESS GENES

The basic concept is to look for ORFs that ‘look like’ genes:

Initially, long enough (~100 codons or longer)

But short ORFs are actually quite frequent in eukaryotic genes.

Have a believable codon composition, as measured by, e.g. the codon adaptation index (CAI)

SMALL OPEN READING FRAMES: BEAUTIFUL NEEDLES IN THE HAYSTACK

MUNIRA A. BASRAI, PHILIP HIETER, AND JEF D. BOEKE

GENOME RES. 1997. 7: 768-771

S. cerevisiae annotated (in 1997) vs all ORFs





Measures the relative abundance or paucity of a particular codon for a given organism/gene.

E.g. in a representative dataset of HIV-1 polymerase sequences the four codons that map to Alanine have a rather skewed distribution:

Codon   Count
GCA     41576
GCC     9461
GCG     1017
GCT     11031




COMPUTING CAI

Define the relative synonymous codon usage (RSCU) for a pair (i,j), where i is an amino acid and j is one of the n_i codons mapping to it, as

\[ \mathrm{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{k=1}^{n_i} X_{ik}} \]

where X_{ij} is the count of the j-th codon for amino acid i.

An RSCU > 1 indicates a preferred codon; an RSCU < 1 indicates an avoided codon.

Further define the relative adaptiveness w_{ij} as

\[ w_{ij} = \frac{\mathrm{RSCU}_{ij}}{\mathrm{RSCU}_{i,\max}} = \frac{X_{ij}}{\max_j X_{ij}} \]

Codon   Count   RSCU    w
GCA     41576   2.64    1
GCC     9461    0.60    0.23
GCG     1017    0.064   0.02
GCT     11031   0.70    0.27
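As a check on the alanine table, RSCU and relative adaptiveness can be computed directly from the four codon counts. This is a small sketch; `rscu_and_w` is a name chosen for the example, and it handles one amino acid (one synonymous family) at a time.

```python
def rscu_and_w(counts):
    """Compute RSCU and relative adaptiveness w for one synonymous family.

    `counts` maps each codon of a single amino acid to its observed count.
    RSCU divides a count by the family mean; w divides it by the family maximum.
    """
    mean = sum(counts.values()) / len(counts)
    top = max(counts.values())
    rscu = {codon: x / mean for codon, x in counts.items()}
    w = {codon: x / top for codon, x in counts.items()}
    return rscu, w


alanine = {"GCA": 41576, "GCC": 9461, "GCG": 1017, "GCT": 11031}
rscu, w = rscu_and_w(alanine)
for codon in sorted(alanine):
    print(codon, alanine[codon], round(rscu[codon], 3), round(w[codon], 2))
```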




SHARP AND LI, 1987

ORGANISM WIDE CODON USAGE




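THE CAI OF A GENE

The observed CAI of a sequence with L codons is the geometric mean of the RSCU values of its codons:

\[ \mathrm{CAI}_{obs} = \left( \prod_{k=1}^{L} \mathrm{RSCU}_{k} \right)^{1/L} \]

This is compared with the maximum possible CAI of all codon sequences with the same length that code for the same protein sequence, where RSCU_{k,max} is the largest RSCU among the codons synonymous with the k-th codon:

\[ \mathrm{CAI}_{max} = \left( \prod_{k=1}^{L} \mathrm{RSCU}_{k,\max} \right)^{1/L} \]

The CAI is their ratio:

\[ \mathrm{CAI} = \mathrm{CAI}_{obs} / \mathrm{CAI}_{max} \]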
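The ratio of the two geometric means is easiest to compute in log space, which avoids underflow for long genes. The sketch below assumes an RSCU table is already available; it uses only the rounded alanine values from the slides, so the toy "gene" consists of alanine codons alone. The function and argument names are illustrative.

```python
import math


def cai(codons, rscu, amino_acid_of):
    """CAI = CAI_obs / CAI_max, computed via sums of logarithms.

    `rscu` maps codon -> RSCU; `amino_acid_of` maps codon -> amino acid,
    which defines the synonymous family used to find each codon's maximum.
    """
    rscu_max = {}
    for codon, value in rscu.items():
        aa = amino_acid_of[codon]
        rscu_max[aa] = max(rscu_max.get(aa, 0.0), value)
    log_obs = sum(math.log(rscu[c]) for c in codons)
    log_max = sum(math.log(rscu_max[amino_acid_of[c]]) for c in codons)
    return math.exp((log_obs - log_max) / len(codons))


# Rounded alanine RSCU values from the HIV-1 polymerase example.
rscu = {"GCA": 2.64, "GCC": 0.60, "GCG": 0.064, "GCT": 0.70}
amino_acid_of = {codon: "Ala" for codon in rscu}

print(cai(["GCA", "GCA", "GCA"], rscu, amino_acid_of))         # only preferred codons
print(cai(["GCA", "GCC", "GCA", "GCT"], rscu, amino_acid_of))  # mixed codon usage
```

A sequence built only from preferred codons scores 1; any avoided codon pulls the geometric mean down.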




CAI DISTRIBUTION IN GENES

SHARP AND LI, 1987

Caveats

Some genes have unusual (for the organism) codon usage patterns

Predictive power of CAI depends on the length of the sequence, and many are quite short




A SIMPLE HMM FOR FINDING PROKARYOTIC/INTRON-LESS GENES

Nucleic Acids Research, 1998, Vol. 26, No. 4


Figure 1. Hidden Markov model of a prokaryotic nucleotide sequence used in the GeneMark.hmm algorithm. The hidden states of the model are represented as ovals in the figure, and arrows correspond to allowed transitions between the states.

The HMM framework of GeneMark.hmm, the logic of transitions between hidden Markov states, followed the logic of the genetic structure of the bacterial genome (Fig. 1). The Markov models of coding and non-coding regions were incorporated into the HMM framework to generate stretches of DNA sequence with coding or non-coding statistical patterns. This type of HMM architecture is known as ‘HMM with duration’ (13). The sequence of hidden states associated with a given DNA sequence S, carries information on positions where coding function is switching into non-coding and vice versa. Thus, the previously introduced functional sequence A becomes equivalent to the sequence of hidden states, called the HMM trajectory. Since the nucleotide sequence S is given, every possible sequence A could be assessed by the value of P(A|S), the conditional probability of A given S. This evaluation made use of the whole set of statistical models (see Materials and Methods). The core GeneMark.hmm procedure is the Viterbi algorithm (13) that finds the sequence A*. However, this core procedure did not take into account the possibility of gene overlaps since the observed overlaps, though frequent, were not extensive enough to provide sufficient data for deriving statistical models of overlapping genes in several possible orientations. To further improve the prediction of the translation start position the model of the ribosome binding site (RBS) was derived. This model was used to refine translation initiation codon prediction at the post-processing step.

The GeneMark.hmm program was evaluated on several test sets including sequences of the 10 complete bacterial genomes mentioned above. The GeneMark.hmm predictions were compared with GeneBank annotations. It was shown that the frequency of exact gene predictions is much higher than that of GeneMark (the version which also used the RBS model). We understand that the evaluation of the algorithm performance by comparison with the database annotation may not be enough conclusive evidence, since only in a few cases is the precise position of the translation initiation codon known from an experiment. However, the database annotation of the initiation codon represents the expert decision summarizing much indirect evidence and is thought to be close to the real one. The GeneMark program, actually, was able to correctly identify ORFs where 98% of all genes predicted by GeneMark.hmm resided. Also there were genes missed by GeneMark.hmm, mainly due to overlaps, that were recovered by GeneMark. However, the GeneMark.hmm program made several new predictions and some of them were confirmed by similarity search. It seems that the GeneMark.hmm development brought us closer to the goal of accurate prediction of bacterial genes and further arguments in favor of this statement are presented below.

MATERIALS AND METHODS

Materials

We have used DNA sequences of the complete genomes of H.influenzae (GenBank accession no. L42023), M.genitalium (L43967), M.jannaschii (L77117), M.pneumoniae (U00089), Synechocystis PCC6803 (synecho), E.coli (U00096), H.pylori (AE000511), M.thermoauthotrophicum (AE000666), B.subtilis (AL009126), Archeoglobus fulgidus (AE000782). The data on annotated E.coli RBS were provided by W. Hayes (22). The data on experimentally verified N-terminal protein sequences were kindly provided by A. Link (23). The Markov models parameters were obtained from the GeneMark library (http://exon.biology.gatech.edu/~genmark/matrices/).

Model of prokaryotic sequence structure

The architecture of the hidden Markov model used in the GeneMark.hmm algorithm is shown in Figure 1. To deal simultaneously with direct and reverse DNA strands, as was done in the initial GeneMark algorithm (11), nine hidden states were defined. These states correspond to the functional units of bacterial genomes, namely: (i) a Typical gene in the direct strand, (ii) a Typical gene in the reverse strand, (iii) an Atypical gene in the direct strand, (iv) an Atypical gene in the reverse strand, (v) a non-coding (intergenic) region, (vi/vii) start/stop codons in the direct strand, and (viii/ix) start/stop codons in the reverse strand. It should be mentioned that this HMM does not account for gene overlap (see below). The models of Typical and Atypical genes were derived from the sets of protein-coding DNA sequence obtained by clusterization of the whole set of genes from the genome of a given species (22). The names ‘Typical’ and ‘Atypical’ were used for the following reason. For the E.coli genome it was shown that the majority of the E.coli genes mainly belong to the cluster of Typical genes, while many genes that are believed to have been horizontally transferred into the E.coli genome fall into the cluster of Atypical genes. Note, that the comprehensive accounts on the E.coli genes evolutionary classification have been presented earlier (24,25).

An important feature of the proposed HMM architecture is that any coding as well as non-coding hidden state is allowed to generate a nucleotide sequence, observed sequence, of the length of hidden state duration (13). Such an explicit state duration HMM was used previously in algorithms Genie and GENSCAN (18,20). The crucial point, however, is that an observed DNA sequence S = {b1, b2, ..., bL} is thought to be generated by an HMM such as depicted in Figure 1, in parallel with the HMM transitions from one hidden state to another. The hidden state trajectory A, one of a variety of allowed paths, can be concisely represented as a sequence of M hidden states ai having duration di: A = {(a1d1)(a2d2) ... (aMdM)}, Σdi = L. For a given sequence of observed states (nucleotides) S = {b1, b2, ..., bL} the optimal

In this generalized HMM, some hidden states are allowed to emit a variable length substring, instead of a single letter.

The idea is that the ‘gene’ state emits the whole gene sequence at once, instead of N individual letters.

The length of the substring is drawn from a pre-defined probability function.

The Viterbi algorithm can be extended to deal with this generalization

Atypical genes are necessary to deal with, most prominently, horizontal transfers.
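To make the extended Viterbi recursion concrete, here is a toy explicit-duration decoder. It is a sketch under strong simplifying assumptions and not the GeneMark.hmm model: two hypothetical states (`noncoding`, `coding`) that must alternate, independent per-letter emissions, a uniform duration distribution up to `max_d`, and made-up probabilities. For every end position t and state s, it maximizes over the duration d of the last segment and over the previous state.

```python
import math

NEG_INF = float("-inf")


def segment_viterbi(seq, states, log_init, log_trans, log_dur, log_emit, max_d):
    """Most probable segmentation of seq as a list of (state, start, end)."""
    n = len(seq)
    # best[t][s]: best log-probability of seq[:t] with the last segment in state s
    best = [{s: NEG_INF for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for s in states:
            for d in range(1, min(max_d, t) + 1):
                # Score of emitting seq[t-d:t] as one segment of state s.
                seg = log_dur[s](d) + sum(log_emit[s][b] for b in seq[t - d:t])
                if d == t:  # the segment is the first one in the sequence
                    candidates = [(log_init[s] + seg, None)]
                else:
                    candidates = [
                        (best[t - d][p] + log_trans[p].get(s, NEG_INF) + seg, p)
                        for p in states
                    ]
                for score, prev in candidates:
                    if score > best[t][s]:
                        best[t][s] = score
                        back[t][s] = (d, prev)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[n][s])
    t, path = n, []
    while t > 0:
        d, prev = back[t][state]
        path.append((state, t - d, t))
        t, state = t - d, prev
    return path[::-1]


# Toy parameters (assumptions for illustration only).
states = ["noncoding", "coding"]
max_d = 10
log_init = {s: math.log(0.5) for s in states}
log_trans = {"noncoding": {"coding": 0.0}, "coding": {"noncoding": 0.0}}
log_dur = {s: (lambda d: -math.log(max_d)) for s in states}
log_emit = {
    "noncoding": {b: math.log(0.45 if b in "AT" else 0.05) for b in "ACGT"},
    "coding": {b: math.log(0.45 if b in "GC" else 0.05) for b in "ACGT"},
}

path = segment_viterbi("AAAAAGGGGGAAAAA", states, log_init, log_trans,
                       log_dur, log_emit, max_d)
print(path)
```

In a realistic gene finder the per-letter emissions would be replaced by codon-aware Markov models and the uniform durations by empirical length distributions.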




EMISSION LENGTH DISTRIBUTIONS CAN BE DETERMINED EMPIRICALLY


Figure 2. Length distribution probability densities of protein-coding and non-coding regions derived from the annotated E.coli genomic DNA (histograms). (a) Coding regions; the solid curve is the approximation by γ distribution g(d) = Nc (d/Dc)^2 exp(–d/Dc), where d is the length in nt, Dc = 300 nt, Nc is the coefficient chosen to normalize the distribution function on the interval from 30 nt (the minimal length of coding region) to 7155 nt (the maximal length). (b) Non-coding regions; the solid curve is the approximation by exponential distribution f(d) = Nn exp(–d/Dn), where Dn = 150 nt. The coefficient Nn normalizes the distribution function on the interval from 1 to 1000 nt.
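The two fitted curves in the caption can be turned into usable duration distributions by normalizing them over their stated ranges. The sketch below normalizes over integer lengths (a discretization chosen for this example; the paper normalizes a density over an interval) and uses the caption's values Dc = 300 nt and Dn = 150 nt. The function names are illustrative.

```python
import math


def normalized(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


def coding_length_pmf(d_c=300.0, d_min=30, d_max=7155):
    """g(d) proportional to (d/Dc)^2 exp(-d/Dc) on d_min..d_max."""
    return normalized({d: (d / d_c) ** 2 * math.exp(-d / d_c)
                       for d in range(d_min, d_max + 1)})


def noncoding_length_pmf(d_n=150.0, d_min=1, d_max=1000):
    """f(d) proportional to exp(-d/Dn) on d_min..d_max."""
    return normalized({d: math.exp(-d / d_n) for d in range(d_min, d_max + 1)})


coding = coding_length_pmf()
noncoding = noncoding_length_pmf()
print("most probable coding length:", max(coding, key=coding.get))
print("most probable non-coding length:", max(noncoding, key=noncoding.get))
```

The coding curve peaks at d = 2·Dc = 600 nt, while the exponential non-coding curve is largest at the shortest allowed length.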

\[ R = \sum_{k=1}^{w} n_b^2(k) \qquad (7) \]

Here nb(k) is the number of symbols b (b = T, C, A, G) in the position (column) k of the window alignment. In each step of the simulated annealing algorithm iterative procedure, one of the 325 sequences chosen at random was shifted to the right or to the left, relative to the fixed window, for a randomly chosen number of positions (with no gaps, deletions or insertions). The matching score R* for the resulting alignment was calculated (equation 7). If R* was larger than R, the new alignment was unconditionally accepted and used as the starting point for the next iterative step. Otherwise, the new alignment was accepted with the probability exp[–(R – R*)/T], where the parameter T can be interpreted as the ‘temperature’ in the annealing procedure. We used the standard exponential cooling schedule Tn+1 = cTn, where c = 0.999999. The window size was chosen to be equal to w = 5.

Table 1. Nucleotide frequencies for the RBS model

Nucleotide   Position
             1       2       3       4       5
T            0.161   0.050   0.012   0.071   0.115
C            0.077   0.037   0.012   0.025   0.046
A            0.681   0.105   0.015   0.861   0.164
G            0.077   0.808   0.960   0.043   0.659

The model was derived using the multiple sequence alignment of 325 annotated ribosomal binding sites (see text). Given the set of aligned sequences, the frequency of a given nucleotide was calculated as the number of occurrences of this nucleotide in a given position divided by the total number of sequences.

The finally obtained alignment of the 325 sequences has revealed the RBS sequence pattern in the form of a matrix of positional nucleotide frequencies (Table 1). It is seen that the matrix defines the strong consensus sequence: AGGAG, which is complementary to a pentamer located in the E.coli 16S rRNA near its 3′-end. This observation is in a good agreement with the generally accepted mechanism of ribosome-mRNA binding. Note that a similar result was obtained previously (27). To evaluate a putative RBS we calculated its probabilistic score as the product of corresponding elements of the matrix given in Table 1. The threshold value for RBS score was chosen as 0.00025. It can be shown that the log of this score is proportional to ribosome binding energy (with appropriate sign) under the assumption of independent formation of ribonucleotide pairs.
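The probabilistic RBS score described in this paragraph is a product of five entries of Table 1, one per position. A minimal sketch using the table values and the 0.00025 threshold from the excerpt follows; the names `RBS_MATRIX`, `rbs_score` and `looks_like_rbs` are invented for this example.

```python
# Positional nucleotide frequencies from Table 1 (positions 1 to 5).
RBS_MATRIX = {
    "T": [0.161, 0.050, 0.012, 0.071, 0.115],
    "C": [0.077, 0.037, 0.012, 0.025, 0.046],
    "A": [0.681, 0.105, 0.015, 0.861, 0.164],
    "G": [0.077, 0.808, 0.960, 0.043, 0.659],
}
RBS_THRESHOLD = 0.00025  # threshold quoted in the excerpt


def rbs_score(pentamer):
    """Product of the matrix entries matching each base of a 5-nt window."""
    if len(pentamer) != 5:
        raise ValueError("the RBS model covers exactly 5 positions")
    score = 1.0
    for position, base in enumerate(pentamer.upper()):
        score *= RBS_MATRIX[base][position]
    return score


def looks_like_rbs(pentamer):
    """True if the window scores at or above the threshold."""
    return rbs_score(pentamer) >= RBS_THRESHOLD


for window in ("AGGAG", "AGGAA", "TTTTT"):
    print(window, rbs_score(window), looks_like_rbs(window))
```

The consensus AGGAG scores about 0.30, far above the threshold, while a poly-T window falls several orders of magnitude below it.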

Algorithm modifications for genomes other than E.coli

The GeneMark.hmm predictions were obtained for nine other bacterial genomes. In these computations we used the species specific Markov models of coding and non-coding regions. All other parameters of the GeneMark.hmm algorithm stayed the same as defined for the E.coli genome. It is worth mentioning that for the gram-positive bacterium, B.subtilis, we have slightly modified the RBS prediction procedure. In species, such as B.subtilis, that do not have the ribosomal protein S1 involved in initiation of the ribosome–mRNA complex, the elevated strength of ribosome binding sites is thought to be a compensatory mechanism to facilitate ribosome binding. For the B.subtilis case the described above alignment procedure produced a highly biased frequency pattern with the strong RBS consensus. To obtain reasonable agreement between predicted initiation codons of B.subtilis genes and annotated ones we had to admit to competition the alternative start codons located not only upstream to the Viterbi prediction of translation start, but also those located downstream up the 66 nt distance. We think that this rule could be applicable to all other genomes, but presently, there is a tendency in genome annotation process to prefer longer ORFs to shorter ones provided there is no convincing evidence in favor of the shorter one. Statistically, this tendency is well justified since it is expected that in about 75% of cases actual genes occupy the longest ORFs. This figure can be obtained as follows. Consider the set of four codons: ATG, TAA, TAG, TGA and an intergenic region situated upstream to the true initiation codon of a gene X. Read codons in 5′ direction in the same reading frame as the initiation codon until the first codon from the above set is met. If this codon is ATG, then the gene X does not occupy the longest ORF. Otherwise gene X does occupy the longest ORF, which




[Figure panels: E. COLI CODING; E. COLI NONCODING]




SPLICE SITE DETECTION

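The boundaries between exons and introns are signaled by donor and acceptor sites, which usually have GT and AG di-nucleotides at the start and end of the intron

Detecting these sites is difficult, because GT and AG appear very often

[Figure: exon 1, then an intron starting with GT at the donor site and ending with AG at the acceptor site, then exon 2]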
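A quick way to see why the dinucleotides alone are weak evidence is to list every GT and AG in a sequence. Under the simplifying assumption of uniformly random bases, each dinucleotide is expected at about one position in sixteen. The helper name `candidate_splice_sites` is invented for this sketch.

```python
def candidate_splice_sites(dna):
    """Return the start positions of every GT (donor) and AG (acceptor)."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


donors, acceptors = candidate_splice_sites("AAGGTAAGTCCAGG")
print("candidate donor positions:", donors)
print("candidate acceptor positions:", acceptors)
```

Even this 14-nt string yields two candidate donors and three candidate acceptors, which is why practical detectors score the surrounding positions as well.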




R. M. Stephens and T. D. Schneider

[Figure: information curves and sequence logos for donor and acceptor sites. Legend: protection from: o = hydroxyl radical, T = T1, . = RNAase-A]

Figure 1. Information curves and sequence logos for human spliceosome binding sites. The left half of the Figure shows the donor splice sites from position -8 to position +17, while the right half shows the -30 to +10 region around the acceptor sites. Position zero on both curves is the point on the intron adjacent to the splice point, i.e. on the 5’ side, the intron is cut immediately before position zero while on the 3’ side it is cut immediately after position zero. (These are the co-ordinates provided by GenBank.) In the matrix corresponding to each graph, the bottom row, labeled l, contains the position on the sequences relative to the splice points. The next 4 rows are the numbers of a, c, g and t bases (labeled as such) found at each position. These were used to create the frequency matrix for the analysis. For random sequences the frequencies at a position in the matrix should be about equal, and examination of the matrices at the edges shows this to be the case. Examination of the matrices around the zero points, however, shows a decided inequality in the numbers of the various bases. This means that the sequences around these zero positions are not random, and therefore there is information (conservation) at these points (the spikes on the graph). The top row of the matrices, labeled Rs(l) (= Rsequence(l)), is the amount of information present at position l on the sequences. The symbols found between this row and the graph represent those positions apparently protected by the spliceosome in protection experiments (Mount et al., 1983; Wang & Padgett, 1989; R. A. Padgett, personal communication). The curve and the matrix are summarized by the sequence logos (Schneider & Stephens, 1990) at the bottom of the Figure. In a logo, the total height of the stack of letters at each position is the amount of information present at that position. The heights of the individual letters are proportional to their frequencies at that position. The letters are ordered with the most frequent on top, so the most common base appears on the top of the logo and one may read the consensus sequence directly from the Figure. The vertical bars are 2 bits high; the region between them is removed during splicing. Error bars for the heights of the information curves and sequence logos are not shown in the Figure because they are below the resolution of the printer.

proof of association since patterns found in the region of a binding site are sometimes unrelated to the known function (Schneider & Stormo, 1989).) Subtle features of the splice sites, such as the gentle sloping of the pyrimidine (C/T) stretch at the acceptor site, can be seen in the logos.

Figure 1 also shows that the location of the pattern as indicated by the donor information curve does not precisely match those bases protected by the spliceosome in a T1 fingerprint experiment by Mount et al. (1983) (in positions +7 to +12), nor does it match the RNAase-A data in Krämer (1987) (in positions -17 to -4 and +7 to +11). We must point out, however, that there is a difference between a base being protected and a base being specifically bound. A base can be protected if it is

7.92 BITS (POSITIONS -3 TO +6) 9.35 BITS (POSITIONS -25 TO +2)
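The per-position information plotted in a sequence logo can be sketched as 2 bits minus the Shannon entropy of the base frequencies at that position. The function below is a simplified illustration: the name `information_content` is hypothetical, and the small-sample correction used by Schneider and colleagues is deliberately left out.

```python
import math

def information_content(column_counts):
    """Information (bits) at one aligned position: 2 - H, with H the entropy.

    column_counts maps each base to its count at this position.
    Assumption: no small-sample correction is applied.
    """
    total = sum(column_counts.values())
    entropy = -sum(
        (c / total) * math.log2(c / total)
        for c in column_counts.values()
        if c > 0
    )
    return 2.0 - entropy

# Hypothetical columns: a random-looking position, a half-conserved one, an invariant one.
print(information_content({"a": 25, "c": 25, "g": 25, "t": 25}))
print(information_content({"a": 50, "c": 0, "g": 0, "t": 50}))
print(information_content({"a": 0, "c": 0, "g": 100, "t": 0}))
```

Summing this quantity over the positions of a site gives totals of the kind quoted on the slide for the donor and acceptor regions.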

J. Mol. Biol. (1992) 228, 1124-1136

Features of Spliceosome Evolution and Function Inferred from an Analysis of the Information at Human Splice Sites

R. Michael Stephens (1,2)† and Thomas Dana Schneider (1)‡

(1) National Cancer Institute, Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, P.O. Box B, Frederick, MD 21702-1201, U.S.A.

(2) Linganore High School, 12013 Old Annapolis Rd., Frederick, MD 21701, U.S.A.

(Received 8 November 1991; accepted 19 August 1992)

An information analysis of the 5′ (donor) and 3′ (acceptor) sequences spanning the ends of nearly 1800 human introns has provided evidence for structural features of splice sites that bear upon spliceosome evolution and function: (1) 82% of the sequence information (i.e. sequence conservation) at donor junctions and 97% of the sequence information at acceptor junctions is confined to the introns, allowing codon choices throughout exons to be largely unrestricted. The distribution of information at intron-exon junctions is also described in detail and compared with footprints. (2) Acceptor sites are found to possess enough information to be located in the transcribed portion of the human genome, whereas donor sites possess about one bit less than the information needed to locate them independently. This difference suggests that acceptor sites are located first in humans and, having been located, reduce by a factor of two the number of alternative sites available as donors. Direct experimental evidence exists to support this conclusion. (3) The sequences of donor and acceptor splice sites exhibit a striking similarity. This suggests that the two junctions derive from a common ancestor and that during evolution the information of both sites shifted onto the intron. If so, the protein and RNA components that are found in contemporary spliceosomes, and which are responsible for recognizing donor and acceptor sequences, should also be related. This conclusion is supported by the common structures found in different parts of the spliceosome.

Keywords: splice; spliceosome; information theory; evolution; human

1. Introduction

In eukaryotic cells, nuclear RNA is usually spliced prior to translation (for reviews, see Green 1986, 1991; Sharp 1987). Removing introns is the function of the spliceosome, which is made up of small nuclear ribonucleoprotein particles (snRNPs§) (Brody & Abelson, 1985; Frendewey & Keller, 1985; Grabowski et al., 1985; Reed et al., 1988; Steitz et al., 1988). Because reliable splicing is necessary for cell survival, there must be a precise way for the spliceosome to identify RNA splice sites (Aebi & Weissmann, 1987). These patterns are defined by nucleotides at the ends of the introns and are probably not affected by folded RNA structures, since introns can have large interior deletions without affecting the splicing mechanism (Breathnach & Chambon, 1981). A model of splice site identification utilizing just the GT and AG dinucleotides is not acceptable because bases other than these dinucleotides affect splicing (Aebi & Weissmann, 1987; Aebi et al., 1987b). Even consensus sequences are not sufficient to characterize splice patterns (Breathnach & Chambon, 1981; Green 1986; Padgett et al., 1986; Aebi & Weissmann, 1987; Aebi

† Present address: Massachusetts Institute of Technology, EC Box R, 3 Ames Street, Cambridge, MA 02139, U.S.A.

‡ Author to whom all correspondence should be addressed.

§ Abbreviations used: snRNP, small nuclear ribonucleoprotein particles; bp, base-pair(s); MHC, major histocompatibility complex; Ig, immunoglobulin complex; kb, 10^3 bases.

0022-2836/92/241124-13 $08.00/0

© 1992 Academic Press Limited




“At the core of most gene recognition algorithms is one or more coding measures – functions which produce, given any sample window of sequence, a number or vector intended to measure the degree to which a sample sequence resembles a window of ‘typical’ exonic DNA ... attention can probably be limited to six of the twenty or so measures proposed to date”

The study evaluated how well different measures recovered known coding sequences (human and E. coli), using organism-specific training.

Applied linear discriminant analysis to train each method




HTTP://SCIEN.STANFORD.EDU/CLASS/EE368/PROJECTS2000/PROJECT15/ALGORITHMS.HTML

HTTP://HARLEQUIN.JAX.ORG/GENOMEANALYSIS/GENEFINDING04.PPT

LDA SPECIFICITY AND SENSITIVITY








FROM FICKETT AND TUNG 1992 (SP+SN)/2 MEASURE REDUNDANCY

Hexamer-based measures come out on top. They are based on the frequencies of 6-mers in one of the reading frames (0, +1, +2). They are highly predictive because they capture codon structure, codon usage bias, initiation sites and higher-order co-dependencies.

Pseudogenes can confuse even the best protein-trained approaches.
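As a rough illustration of a frame-specific hexamer measure, the sketch below counts in-frame 6-mers in training sequences and scores a window by a log-likelihood ratio. It is a minimal sketch under stated assumptions: the function names, the add-one pseudocount and the toy training sequences are all hypothetical, and this is not the exact measure evaluated by Fickett and Tung.

```python
from collections import Counter
import math

def hexamer_counts(seqs, frame=0):
    """Count 6-mers that start at codon boundaries of the given frame."""
    counts = Counter()
    for s in seqs:
        for i in range(frame, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    return counts

def log_odds_score(window, coding, noncoding, pseudo=1.0):
    """Sum of log(P_coding / P_noncoding) over in-frame hexamers of window.

    Assumption: add-`pseudo` smoothing over all 4**6 = 4096 possible hexamers.
    """
    c_total = sum(coding.values()) + pseudo * 4096
    n_total = sum(noncoding.values()) + pseudo * 4096
    score = 0.0
    for i in range(0, len(window) - 5, 3):
        h = window[i:i + 6]
        score += math.log((coding[h] + pseudo) / c_total)
        score -= math.log((noncoding[h] + pseudo) / n_total)
    return score

# Toy training data (hypothetical, far too small for real use).
coding_model = hexamer_counts(["ATGGCCATGGCCATGGCC"])
noncoding_model = hexamer_counts(["TTTTTTTTTTTTTTTTTT"])
print(log_odds_score("ATGGCCATGGCC", coding_model, noncoding_model))
print(log_odds_score("TTTTTTTTTTTT", coding_model, noncoding_model))
```

A positive score means the window looks more like the coding training set than the noncoding one; in practice the three frames on each strand would be scored separately.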






EXAMPLES OF OTHER FEATURES

E.g. promoters in prokaryotes (E. coli)

Transcription starts at offset 0.

Pribnow Box (-10)

Gilbert Box (-35)

Ribosomal Binding Site (+10)




TRANSCRIPT ASSEMBLY

Once individual ORFs and splice sites have been identified, they must be assembled into a full transcript.

Could be done with dynamic programming, or HMMs, for example.

Models need to incorporate relevant biological knowledge.

© 2002 Nature Publishing Group | NATURE REVIEWS | GENETICS | VOLUME 3 | SEPTEMBER 2002 | 701

REVIEWS

identify these boundaries, which results in predicted genes being either truncated or fused together. Determining the 3′ end of a gene is easier than determining its 5′ end. This is because most of the mRNA and EST sequences in GenBank are truncated at their 5′ ends. The exon-definition model can also be applied to 3′ exons by replacing the 5′ ss with the poly(A) site and by using the 3′-EXON LENGTH DISTRIBUTION; this is because long internal exons are rare in vertebrates, whereas 3′ exons frequently extend for many kilobases. The molecular bridge in this case is the interaction between the splicing factor U2AF65 and the carboxy-terminal domain of the poly(A) polymerase, which recognizes the poly(A) signal (FIG. 3).

By aligning 3′ ESTs against genomic sequence, many poly(A) sites have been identified. In this way, several statistical features (including the well-known poly(A) signal AAUAAA and the (G+U)-rich site) have been identified in six species (yeast, rice, Arabidopsis, fly, mouse and human) and used for poly(A)-site recognition (REF. 22). More reliable 3′ ends have been obtained by aligning mRNAs with genomic sequences. By using such a training set, a QDA-based program called POLYADQ was developed (REF. 23), which can predict both AAUAAA- and AUUAAA-dependent poly(A) sites in the human genome.

Because almost all gene-prediction programs focus on coding regions, they can only identify the 3′ CDS instead of the real 3′ exon. However, any itexon-recognition methods can be modified for this task by replacing the donor-site signal with the STOP-codon signal (FIG. 2b), together with the correct exon length distribution.

A true 3′-exon-prediction program, JTEF (REF. 24; BOX 2), was developed recently using a QDA-based method, which can predict the major subtype of 3′ exons: the 3′ tuexons (translated-then-untranslated 3′ exons, which are those that contain the true STOP codon, see FIG. 2a). Because it integrates several features across the 3′ exon, JTEF has substantially improved the accuracy of

LDA is implemented in SPL, a splice-site recognition module of the HEXON program (REF. 15). A new splice-site detection program, GeneSplicer, has also been developed recently (REF. 16) and is reported to perform favourably when compared with many other programs (such as NetPlantGene, NetGene2, HSPL, NNSplice, GENIO and SpliceView; BOX 2).

To discriminate CDS from intervening sequence, the best content measures are the so-called frame-specific hexamer frequencies (BOX 1), because they capture codon-bias information and codon–codon correlations. They also capture splice-site preferences, which are the most characteristic exon–intron features (REF. 17). For long open reading frames (ORFs), such as in bacterial or intronless genes, frame-specific hexamer frequencies alone can detect most of the CDS regions. An alternative approach (REF. 18) is to use an interpolated Markov model (IMM), in which the higher-order Markov probabilities are estimated from an average of the lower-order ones. Because the G+C content of mammalian genomes is biased by ISOCHORES (for example, see REF. 19), all content and signal measures need to be computed separately for different G+C regions. Exon size is another important feature variable because, for example, itexons have an approximately LOG-NORMAL DISTRIBUTION (REF. 9).

By combining splice-site features with exon–intron features (such as CDS measures, exon size and others), and by using a nonlinear quadratic discriminant analysis (QDA), the itexon-prediction program MZEF (REF. 20) has done better at the single-exon level than has HEXON (which is based on a LDA method) or GRAIL2 (which is based on an ANN method; REF. 21). However, to further improve exon-prediction accuracy, exon–exon dependencies also have to be incorporated, as discussed below.

Finding poly(A) sites and 3′ exons

The correct identification of the boundaries of a gene is essential when searching for several genes in a large genomic region. Many gene-prediction programs fail to

ISOCHORE: A large region of mammalian genomic DNA sequence in which C+G compositions are relatively uniform.

LOG-NORMAL DISTRIBUTION: The distribution of a random variable, the logarithm of which follows a normal distribution. A normal log (length) implies a strong fixed-length selection pressure.

EXON LENGTH DISTRIBUTION: A statistical distribution of exon sizes.

[Figure 3 panels: First exon definition; Internal exon definition; Last exon definition. Labeled components include CBC, U1 snRNP (70K), U2 snRNP, U2AF65/35, SR, CFI, CFII, CPSF, PAP and CstF, bridging Exon 1, Exon 2 and Exon 3 up to the AAUAAA signal.]

Figure 3 | Exon-definition model. Typically, in vertebrates, exons are much shorter than introns. According to the exon-definition model, before introns are recognized and spliced out, each exon is initially recognized by the protein factors that form a bridge across it. In this way, each exon, together with its flanking sequences, forms a molecular, as well as a computational, recognition module (arrows indicate molecular interactions). Modified with permission from REF. 26 © (2002) Macmillan Magazines Ltd. CBC, cap-binding complex; CFI/II, cleavage factor I/II; CPSF, cleavage and polyadenylation specificity factor; CstF, the cleavage stimulation factor; PAP, poly(A) polymerase; snRNP, small nuclear RNP; SR, SR protein; U2AF, U2 small nuclear ribonucleoprotein particle (snRNP) auxiliary factor.


exons in a ‘sea’ of intronic DNA, where many cryptic splice sites exist. This model has since been validated by many experiments, and it proposes that an internal exon is initially recognized by the presence of a chain of interacting splicing factors that span it (FIG. 3). The binding of these trans-acting factors to the pre-mRNA is responsible for the non-random nucleotide patterns that form the molecular basis for all exon-recognition algorithms. These sequence features are often divided into two types: ‘signals’, which correspond to short cis-elements or boundary sites (such as splice sites and branch sites); and ‘content’, which corresponds to the extended functional regions (such as exons and introns). To evaluate each feature, one needs to define a scoring function of the feature (also called a feature variable). The best scoring function is the conditional probability P(a|s) that the given sequence s contains the feature a. According to the Bayes equation, P(a|s) = P(s|a)P(a)/P(s), where P(s|a) is the likelihood of s containing a. So, a training sample (sequence set) with the known feature a is built, and then the occurrence of a particular sequence s is counted. Different features can then be integrated into a single score for the whole object (an itexon in this case). Genes are predicted by finding the gene structure that has the highest score, given the sequence. Approaches differ in their choice of features, scoring functions and integration methods. Once the problem is phrased as a statistical-pattern recognition problem, many statistical or machine learning tools are available for recognizing these patterns. Indeed, almost all of them have been applied to the exon (or gene)-recognition problem. Here, I review just a few generic or popular approaches.

Most early programs used the simple positional weight matrix method (WMM, see BOX 1) to identify splice-site signals. In recent programs, the correlation among positions in a signal is also explored. The weight array method (WAM) or Markov models (BOX 1) are used to explore adjacent correlations; decision-tree or maximal-dependence decomposition (MDD) methods are used to explore non-adjacent correlations; and artificial neural network (ANN) methods are used to explore arbitrary, nonlinear dependencies. These more complex models typically yield significant, but not marked, improvements over the simple WMM. However, major improvements have come from designing programs that can combine many related sequence features. Such features can be combined at different levels. At the splice-site level, the simplest way of combining features (such as splice-site score with exon-content score on the one hand and with intron-content score on the other hand) is to use Fisher’s linear discriminant analysis (LDA; BOX 1). In the LDA method, the total score is a linear sum of the scores of individual features, and the coefficients are determined by minimizing the prediction error using a positive and a negative training data set. This is equivalent to a perceptron method (for example, see REF. 14), which identifies an optimal plane surface to separate true positives from true negatives.
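The positional weight matrix idea can be sketched in a few lines: estimate per-position base frequencies from aligned training sites, then score a candidate by summing log-odds against a background. This is only an illustrative sketch. The function names, the pseudocount, the uniform 0.25 background and the four toy donor sites are assumptions, not the parameters of any program named above.

```python
import math

def build_wmm(sites, pseudo=1.0):
    """Build a log-odds weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    wmm = []
    for pos in range(length):
        counts = {b: pseudo for b in "ACGT"}  # pseudocounts avoid log(0)
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        # Assumption: uniform background frequency of 0.25 per base.
        wmm.append({b: math.log2((counts[b] / total) / 0.25) for b in "ACGT"})
    return wmm

def score_site(wmm, candidate):
    """Sum the per-position log-odds; positions are treated as independent."""
    return sum(col[b] for col, b in zip(wmm, candidate))

# Hypothetical aligned donor-site windows (not real training data).
donor_sites = ["AGGTAAGT", "AGGTGAGT", "TGGTAAGT", "AGGTAAGA"]
donor_wmm = build_wmm(donor_sites)
print(score_site(donor_wmm, "AGGTAAGT"))
print(score_site(donor_wmm, "CCCCCCCC"))
```

Because each position contributes independently, this model cannot represent the adjacent or non-adjacent correlations that WAM, MDD and ANN methods are designed to capture.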

Burge (REF. 12), in a systematic analysis of short introns, have suggested that these standard splice sites might not be sufficient for defining introns in the genomes of plants and humans.

In vertebrates, the internal exons are small (~140 nucleotides on average), whereas introns are typically much larger (with some being more than 100 kb in length). In 1990, the ‘exon-definition’ model (REF. 13) was proposed to explain how the splicing machinery recognizes

[Figure 2a, exon classification. Signal labels: TSS, ATG, GT, AG, Stop, Poly(A). Subclasses shown: 5′ exons (5′ utexon, 5′ uexon, 5′ utuexon); internal exons (ituexon, iutexon, iuexon, itexon, iutuexon); 3′ exons (3′ tuexon, 3′ uexon, 3′ utuexon); intronless gene.]

[Figure 2b, CDS misclassification: 5′ CDS, 3′ CDS, itexon, intronless CDS.]

Figure 2 | Exon classification. a | Exons can be classified into four classes and 12 subclasses, as shown. b | Coding sequence (CDS) ‘exons’. Four classes of exon-coding regions. These regions are not whole exons, except for the internal coding exons (itexons). i, internal; poly(A), polyadenylation; t, translated; TSS, transcription start site; u, untranslated.





SOME SIMPLE ASSEMBLY RULES

To appear in: Proceedings of the Workshop on Knowledge Discovery and Emergent Complexity in …

Fig. 1. The gene-parsing problem. A complete mRNA consists of one or more exons (rectangles). Portions of these exons may be coding (gray) or noncoding (hatched), with only the former giving rise to amino acids during translation. The coding segment extends from a start codon (ATG) to a stop codon (TGA, TAG, or TAA), with one or more introns (GT to AG) in between. Introns are spliced out prior to translation into a protein. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.

A gene parse thus consists of a syntactically valid series of signals from the set {ATG, GT, AG, TGA, TAA, TAG} which have been identified in the input sequence. The necessary syntactic constraints on the parse of a genomic sequence are:

ATG → TAG
ATG → GT
GT → AG
AG → GT
AG → TAG
TAG → ATG

where the rule X → Y indicates that signal X may be followed by signal Y in a syntactically valid parse (rules for genes on the opposite DNA strand are easily obtained from these). The set of all valid parses for a given input sequence may be represented using a parse graph (Fig. 2) in which vertices represent putative signals and edges represent possible exons, introns, and intergenic regions.

Fig. 2. An example parse graph. Vertices are shown as dinucleotide or trinucleotide motifs at the bottom. Edges denote exons, introns, or intergenic regions. Source: Majoros WH, Methods for Computational Gene Prediction, Cambridge University Press (forthcoming), reproduced with permission.

FROM “ADVANCING THE STATE OF THE ART IN COMPUTATIONAL GENE PREDICTION”, BY WILLIAM H. MAJOROS, UWE OHLER
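The six syntactic rules on this slide can be checked mechanically. The sketch below encodes them as a successor table and tests whether a series of signals is a valid parse. It makes one labeled assumption: the rules mention only TAG, so here TAG is taken to stand for any of the three stop codons (TAA, TAG, TGA). The names `ALLOWED_NEXT` and `is_valid_parse` are hypothetical.

```python
# Successor table for forward-strand signals, taken from the six rules above.
ALLOWED_NEXT = {
    "ATG": {"TAG", "GT"},  # start codon -> stop codon, or donor site
    "GT": {"AG"},          # donor site -> acceptor site
    "AG": {"GT", "TAG"},   # acceptor site -> next donor, or stop codon
    "TAG": {"ATG"},        # stop codon -> start codon of the next gene
}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def is_valid_parse(signals):
    """Return True if each consecutive pair of signals obeys the rules.

    Assumption: any stop codon is treated as the TAG symbol of the rules.
    """
    normalized = ["TAG" if s in STOP_CODONS else s for s in signals]
    if any(s not in ALLOWED_NEXT for s in normalized):
        return False
    return all(b in ALLOWED_NEXT[a] for a, b in zip(normalized, normalized[1:]))

print(is_valid_parse(["ATG", "GT", "AG", "TGA"]))  # one intron
print(is_valid_parse(["ATG", "AG"]))               # acceptor without a donor
```

A gene finder built on this idea would enumerate only the paths through the parse graph that pass such a check, and then pick the highest-scoring one.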





USING KNOWN GENES TO PREDICT NEW GENES

Some genomes may be very well-studied, with experimentally verified genes.

Closely-related organisms may have similar genes

Unknown genes in one species may be compared to genes in a sufficiently closely-related species

The idea is that gene structure is on average quite stable.




SIMILARITY-BASED APPROACH TO GENE PREDICTION

Genes in different organisms are similar

The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome

Problem: Given a known gene and an un-annotated genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene




COMPARING GENES IN TWO GENOMES

SMALL ISLANDS OF SIMILARITY CORRESPONDING TO SIMILARITIES BETWEEN EXONS




USING SIMILARITIES TO FIND THE EXON STRUCTURE

The known frog gene is aligned to different locations in the human genome

Find the “best” path to reveal the exon structure of the human gene

Start with a local alignment to find putative exons

[Figure axes: Human Genome; Frog Genes (known)]




CHAINING LOCAL ALIGNMENTS

Find substrings that match a given gene sequence (candidate exons); use a cutoff to define significance.

Define a candidate exon as (l, r, w): left, right, weight defined as score of local alignment

Look for a maximum chain of substrings, i.e. a set of non-overlapping non-adjacent intervals.




EXON CHAINING PROBLEM

Locate the number and beginning and end of each interval (2n points)

Find the “best”, i.e. maximum weight path

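[Figure: seven weighted intervals (weights 3, 4, 11, 9, 15, 5, 5) drawn over a coordinate axis marked 0 2 3 5 6 11 13 16 20 25 27 28 30 32; two chains are highlighted, with SCORE=18 and SCORE=19]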




EXON CHAINING PROBLEM: FORMULATION

Exon Chaining Problem: Given a set of putative exons, find a maximum set of non-overlapping putative exons

Input: a set of weighted intervals (putative exons)

Output: A maximum chain of intervals from this set

Would a greedy algorithm solve this problem?




Use a graph representation of the exon chaining problem

Can be solved in O(n) time using dynamic programming, once the 2n interval endpoints are sorted

ExonChaining (G, n) // Graph, number of intervals
  for i ← 0 to 2n
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of the interval I
      w ← weight of the interval I
      s_i ← max {s_j + w, s_(i-1)}
    else
      s_i ← s_(i-1)
  return s_2n

[Figure: on the example graph, the greedy chain scores 17 while the best chain scores 21]

GREEDY: 17

BEST: 21
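The ExonChaining recurrence can be written as a sweep over sorted interval endpoints. The sketch below is one possible rendering, not the slide author's code: intervals are (left, right, weight) tuples, the four example intervals are invented for illustration (they are not the ones in the figure), and `greedy_chain` is a hypothetical heaviest-first baseline included only to show why greedy selection can fail.

```python
def exon_chaining(intervals):
    """Maximum total weight of a chain of non-overlapping intervals."""
    events = []
    for idx, (left, right, _w) in enumerate(intervals):
        # Kind 0 (left end) sorts before kind 1 (right end) at equal coordinates,
        # so an interval starting exactly where another ends counts as overlapping.
        events.append((left, 0, idx))
        events.append((right, 1, idx))
    events.sort()
    best = 0               # plays the role of s_(i-1)
    score_at_left = {}     # s_j recorded at each interval's left endpoint
    for _pos, kind, idx in events:
        if kind == 0:
            score_at_left[idx] = best
        else:
            best = max(best, score_at_left[idx] + intervals[idx][2])
    return best

def greedy_chain(intervals):
    """Baseline: repeatedly take the heaviest interval that still fits."""
    chosen = []
    for left, right, w in sorted(intervals, key=lambda t: -t[2]):
        if all(right < cl or cr < left for cl, cr, _ in chosen):
            chosen.append((left, right, w))
    return sum(w for _, _, w in chosen)

# Hypothetical putative exons: (left, right, weight).
example = [(1, 5, 5), (6, 12, 10), (3, 8, 12), (13, 16, 4)]
print(exon_chaining(example), greedy_chain(example))
```

On this example the heaviest interval blocks two intervals whose combined weight is larger, so the greedy chain is suboptimal. The sort makes this version O(n log n) overall; the sweep itself is linear.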




EXON CHAINING: DEFICIENCIES

Poor definition of the putative exon endpoints

Optimal chain of intervals may not correspond to any valid alignment

First interval may correspond to a suffix, whereas second interval may correspond to a prefix

Combination of such intervals is not a valid alignment

[Figure axes: Human Genome; Frog Genes (known)]




SPLICED ALIGNMENT

Mikhail Gelfand and colleagues proposed a spliced alignment approach of using a protein within one genome to reconstruct the exon-intron structure of a (related) gene in another genome.

Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem).

This set is further filtered in such a way as to retain all true exons, possibly along with some false ones.




SPLICED ALIGNMENT PROBLEM: FORMULATION

Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence

Input: Genomic sequence G, target sequence T, and a set of candidate exons B.

Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximum among all chains of blocks from B.

Γ* - concatenation of all exons from chain Γ
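To make the formulation concrete, the sketch below solves tiny instances by brute force: it enumerates every chain of non-overlapping blocks, concatenates them, and scores the result against T with a global alignment. This is explicitly not the dynamic programming algorithm of Gelfand and colleagues. The scoring scheme (+1 match, -1 mismatch, -1 gap), the half-open (start, end) block coordinates and the function names are assumptions chosen for illustration.

```python
from itertools import combinations

def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (score only, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def spliced_alignment_bruteforce(G, T, blocks):
    """Try every chain of non-overlapping blocks; exponential, tiny inputs only."""
    blocks = sorted(blocks)
    best_score, best_chain = float("-inf"), []
    for k in range(1, len(blocks) + 1):
        for chain in combinations(blocks, k):
            # Blocks are half-open [start, end); each must end before the next starts.
            if all(chain[i][1] <= chain[i + 1][0] for i in range(k - 1)):
                spelled = "".join(G[s:e] for s, e in chain)  # this is Γ*
                score = global_score(spelled, T)
                if score > best_score:
                    best_score, best_chain = score, list(chain)
    return best_score, best_chain

# Hypothetical toy instance: the middle block behaves like an intron.
print(spliced_alignment_bruteforce("ATGCCCGGG", "ATGGGG", [(0, 3), (3, 6), (6, 9)]))
```

The practical algorithm avoids this enumeration by sharing alignment work between chains that have the same last block.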




EXON CHAINING VS SPLICED ALIGNMENT

In Spliced Alignment, every path spells out the string obtained by concatenation of labels of its edges. The weight of the path is defined as optimal alignment score between concatenated labels (blocks) and target sequence

This defines the weight of an entire path in the graph, but not the weights of individual edges.

Exon Chaining assumes the positions and weights of exons are pre-defined





The advantage of HMMs is that more states (such as intergenic regions, promoters, UTRs, poly(A) and frame- or strand-dependent exons and introns) can be added, as well as flexible transitions between the states, to allow partial transcripts, intronless genes or even multiple genes to be incorporated into a model. Multiple transcript predictions (which might correspond to alternatively spliced transcripts) can also be obtained by using sub-optimal parses. Because many functional features that determine alternative splicing have not been incorporated into existing programs, sub-optimal parses (or assignments) are unlikely to represent alternative splicing events. Rather, they can serve as

boundaries, we refer to a region as a state and to a boundary as a transition between states). If the conditional probability P(s|q) of finding a base s in state q (which might depend on neighbouring bases as specified by the probability model) and the transition probability T(q|q′) of finding state q after state q′, for any possible assignment (called a parse φ) of states {q_i: i = 1, 2, …, N} (i enumerates positions) are known, the joint probability is given by $P(\phi, S) = P(s_1|q_1)\,T(q_1|q_2)\,P(s_2|q_2) \cdots T(q_{N-1}|q_N)\,P(s_N|q_N)\,P_0(q_N)$. The Viterbi algorithm (DP for a HMM) can be used to find the most probable parse φ* (REF. 47) that corresponds to the optimal transcript (exon or intron) prediction.
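A minimal Viterbi sketch for a two-state model is shown below. Everything specific in it is an assumption made for illustration: the states "C" (GC-rich, coding-like) and "N" (AT-rich, noncoding-like), all probabilities, and the function name `viterbi`. It also uses the usual left-to-right indexing with a start distribution on the first position, whereas the formula above places the initial term on q_N. Real gene finders use many more states and higher-order emissions.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Most probable state path for seq (non-empty), computed in log space."""
    # scores[-1][q] = best log-probability of any path ending in state q.
    scores = [{q: math.log(start_p[q]) + math.log(emit_p[q][seq[0]]) for q in states}]
    backpointers = []
    for symbol in seq[1:]:
        column, pointers = {}, {}
        for q in states:
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][q]))
            column[q] = (scores[-1][prev] + math.log(trans_p[prev][q])
                         + math.log(emit_p[q][symbol]))
            pointers[q] = prev
        scores.append(column)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda q: scores[-1][q])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical parameters, chosen only so that the example is easy to follow.
states = ["C", "N"]
start_p = {"C": 0.5, "N": 0.5}
trans_p = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
emit_p = {
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
print("".join(viterbi("ATATGCGCGC", states, start_p, trans_p, emit_p)))
```

The returned path labels each base with a state, and the positions where the label changes are the predicted region boundaries.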

Box 2 | Useful internet resources

Gene-prediction programs: comparative genomics
Doublescan: http://www.sanger.ac.uk/Software/analysis/doublescan
SLAM: http://bio.math.berkeley.edu/slam
Twinscan: http://genes.cs.wustl.edu

Gene-prediction programs (many with homology searching capabilities)
GeneMachine: http://genome.nhgri.nih.gov/genemachine
Genscan: http://genes.mit.edu/GENSCAN.html
GenomeScan: http://genes.mit.edu/genomescan
Fgenesh, Fgenes-M, TSSW, TSSG, Polyah, SPL and RNASPL: http://genomic.sanger.ac.uk/gf/gf.shtml
Fgenesh, Fgenes-M, SPL and RNASPL: http://www.softberry.com/berry.phtml
HMMgene: http://www.cbs.dtu.dk/services/HMMgene
Genie: http://www.fruitfly.org/seq_tools/genie.html
GRAIL: http://compbio.ornl.gov/tools/index.shtml
GeneMark: http://www.ebi.ac.uk/genemark
GeneID: http://www1.imim.es/software/geneid/geneid.html#top
GeneParser: http://beagle.colorado.edu/~eesnyder/GeneParser.html
MZEF and POMBE: http://argon.cshl.org/genefinder/
AAT, MZEF with homology: http://genome.cs.mtu.edu/aat.html
MZEF with SpliceProximalCheck: http://industry.ebi.ac.uk/~thanaraj/MZEF-SPC.html
Genesplicer, Glimmer and GlimmerM: http://www.tigr.org/~salzberg
WebGene: http://www.itba.mi.cnr.it/webgene
GenLang: http://www.cbil.upenn.edu/genlang/genlang_home.html
Xpound: ftp://igs-server.cnrs-mrs.fr/pub/Banbury/xpound

Gene-prediction programs: alignment based
Procrustes: http://www-hto.usc.edu/software/procrustes/index.html
GeneWise2: http://www.sanger.ac.uk/Software/Wise2
SplicePredictor: http://bioinformatics.iastate.edu/cgi-bin/sp.cgi
PredictGenes: http://cbrg.inf.ethz.ch/subsection3_1_8.html

Finding ORFs and splice sites
DioGenes: http://www.cbc.umn.edu/diogenes/index.html
OrfFinder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html
YeastGene: http://tubic.tju.edu.cn/cgi-bin/Yeastgene.cgi
CDS (search coding regions): http://bioweb.pasteur.fr/seqanal/interfaces/cds-simple.html
Neural network splice site prediction: http://www.fruitfly.org/seq_tools/splice.html
NetGene2: http://www.cbs.dtu.dk/services/NetGene2

Last exon, promoter or TSS prediction
FirstEF, Core_Promoter, CpG_Promoter, Polyadq and JTEF: http://www.cshl.edu/mzhanglab
Eponine: http://www.sanger.ac.uk/Users/td2/eponine
Neural network promoter prediction: http://www.fruitfly.org/seq_tools/promoter.html
Transcription element search system: http://www.cbil.upenn.edu/tess
Signal Scan: http://bimas.dcrt.nih.gov/molbio/signal

AAT, analysis and annotation tool; ORF, open reading frame; TSS, transcription start site.





GENERAL THINGS TO REMEMBER ABOUT (PROTEIN-CODING) GENE PREDICTION SOFTWARE

It is, in general, organism-specific

It works best on genes that are reasonably similar to something seen previously

It finds protein coding regions far better than non-coding regions

In the absence of external (direct) information, alternative forms will not be identified

It is imperfect! (It’s biology, after all…)

HTTP://HARLEQUIN.JAX.ORG/GENOMEANALYSIS/GENEFINDING04.PPT




