SURVEY AND SUMMARY

Current methods of gene prediction, their strengths and weaknesses

Catherine Mathé*, Marie-France Sagot1, Thomas Schiex2 and Pierre Rouzé3

Institut de Pharmacologie et Biologie Structurale, UMR 5089, 205 route de Narbonne, F-31077 Toulouse Cedex, France, 1INRIA Rhône-Alpes, UMR 5558 Biométrie et Biologie Évolutive, Université Claude Bernard, Lyon I, 43 Boulevard du 11 Novembre, F-69622 Villeurbanne Cedex, France, 2INRA Toulouse, Département de Biométrie et Intelligence Artificielle, Chemin de Borde Rouge, BP 27, F-31326 Castanet-Tolosan Cedex, France and 3Laboratoire Associé de l'INRA (France), Universiteit Gent, Ledeganckstraat 35, B-9000 Gent, Belgium

Received May 28, 2002; Revised and Accepted August 7, 2002

ABSTRACT

While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.

INTRODUCTION

The 21st century has seen the announcement of the draft version of the human genome sequence (1). Model organisms have been sequenced in both the plant (2,3) and animal kingdoms (4) and, currently, more than 60 eukaryotic genome sequencing projects are underway (see http://igweb.integratedgenomics.com/GOLD/).

However, biological interpretation, i.e. annotation, is not keeping pace with this avalanche of raw sequence data. There is still a real need for accurate and fast tools to analyze these sequences and, especially, to find genes and determine their functions. Unfortunately, finding genes in a genomic sequence (in this paper, we will only discuss genes encoding proteins) is far from being a trivial problem (5). The widely used and recognized approach for genome annotation (6) consists of employing, first, homology methods, also called 'extrinsic methods' (7), and, second, gene prediction methods or 'intrinsic methods' (8,9). Indeed, it seems that only approximately half of the genes can be found by homology to other known genes or proteins (although this percentage is of course increasing as more genomes get sequenced). In order to determine the 50% of remaining genes, the only solution is to turn to predictive methods and to elaborate fast, accurate and reliable gene finders (10).

Many gene prediction programs are currently publicly available. Most of them are referenced in the Web site maintained by W. Li (http://linkage.rockefeller.edu/wli/gene/). Several reviews have also been written on this topic, among which the most recent are Claverie (11), Guigó (12), Haussler (13) and Burge and Karlin (14). In the last 15 years, a competitive spirit has appeared and an ever increasing number of programs are thus being created, updated and adapted from one organism to another. While such a scientific rush helps to improve the quality of existing programs, it is also confusing for the users, who wonder what makes the programs different, which one they should use in which situation and, for each, what the prediction confidence is. This last question was addressed, in the case of vertebrate genomes, by Burset and Guigó (15) and recently by Rogic et al. (16), and in the case of a plant model genome (of Arabidopsis thaliana) by Pavy et al. (17). Finally, users also wonder whether current programs can answer all their questions (11). The answer is most probably no, and will remain no as it is unrealistic to imagine that such complex biological processes as transcription and translation can be explained merely by looking at the DNA sequence.

In this review, we start by giving a general overview of the classical gene finding approaches, without going too deeply into the mathematical methods and algorithms themselves (the interested reader is invited to have a look at the original papers). For the sake of exposition, the approaches are divided into finding the evidence for a gene and combining the evidence in order to better predict the structure of a gene. We end by focusing on the many remaining problems presented by the currently available gene prediction methods.

FINDING THE EVIDENCE

We consider in this paper the problem of finding genes coding for a protein sequence in eukaryotes only. The problem of finding genes in prokaryotes presents different types of difficulties (there are no introns and the intergenic regions are small, but genes may often overlap each other and the translation starts are difficult to predict correctly). Functionally, a eukaryotic gene can be defined as being composed of a transcribed region and of regions that cis-regulate the gene expression, such as the promoter region which controls both the site and the extent of transcription and is mostly found in the 5′ part of the gene. The currently existing gene prediction software look only for the transcribed region of genes, which is then called 'the gene'. We will adopt this definition of a gene in this review; the region between two transcribed regions will be called intergenic. In the current practice, the promoter is seen as sitting in the intergenic region, immediately upstream of the gene and not overlapping with it. This is also a simplification of reality. A gene is further divided into exons and introns, the latter being removed during the splicing mechanism that leads to the mature mRNA. Although some exons (or parts of them) may be non-coding, most gene finding software use the term exon to denote the coding part of the exons only. In this review, however, we will refer to the correct biological definition of exons and explicitly mention when only the coding part of them is concerned. Indeed, in the mature mRNA, the untranslated terminal regions (UTRs) are the non-coding transcribed regions, which are located upstream of the translation initiation (5′-UTR) and downstream (3′-UTR) of the translation stop. They are known to play a role in the post-transcriptional regulation of gene expression, such as the regulation of translation and the control of mRNA decay (18). Inside or at the boundaries of the various genomic regions, specific functional sites (or signals) are documented to be involved in the various levels of protein encoding gene expression, e.g. transcription (transcription factor binding sites and TATA boxes), splicing (donor and acceptor sites and branch points), polyadenylation [poly(A) site], translation (initiation site, generally ATG with exceptions, and stop codons).

*To whom correspondence should be addressed. Tel: +33 5 61 17 59 53; Fax: +33 5 61 17 59 94; Email: [email protected]

© 2002 Oxford University Press. Nucleic Acids Research, 2002, Vol. 30 No. 19 4103–4117

Essentially, two different types of information are currently used to try to locate genes in a genomic sequence. (i) Content sensors are measures that try to classify a DNA region into types, e.g. coding versus non-coding. Historically, the existence of a sufficient similarity with a biologically characterized sequence has been the main means of obtaining such a classification. Similarity-based approaches have often been called extrinsic in opposition to others that try to capture some of the intrinsic properties of the coding/non-coding sequences (compositional bias, codon usage, etc). (ii) Signal sensors are measures that try to detect the presence of the functional sites specific to a gene.

Content sensors

Extrinsic content sensors. Extrinsic content sensors simply exploit a sufficient similarity between a genomic sequence region and a protein or DNA sequence present in a database in order to determine whether the region is transcribed and/or coding. The basic tools for detecting sufficient similarity between sequences are local alignment methods ranging from the optimal Smith–Waterman algorithm to fast heuristic approaches such as FASTA (19) and BLAST (20). Besides the fact that some databases may contain information of poor quality (a topic that will be discussed later), and independently of the type of similarities that are considered, the obvious weakness of such extrinsic approaches is that nothing will be found if the database does not contain a sufficiently similar sequence. Furthermore, even when a good similarity is found, the limits of the regions of similarity, which should indicate exons, are not always very precise and do not enable an accurate identification of the structure of the gene. Small exons are also easily missed.
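As an illustration of the local alignment idea underlying these extrinsic sensors, the following is a minimal sketch of the Smith–Waterman score computation. The scoring values (match, mismatch, linear gap cost) and the function name are assumptions chosen for readability; they are not taken from FASTA, BLAST or any program discussed here, which use substitution matrices and affine gaps.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b.

    Toy scoring scheme (assumed for illustration): fixed match/mismatch
    values and a linear gap cost.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

A region of a genomic sequence would be reported as similar to a database entry only when such a score exceeds a chosen significance threshold, which is why short exons, with necessarily low scores, tend to be missed.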

Overall, similarities with three different types of sequences may provide information about exon/intron locations. The first and most widely used are protein sequences that can be found in databases such as SwissProt or PIR. It is estimated that almost 50% of the genes can be identified thanks to a sufficient similarity score with a homologous protein sequence. However, even when a good hit is obtained, a complete exact identification of the gene structure can still remain difficult because homologous proteins may not share all of their domains. Furthermore, UTRs cannot be delimited in this way.

The second type of sequences are transcripts, sequenced as cDNAs (a cDNA is a DNA copy of a mRNA) either in the classical way for targeted individual genes with high coverage sequencing of the complete clone or as expressed sequence tags (ESTs), which are one shot sequences from a whole cDNA library. ESTs and 'classical' cDNAs are the most relevant information to establish the structure of a gene, especially if they come from the same source as the genome to be annotated. ESTs provide information that enables the identification of (partial) exons, either coding or non-coding, and give unbiased hints on alternative splicing. However, ESTs give only local and limited information on the gene structure as they only reflect a partial mRNA. Furthermore, the correct attribution of EST sequences to an individual member in a gene family is not a trivial task.

Finally, under the assumption that coding sequences are more conserved than non-coding ones, similarity with genomic DNA can also be a valuable source of information on exon/intron location. Two approaches are possible: intra-genomic comparisons can provide data for multigenic families, apparently representing a large percentage of the existing genes (e.g. 80% for Arabidopsis); inter-genomic (cross-species) comparisons can allow the identification of orthologous genes, even without any preliminary knowledge of them. Nevertheless, the similarity may not cover entire coding exons but be limited to the most conserved part of them. Alternatively, it may sometimes extend to introns and/or to the UTRs and promoter elements. This will be the case when genomes are evolutionarily close or when genome duplications are recent events. In both cases, exactly discriminating between coding and non-coding sequences is not an obvious task.

In all cases, an important strength of similarity-based approaches is that predictions rely on accumulated pre-existing biological data (with the caveat mentioned later of possible poor database quality). They should thus produce biologically relevant predictions (even if only partial). Another important point is that a single match is enough to detect the presence of a gene, even if it is not canonical. An interesting study on the human genome concluded that EST databases represent an effective general purpose probe for gene detection in the human genome (21). The authors of the study stress, however, the fact that genes expressed either under very specific conditions or at a low level are generally not represented in EST databases.

Intrinsic content sensors. Originally, intrinsic content sensors were defined for prokaryotic genomes. In such genomes, only two types of regions are usually considered: the regions that code for a protein and will be translated, and intergenic regions. Since coding regions will be translated, they are characterized by the fact that three successive bases in the correct frame define a codon which, using the genetic code rules, will be translated into a specific amino acid in the final protein.

In prokaryotic sequences, genes define (long) uninterrupted coding regions that must not contain stop codons. Therefore, the simplest approach for finding potential coding sequences is to look for sufficiently long open reading frames (ORFs), defined as sequences not containing stops, i.e. as sequences between a start and a stop codon. In eukaryotic sequences, however, the translated regions may be very short and the absence of stop codons becomes meaningless (22).
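The ORF search described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the forward strand alone is scanned, ATG is the only start codon considered, and the `min_codons` threshold and function name are hypothetical choices rather than values used by any cited program.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 30):
    """Yield (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is the index of the A of the ATG; end is exclusive and includes
    the stop codon. Only ORFs with at least min_codons codons before the
    stop are reported.
    """
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield (start, i + 3, frame)
                start = None
```

In a eukaryotic sequence, where coding exons may be only a few dozen nucleotides long, any usable length threshold would be too small to separate real coding regions from chance ORFs, which is the limitation noted in the text.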

Several other measures have therefore been defined that try to more finely characterize the fact that a sequence is 'coding' for a protein: nucleotide composition and especially (G+C) content (introns being more A/T-rich than exons, especially in plants), codon composition, hexamer frequency, base occurrence periodicity, etc. Among the large variety of coding measures that have been tested, hexamer usage (i.e. usage of 6 nt long words) was shown in 1992 to be the most discriminative variable between coding and non-coding sequences (23). This characteristic has been widely exploited by a large number of algorithms through different methods.

Thus, hexamer frequency is one of the main variables used in SORFIND (24), Genview2 (25), the quadratic discriminant analysis approach of MZEF (26) and the neural network procedure of GeneParser (27). This last program combines the use of hexamer frequency with local compositional complexity measures estimated on octanucleotide statistics. Such statistics are also efficiently used, among other variables, in the linear discriminant analysis of Solovyev's GeneFinder (28).

More generally, the k-mer composition of coding sequences is the basis of the now ubiquitous so-called 'three-periodic Markov model' introduced in the pioneering algorithm GeneMark (29). Very briefly, a Markov model is a stochastic model which assumes that the probability of appearance of a given base (A, T, G or C) at a given position depends only on the k previous nucleotides (k is called the order of the Markov model). Such a model is defined by the conditional probabilities P(X|k previous nucleotides), where X = A, T, G or C. In order to build a Markov model, a learning set of sequences on which these probabilities will be estimated is required. Given a sequence and a Markov model, one can then very simply compute the probability that this sequence has been generated according to this model, i.e. the likelihood of the sequence, given the model.
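To make the estimation and likelihood steps concrete, here is a minimal sketch of a homogeneous Markov model of order k. The function names and the add-one pseudocount are assumptions made for the example; the first k bases of a sequence are simply skipped rather than modeled, which a real implementation would handle separately.

```python
import math
from collections import defaultdict

def train_markov(seqs, k, pseudo=1.0):
    """Estimate P(X | k previous nucleotides) from a learning set.

    Returns a function prob(context, base). The pseudocount (an assumption
    of this sketch) keeps unseen transitions from having probability 0.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        return (c.get(base, 0.0) + pseudo) / (sum(c.values()) + 4 * pseudo)

    return prob

def log_likelihood(seq, prob, k):
    """Log-probability that seq was generated by the model (after the first k bases)."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))
```

In practice, a content sensor would compare the log-likelihood of a region under a model trained on coding sequences with its log-likelihood under a model trained on non-coding sequences.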

The simplest Markov models are homogeneous zero order Markov models which assume that each base occurs independently with a given frequency. Such simple models are often used for non-coding regions, although it is now frequent to use higher order models to represent introns and intergenic regions as, for instance, in GeneMark, Genscan (30) and EuGène (31). The more complex three-periodic Markov models have been introduced to characterize coding sequences. Coding regions are defined by three Markov models, one for each position inside a codon.
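A three-periodic model can be sketched by keeping one table of transition counts per codon position. The code below is an illustrative assumption-laden toy, not the GeneMark implementation: training sequences are assumed to start at codon position 0, a pseudocount of 1 is used, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def train_three_periodic(coding_seqs, k=2):
    """Count transitions separately for each of the three codon positions.

    coding_seqs are assumed to be in frame (index 0 is codon position 0).
    Returns counts[phase][context][base].
    """
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for s in coding_seqs:
        for i in range(k, len(s)):
            counts[i % 3][s[i - k:i]][s[i]] += 1
    return counts

def frame_log_likelihood(seq, counts, frame, k=2, pseudo=1.0):
    """Log-likelihood of seq, assuming its first base is codon position `frame`."""
    total = 0.0
    for i in range(k, len(seq)):
        c = counts[(i + frame) % 3].get(seq[i - k:i], {})
        total += math.log((c.get(seq[i], 0.0) + pseudo)
                          / (sum(c.values()) + 4 * pseudo))
    return total

def best_frame(seq, counts, k=2):
    """Return the codon position (0, 1 or 2) of the first base that best explains seq."""
    return max(range(3), key=lambda f: frame_log_likelihood(seq, counts, f, k))
```

Scoring a window under the three possible phases in this way is what allows such models to suggest not only that a region is coding but also in which frame it is read.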

The larger the order of a Markov model, the finer it can characterize dependencies between adjacent nucleotides. However, a model of order k requires a very large number of coding sequences to be reliably estimated. Therefore, most existing gene prediction programs, such as GeneMark and Genscan, usually rely on a three-periodic Markov model of order five (thus exploiting hexamer composition) or less to characterize coding sequences. To cope with these limitations, interpolated Markov models (IMMs) have been introduced in the prokaryotic gene finder Glimmer (32). For each conditional probability, an IMM combines statistics from several Markov models, from order zero to a given order k (typically k = 8), according to the information available. These IMMs are now also used in GlimmerM (33), a version dedicated to eukaryotes, and in EuGène. The new version of Glimmer introduces yet another sophistication of Markov models called interpolated context models, which can capture dependencies among 12 adjacent nucleotides (34).

Another type of refinement is often needed in eukaryotic genomes. It consists of estimating several gene models according to the G+C content of the genomic sequence. This is done by Genscan and GeneMark.hmm (35). Indeed, it was shown that differences in gene structure and gene density along some genomes are closely related to their 'isochore' organization (36–38).

In general, most currently existing programs use two types of content sensors: one for coding sequences and one for non-coding sequences, i.e. introns, UTRs and intergenic regions. A few programs refine this by using a different model for the different types of non-coding regions (e.g. one model for introns, one for intergenic regions and an optional specific 3′- and 5′-UTR model in EuGène).

Although these methods are considered as 'intrinsic', the fact that the models are built from known sequences will inherently limit the applicability of the methods to sequences that, globally, behave in the same way as the learning set.

Signal sensors

The basic and natural approach to finding a signal that may represent the presence of a functional site is to search for a match with a consensus sequence (with possible variations allowed), the consensus being determined from a multiple alignment of functionally related documented sequences. This type of method is used, for instance, for splice sites prediction in SPLICEVIEW (39) and SplicePredictor (40), the latter combining it with other kinds of information.

A more flexible representation of signals is offered by the so-called positional weight matrices (PWMs), which indicate the probability that a given base appears at each position of the signal (again computed from a multiple alignment of functionally related sequences). Equivalently, one can say that a PWM is defined by one classical zero order Markov model per position, which is called an inhomogeneous zero order Markov model. The PWM weights can also be optimized by a neural network method, as proposed by Brunak et al. (41) for NetPlantGene (42) and NetGene2 (43) and used in NNSplice (44).
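A PWM can be built and used with very little code. The sketch below is illustrative only: the aligned example sites are made up, the pseudocount and the uniform background of 0.25 are assumptions, and scores are expressed as log-odds in bits, which is one common convention rather than the one used by any specific program cited here.

```python
import math

def build_pwm(sites, pseudo=1.0, alphabet="ACGT"):
    """Build per-position base probabilities from equal-length aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [s[pos] for s in sites]
        total = len(column) + pseudo * len(alphabet)
        pwm.append({b: (column.count(b) + pseudo) / total for b in alphabet})
    return pwm

def pwm_log_odds(window, pwm, background=0.25):
    """Log-odds score (bits) of a window against a uniform background."""
    return sum(math.log2(pwm[i][b] / background) for i, b in enumerate(window))

def scan(seq, pwm, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = pwm_log_odds(seq[i:i + width], pwm)
        if score >= threshold:
            hits.append((i, score))
    return hits
```

Because each position is scored independently, this representation cannot express the dependencies between positions that the higher order models are designed to capture.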


In order to capture possible dependencies between adjacent positions of a signal, one may use higher order Markov models. The so-called weight array model (WAM) is essentially an inhomogeneous higher order Markov model. It was first proposed by Zhang and Marr (45) and later used by Salzberg (46), who applied it in the VEIL (47) and MORGAN (48) software. Genscan also uses a modified WAM to model acceptor splice sites and a second order WAM to represent branch point information. This is closely related to the position-dependent triplet frequency model employed by MZEF for the same signal.

These methods assume a fixed length signal. Hidden Markov models (HMMs) [see the tutorial from Rabiner (49) and, for instance, Krogh (50) for applications to biological sequences] further allow for insertions and deletions. They have been used in NetGene2 to model the branch point signal. In order to capture the most significant dependencies between adjacent as well as non-adjacent positions, Burge proposed another model for donor sites called the maximal dependence decomposition (MDD) method.

Most existing programs use such models to represent and detect splice sites. An alphabetical list of currently available splice site detection programs is presented in Table 1. These programs can already integrate the output of several signal sensors, or even signal and content sensors, as for example NetGene2, which combines splice site signal and branch point signal sensors with a global coding content sensor through a neural network. A similar approach is used in the recent program GeneSplicer, which combines MDD models for splice sites with second order Markov models that characterize coding/non-coding regions around splice sites. Recently, it was shown that combining sequence-based metrics for splice sites (WAM) with secondary structure metrics could lead to valuable improvements in splice site prediction (51).

However, when using splice site prediction programs, one ends up with a list of potential splice sites, from which various gene structures may be built. The main purpose of such programs is not to find the gene structure but to try to find the correct exon boundaries. They are thus very useful in addition to an exon or gene predictor in order to refine an existing gene structure. These programs can also provide insights into possible alternative splicing, even if, so far, this possibility has been very poorly investigated.

Finally, HMMs have also been used to represent other types of signals, such as poly(A) sites (in 3′-UTRs), promoters, etc. Although promoter detection is, by nature, closely related to gene detection, we will not discuss it in this paper, as it represents, on its own, an important area of research in computational biology (52). Recently, Pedersen et al. (53) reviewed the (known) biology of eukaryotic promoters and the many difficulties encountered in predicting them. As for the previous 'intrinsic' content sensors, the fact that HMMs are built from a multiple alignment of known functional sequences inherently limits the sensors to canonical signals.

Another important signal to identify when trying to predict a coding sequence is the translation initiation codon. A few programs exist specifically dedicated to this problem (54–56), but most of them have a rather limited efficiency, which is maybe related to the lack of proper learning sets for eukaryotic genomes. Experimental information on the genuine location of translation starts has indeed been scarce up to now, a situation that will likely change soon with the advent of proteome data.

COMBINING THE EVIDENCE TO PREDICT GENE STRUCTURES

Since 1990, following the example of Fields and Soderlund (57) and of Gelfand (58), programs are no longer limited to searching for independent exons, but try instead to identify the whole complex structure of a gene. Given a sequence and using signal sensors, one can accumulate evidence on the occurrence of signals: translation starts and stops and splice sites are the most important ones since they define the boundaries of coding regions. In theory, each consistent pair of detected signals defines a potential gene region (intron, exon or coding part of an exon). If one considers that all these potential gene regions can be used to build a gene model, the number of potential gene models grows exponentially with the number of predicted exons. In practice, this is slightly reduced by the fact that 'correct' gene structures must satisfy a set of properties: (i) there are no overlapping exons; (ii) coding exons must be frame compatible; (iii) merging two successive coding exons will not generate an in-frame stop at the junction. The number of candidates remains, however, exponential. In almost all existing approaches, such an exponential number is coped with in reasonable time by using dynamic programming techniques.

Table 1. Splice site prediction programs

| Program | Organism | Method |
|---|---|---|
| GeneSplicer (152) | Arabidopsis, human | HMM + MDD |
| NETPLANTGENE (42) (http://www.cbs.dtu.dk/services/NetPGene/) | Arabidopsis | NN |
| NETGENE2 (43) (http://www.cbs.dtu.dk/services/NetGene2/) | Human, C.elegans, Arabidopsis | NN + HMM |
| SPLICEVIEW (39) (http://l25.itba.mi.cnr.it/~webgene/wwwspliceview.html) | Eukaryotes | Score with consensus |
| NNSPLICE0.9 (44) (http://www.fruitfly.org/seq_tools/splice.html) | Drosophila, human or other | NN |
| SPLICEPREDICTOR (40,153) (http://bioinformatics.iastate.edu/cgi-bin/sp.cgi) | Arabidopsis, maize | Logitlinear models: (i) score with consensus; (ii) local composition |
| BCM-SPL (http://www.softberry.com/berry.phtml; http://genomic.sanger.ac.uk/gf/gf.html) | Human, Drosophila, C.elegans, yeast, plant | Linear discriminant analysis |

HMM, hidden Markov model; MDD, maximal dependence decomposition; NN, neural networks.
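The role of dynamic programming in assembling gene structures can be illustrated with a deliberately simplified sketch: given scored candidate exons, find the highest-scoring set of non-overlapping ones. Only constraint (i), no overlapping exons, is enforced here; frame compatibility and in-frame stops at junctions are ignored, and the exon scores and function name are hypothetical. Real gene finders use richer state spaces, so this should be read as the general idea rather than any program's algorithm.

```python
from bisect import bisect_right

def best_exon_chain(exons):
    """Select the best-scoring chain of non-overlapping candidate exons.

    exons: list of (start, end, score) with end exclusive.
    Returns (total_score, chosen_exons). After sorting, each exon is
    examined once, instead of enumerating every possible subset.
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    best = [(0.0, [])]  # best[i]: optimum over the first i exons (sorted by end)
    for i, (start, end, score) in enumerate(exons):
        # Number of earlier exons that end at or before this exon's start.
        j = bisect_right(ends, start, 0, i)
        take = best[j][0] + score
        if take > best[i][0]:
            best.append((take, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]
```

For example, with four overlapping candidates the function keeps the first and third exons because their combined score exceeds that of any other compatible combination.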

Until recently, prediction methods that try to determine the whole gene structure, i.e. to assemble all the pieces, could be separated into two classes depending on whether the content of exon/intron regions was assessed using extrinsic or intrinsic content sensors. We first consider each of these approaches in turn and then see how more recent programs try to combine evidence coming from both intrinsic and extrinsic content sensors.

Extrinsic approaches

Pioneered by Gelfand et al. (59) with Procrustes, many programs based on similarity searches have emerged during the last 5 years. They are presented in Table 2. As mentioned before, one of the main weaknesses of the pure similarity-based content sensors is that the limits of similarities are never accurately defined. The principle of most of these programs is to combine similarity information with signal information obtained by signal sensors. This information will be used to refine the region boundaries. These programs inherit all the strengths and weaknesses of the sensors used and may, for example, fail when non-canonical splice sites are present.

Very briefly, all the programs in this class may be seen as sophistications of the traditional Smith–Waterman local alignment algorithm where the existence of a signal allows for the opening (donor) or closure (acceptor) of a gap with an essentially free extension cost. They are often referred to as 'spliced alignment' programs. Existing software may be further divided according to the type of similarity exploited: genomic DNA/protein, genomic DNA/cDNA or genomic DNA/genomic DNA. Some of these methods are able to deal with more than one type and to take into account possible frameshifts in the genomic DNA or cDNA sequences.

The purpose of Procrustes is to align a genomic sequence with a protein. The selection of the target protein is left to the user, who may retrieve it from a BLASTX search for instance. Procrustes then considers all potential exons from the query DNA sequence, initially with the only constraint that they must be bordered by donor and acceptor sites [according to Gelfand et al. (59), it is recommended to additionally use content sensors, although this does not seem to be done in the program available on the Web]. All possible exon assemblies are explored by translating the exons and aligning them with the target protein, using the PAM120 matrix for scoring mismatches. This is done in a time proportional to the product of the lengths of the query and target sequences. As a result, it produces an assembly with the highest similarity score to the target protein. Other programs performing the same task are GeneWise (60), PredictGenes, ORFgene (61) and ALN (62), the latter introducing a new code of 23 letters for translated codons (called a tron).

Some programs, like INFO (63) and ICE (64), use a dictionary-based approach: they first create dictionaries of k long segments from a protein or an EST database and then, using a look-up procedure, find all segments in the query DNA sequence having a match in the dictionary. A related alternative is used in the mixture model of Thayer et al. (65), where only the highly populated protein segments, derived from multiple protein alignments, are stored and then used to develop statistical models.
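The dictionary-based look-up can be sketched as follows. This is a toy working directly on nucleotide strings with exact k-mer matches; the database content, names and value of k are invented for the example, and the actual programs work at a much larger scale (and, for protein dictionaries, on translated segments).

```python
from collections import defaultdict

def build_dictionary(database, k):
    """Index every k-long segment of each database sequence.

    database: mapping from sequence name to sequence.
    Returns a mapping from k-mer to a list of (name, position).
    """
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def lookup(query, index, k):
    """Return (query_pos, name, db_pos) for every k-mer of query found in the index."""
    hits = []
    for i in range(len(query) - k + 1):
        for name, j in index.get(query[i:i + k], ()):
            hits.append((i, name, j))
    return hits
```

Building the index costs time once, after which each query segment is resolved by a single table access, which is what makes the approach attractive for large EST or protein collections.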

Other available programs are AAT (66), GeneSeqer (67,68), SIM4 (69) and Spidey (70), all of which perform an alignment of the genomic DNA sequence against a cDNA database. This is a very reliable way of identifying exons, independently of their coding status, especially when the genomic sequence is aligned against a cDNA from the same or a close organism (71). Difficulties may, however, be encountered when trying to delineate the UTR part of the genes, and thus the correct translation initiator and stop codons. AAT and GeneSeqer also allow for the alignment against a protein database.

The SYNCOD program (72) uses a more original procedure: it compares two DNA sequences (retrieved with BLASTN) and, for each ORF contained in a high scoring pair, analyzes the ratio of the number of mismatches resulting in synonymous (silent) codons to the number of mismatches resulting in non-synonymous codons. An ORF is then assumed to be a potential coding region if this ratio significantly deviates from a random behavior, simulated with a Monte Carlo procedure.
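The quantity at the heart of this procedure can be illustrated with a small sketch. It is not the SYNCOD implementation: it assumes two gap-free, in-frame, upper-case ACGT sequences of an aligned pair, counts differing codons rather than individual mismatched bases, uses the standard genetic code, and leaves out the Monte Carlo significance test. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def syn_nonsyn_counts(orf_a, orf_b):
    """Count differing codons that are synonymous versus non-synonymous.

    orf_a and orf_b are aligned, in-frame sequences without gaps.
    Returns (synonymous, non_synonymous).
    """
    syn = nonsyn = 0
    for i in range(0, min(len(orf_a), len(orf_b)) - 2, 3):
        ca, cb = orf_a[i:i + 3], orf_b[i:i + 3]
        if ca == cb:
            continue  # identical codons carry no information here
        if CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

A pair of sequences under selection for protein function would be expected to show an excess of synonymous differences compared with what random substitutions produce, which is the signal the Monte Carlo comparison is meant to detect.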

The approach adopted is rather different for programs which try to elucidate the gene structure from EST matches, like EbEST (73), Est2genome (74) or, more recently, TAP (75) and PAGAN (http://ismb01.cbs.dtu.dk/GeneFinding.html#A308). The reason for this comes from the specific nature of an EST. A first characteristic of ESTs is that they are very redundant and a large number of them may be retrieved when performing a BLAST search against dbEST. EbEST faces this problem in its first step by clustering ESTs into non-overlapping groups (PAGAN clusters alignments of ESTs obtained from the results of similarity searches) and then by selecting the most informative ESTs within each group. A second characteristic of ESTs is that they are naturally error prone since they are generated from single-read sequences. The Smith–Waterman algorithm used in EbEST tolerates the presence of such errors. Another characteristic of ESTs is that most of them are 3′ ESTs generated from oligo(dT)-primed cDNA libraries and are therefore useful for detecting the 3′-UTRs in long sequences. This last point is an important added value for the EST-driven gene modeling approaches, as it leads to a rather confident prediction of a gene 3′-end. This importance may be somewhat weakened by the fact that ESTs represent only partial mRNA sequences and even clusters of ESTs may not lead to the complete identification of the gene structure.

Finally, several programs try to retrieve information on conservation or synteny between organisms from genomic alignments, as, for example, MUMmer (76), WABA (77), PipMaker (78) and DIALIGN (79). Recently, a few algorithms have appeared that focus more specifically on the gene recognition problem by comparison of two genomic sequences: such programs are based on the hypothesis that coding DNA sequences are more conserved than non-coding sequences (intronic and intergenic). Comparing two homologous genomic sequences (cross- or intra-species) should thus help to reveal conserved exons and allow the prediction of genes simultaneously on both sequences. Some programs like ROSETTA (80) and CEM (81) are more specifically designed for the comparison of closely related species. In particular, they make the hypothesis of conserved exon–intron structure in the two sequences. ROSETTA makes



Table 2. Homology-based gene prediction programs

| Program | Organism | Databank or required input | Alignment | Gene reconstruction |
|---|---|---|---|---|
| AAT (66) (http://genome.cs.mtu.edu/aat.html) | Primates, rodents, other | cDNA, protein | DDS (improved BLASTX), DPS (improved BLASTN) | NAP, GAP2 |
| ALN (62) | | Protein | Tron code, PAM250 | |
| CEM (81) | | Two genomic sequences | BLASTX output, WMM for sites | DP |
| EbEST (73) (http://ares.ifrc.mcw.edu/EBEST/ebest.html) | Human, other | dbEST | BLASTN, EST clustering, Smith–Waterman-based gapped alignment | 3′-UTR detection, assembly of EST-tagged exons |
| Est2genome (74) | | EST or cDNA, preferably BLASTN output | Modified Smith–Waterman, Needleman–Wunsch algorithm | No |
| GeneSeqer (67,68) (http://bioinformatics.iastate.edu/cgi-bin/gs.cgi) | Arabidopsis, maize, generic plant | dbEST or EST database or proteins | Spliced alignment, splice recognition with SplicePredictor if missing EST match | Yes |
| GeneWise (60) (http://www.sanger.ac.uk/Software/Wise2/genewiseform.shtml) | Human | One protein or a HMM profile | Global alignment translated ORF/protein | DP (dynamite) |
| GENQUEST (http://compbio.ornl.gov/Grail-bin/EmptyGenquestForm) | | dbEST, SwissProt, Prosite, BLOCKS, GSDB | Smith–Waterman, Blast, Fasta | |
| ICE (64) (http://theory.lcs.mit.edu/ice) | | dbEST, OWL | Look-up | DP |
| INFO (63) | | Nr | 25mer look-up table, protein/protein alignments scored with PAM40, PAM120, PAM250, BLO62 | No |
| ORFgene2 (61) (http://l25.itba.mi.cnr.it/~webgene/wwworfgene2.html) | Human, mouse, Drosophila, Aspergillus, Arabidopsis, Caenorhabditis | SwissProt | BlastP, WAM for splice sites, identity score on frequencies of dipeptides | Compatibility graph, DP |
| PredictGenes (http://cbrg.inf.ethz.ch/Server/subsection3_1_8.html) | Invertebrates, vertebrates, prokaryotes, plants | SwissProt | PAM250 | DP |
| PROCRUSTES

(59

)(h

ttp

://w

ww

-hto

.usc

.ed

u/s

oft

war

e/p

rocr

ust

es/w

ww

serv

.htm

l)V

erte

bra

tes

One

hom

olo

gous

pro

tein

Pro

tein

/pro

tein

alig

nm

ents

score

dw

ith

PA

M120

DP

Pro

-Gen

(83

)(h

ttp

://w

ww

.an

cho

rgen

.co

m/p

ro_

gen

/pro

_gen

.htm

l)T

wo

gen

om

icse

quen

ces

Ali

gnm

ent

of

tran

slat

edse

quen

ces

score

dw

ith

PA

M120

DP

RO

SE

TT

A(8

0)

(htt

p:/

/cro

sssp

ecie

s.lc

s.m

it.e

du

/)H

um

an,

mouse

Tw

ogen

om

icse

quen

ces

GL

AS

S(g

lobal

alig

nm

ent

syst

em),

PA

M20,

Gen

scan

met

hod

for

spli

cesi

tes

DP

SG

P-1

(82

)(h

ttp

://s

oft

.ice

.mp

g.d

e/sg

p-1

)V

erte

bra

tes,

angio

sper

ms

Tw

ogen

om

icse

quen

ces

or

apai

rwis

elo

cal

alig

nm

ent

outp

ut

Loca

lal

ignm

ent

DP

SIM

4(6

9)

All

eukay

ote

scD

NA

/gen

om

icH

SP

from

Bla

stN

oS

LA

M(8

5)

(htt

p:/

/bab

oon

.mat

h.b

erk

eley

.ed

u/~

syn

ten

ic/s

lam

.htm

l)H

um

an,

mouse

Tw

ogen

om

icse

quen

ces

Gen

eral

ized

pai

rH

MM

DP

Sp

idey

(70

)(h

ttp

://w

ww

.ncb

i.n

lm.n

ih.g

ov

/IE

B/R

esea

rch

/Ost

ell/

Sp

idey

/in

dex

.htm

l)V

erte

bra

tes,

Dro

sophil

a,

C.e

legans,

pla

nt

One

gen

om

icse

quen

ce/s

etof

mR

NA

sT

wo

Bla

sts:

hig

hst

ringen

cyan

dlo

wst

ringen

cyS

YN

CO

D(7

2)

(htt

p:/

/l2

5.i

tba.

mi.

cnr.

it/~

web

gen

e/w

ww

syn

cod

.htm

l)H

um

an,

mouse

,D

roso

phil

a,

Ara

bid

opsi

s,A

sper

gil

lus,

Caen

orh

abdit

is

BL

AS

TN

outp

ut

Sil

ent/

repla

cem

ent

rati

o,

Monte

Car

losi

mula

tions

No

TA

P(7

5)

(htt

p:/

/sap

ien

s.w

ust

l.ed

u/~

zkan

/TA

P/)

Hum

an,

mouse

,D

roso

phil

adbE

ST

WU

-BL

AS

TN

,S

IM4

Yes

Uto

pia

(84

)A

lleu

kar

yote

sT

wo

gen

om

icse

quen

ces

Loca

lal

ignm

ent

Yes

DP

,d

yn

amic

pro

gra

mm

ing

;W

AM

,w

eig

ht

arra

ym

atri

x;

WM

M,

wei

ght

mat

rix

met

hod.


at UC

SF Library and C

enter for Know

ledge Managem

ent on Decem


Dow

nloaded from


the further (very strong) hypothesis that the correspondingexons in the two genes have roughly the same length. More¯exibility is allowed by algorithms which do not assume thatthe gene structure is conserved, as in SGP-1 (82), Pro-Gen (83)and Utopia (84). All these programs, except ROSETTA, canperform the sequence comparison at the protein level. SGP-1is also able to use a nucleotide level alignment; it furthercombines sequence comparison with information comingfrom signal sensors. It seems nevertheless to perform worsethan Pro-Gen and Utopia on sequences not so closely relatedor for organisms for which it has not been speci®cally trained(P.Blayo, S.Aubourg, P.Peterlongo, P.RouzeÂ and M.F.Sagot,manuscript in preparation). However, SGP and a more recentversion of Utopia allow the prediction of partial genes as wellas multiple collinear genes in genomic sequences. Finally, anoriginal method was recently implemented in the SLAMprogram, which introduces a probabilistic cross-species gene®nding algorithm using generalized pair HMMs (85).

All the comparative genomic methods have, theoretically, the advantage of not being species specific. In practice, their performance will depend on the evolutionary distance between the compared sequences. Initial results show that the relationship is not straightforward. Indeed, a greater evolutionary distance allows some algorithms to more accurately discriminate between coding and non-coding sequence conservation. Such programs are often computer intensive and consequently much work remains to be done. In particular, a major challenge that could considerably improve the performance of gene finding programs would be to introduce multiple comparisons into these methods.

In order to retrieve only relevant information from homology searches against databases, the use of such programs must be coupled, if not integrated, with other specific programs to eliminate repeated sequences (SINEs, LINEs, etc.), which are very frequent in human genomic sequences (about one-quarter of the genome). Examples of such programs are RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html) and CENSOR (86).

Intrinsic approaches

Unlike most of the 'spliced alignment' approaches described in the previous section, which aim at producing a (single) gene structure based on similarities to known sequences, intrinsic gene finders (Table 3) aim at locating all the gene elements that occur in a genomic sequence, including possible partial gene structures at the border of the sequence.

To efficiently deal with the exponential number of possible gene structures defined by potential signals, almost all intrinsic gene finders use dynamic programming (DP) to identify the most likely gene structures according to the evidence defined by both content and signal sensors. All such gene modeling strategies can be formulated with a graph language (87). Following Guigó (88), such approaches are said to be exon based or signal based depending on whether a gene structure is considered to be defined by an assembly of segments defining the coding part of the exons (exon based) or by the presence of a succession of signals separated by 'homogeneous' regions (signal based).

In the exon-based category, the gene assembly is separated from the coding segments prediction step. The goal is to find the highest scoring genes, the gene score being a simple function (usually the sum) of the scores of the assembled segments. The strength of this two-step process comes from the fact that the score of each segment can be quite complex and may depend on some global characteristics of the coding part of exons, such as their lengths, for instance. In theory at least, the segment assembly process can be defined as the search for an optimal path in a directed acyclic graph where vertices represent exons and edges represent compatibility between exons. This is the approach adopted by the GeneId (89), GenView2, GAP3 (90), FGENE (28) and DAGGER (91) programs, where the initially adopted algorithms run in a time proportional to the square of the number of predicted segments. Recently, the DP algorithm GenAmic (88) solved the problem in a running time that grew only linearly with the number of potential segments, which itself is in the worst case quadratic with the sequence length (however, the assumption that in-frame stop codons occur following a Poisson process suffices to make the expected number of possible coding segments linear with the length of the sequence). It is currently integrated in the GeneId program as well as in FGENE, whereas the GeneGenerator (92) program has its own linear running time algorithm.
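The quadratic-time assembly described above can be sketched as follows, under simplifying assumptions: candidate exons are hypothetical (start, end, score) triples, two exons are considered compatible when they do not overlap (frame compatibility is ignored), and the gene score is the sum of the exon scores. This illustrates the principle only and is not the algorithm of any of the cited programs.

```python
def assemble_exons(exons):
    """Return (best_score, chain) for the highest scoring chain of exons.

    exons: list of (start, end, score) with inclusive coordinates.
    Runs in time quadratic in the number of candidate exons.
    """
    if not exons:
        return 0.0, []
    # Sorting by end guarantees that any compatible predecessor comes first.
    order = sorted(exons, key=lambda e: e[1])
    best = []  # best[i]: score of the best chain ending with order[i]
    back = []  # back[i]: index of the previous exon in that chain, or None
    for i, (start, _end, score) in enumerate(order):
        best_i, back_i = score, None
        for j in range(i):
            # Edge j -> i exists when exon j ends before exon i starts.
            if order[j][1] < start and best[j] + score > best_i:
                best_i, back_i = best[j] + score, j
        best.append(best_i)
        back.append(back_i)
    i = max(range(len(order)), key=best.__getitem__)
    total, chain = best[i], []
    while i is not None:  # follow back-pointers to recover the chain
        chain.append(order[i])
        i = back[i]
    return total, chain[::-1]


candidates = [(1, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0), (210, 300, 3.0)]
print(assemble_exons(candidates))
```

In this toy input the second candidate has the highest individual score but overlaps two others, so the best chain is made of the three mutually compatible exons.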

In the signal-based methods, the gene assembly is produced directly from the set of detected signals. In the simplest signal-based methods, there is an implicit assumption that the 'content' score of a segment is defined as the sum of the local (nucleotide-based) content scores and therefore does not depend on the global characteristics of the segment. This is, for instance, the case in a basic HMM. In such models, a given segment is considered to be generated by a Markov model associated with a state of the automaton (e.g. coding or non-coding). Since such states are unknown, they are called hidden states. In a given state, the 'content' score of a segment is defined as the log-likelihood of the region according to the corresponding Markov model, i.e. as the sum of the logarithms of the probabilities that each nucleotide appears given the k previous nucleotides in the model. Thanks to this assumption, and in theory at least, the gene parsing can be defined as the search for an optimal path in a directed acyclic graph. This search is done using the famous Viterbi algorithm (93), which produces a most likely gene structure and can be considered as a specific instance of the older Bellman shortest path algorithm (94), also used in the first versions of EuGène. Its running time grows linearly with the length of the sequence. Krogh has developed the first gene finder using a HMM, ECOPARSE (95), for Escherichia coli. The basic HMM model can be made more sophisticated by taking into account the length of the regions in the score. However, this leads to a quadratic running time. Additional assumptions make the algorithms usable in practice. This results in the so-called hidden semi-Markov models or generalized HMM. Such models are used in HMM-based programs like Genscan, Genie (44,96), GeneMark.hmm, FGENESH (A.Salamov and V.Solovyev, unpublished data; see http://genomic.sanger.ac.uk/) and GRPL (97).
HMMgene (98) and VEIL employ a slightly different method called CHMM, for class HMM. Interestingly, the GRPL program introduces a new classification method for functional sites surrounding exons, called the reference point logistic. Other kinds of classifiers have already been integrated in some gene finders, such as a decision tree system in MORGAN or discriminant analysis in FGENE and MZEF.

Table 3. Ab initio gene prediction programs (possibly with homology integration)

| Program | Organism | Gene elements | Gene model | Homology |
|---|---|---|---|---|
| DAGGER (91) | | Site scores | Directed acyclic graphs | |
| EuGène (31) (http://www.inra.fr/bia/T/EuGene) | Arabidopsis | Three-periodic IMM for exons, one IMM for introns, one for intergenic regions, one for UTR. NetGene2/SplicePredictor for splice sites | DP | EST/cDNA, protein |
| GeneId3 (89) (http://www1.imim.es/geneid.html) | Vertebrates, plants | Rule-based method; WAM, discriminant analysis | DP | EST |
| GENEFINDER (28): FGENE, FEX… (http://genomic.sanger.ac.uk/gf/gf.html; http://www.softberry.com/berry.phtml) | Human, mouse, Drosophila, Caenorhabditis elegans, yeast, dicots, monocots, Schizosaccharomyces pombe, Neurospora crassa | Linear discriminant analysis | DP | Protein |
| GENEFINDER (Green) | | Log likelihood ratio score matrix on MM | DP | |
| GeneGenerator (92) | Maize | Logitlinear models for splice sites, start; 3rd to 5th order MM for exons and introns | DP | |
| GeneMark (29) (http://opal.biology.gatech.edu/GeneMark/genemark24.cgi) | Prokaryotes, eukaryotes | 5th order MM (homogeneous for introns, three-periodic for exons) | No | |
| GeneMark.hmm (35) (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi) | Human, mouse, Drosophila, Gallus gallus, Arabidopsis, rice, maize, Chlamydomonas reinhardtii, C.elegans, Hordeum vulgare, Triticum aestivum | 5th order MM (homogeneous for introns, three-periodic for exons) | GHMM, DP | Under development |
| GeneModeler (57) (ftp://ftp.tigr.org/pub/software/gm/) | Eukaryotes | Nucleotide and dinucleotide composition, consensus for splice sites | Rule-based method | |
| GeneParser (27) (http://beagle.colorado.edu/~eesnyder/GeneParser.html) | Vertebrates | NN | DP | EST |
| Genie (44,96) (http://www.fruitfly.org/seq_tools/genie.html) | Drosophila, human, other | NN | GHMM, DP | Protein |
| GenLang (154) (http://www.cbil.upenn.edu/genlang/genlang_home.html) | Vertebrates, Drosophila, dicots | Grammar rules, WAM, hextuple frequencies… | Chart parsing, DP | |
| GenomeScan | Vertebrates | Genscan method, BLASTP or BLASTX | GHMM, DP | Protein |
| GENSCAN (30) (http://genes.mit.edu/GENSCAN.html) | Vertebrates, Arabidopsis, maize | WAM for acceptor; MDD for donor; 5th order MM (homogeneous for introns, three-periodic for exons) | GHMM, DP | Protein, GenomeScan (99) |
| GENVIEW2 (25) (http://l25.itba.mi.cnr.it/~webgene/wwwgene.html) | Human, mouse, diptera | Linear combination, dicodon statistic | DP | |
| GlimmerM (33,34) (salzberg@cs.jhu.edu) | Small eukaryotes, Arabidopsis, rice | Three-periodic IMM for exons (order 0–8), IMM for introns, 2nd order MM for splice sites | DP | |
| GRAIL/GAP3 (90,155) (http://compbio.ornl.gov/Grail-bin/EmptyGrailForm) | Human, mouse, Arabidopsis, Drosophila | NN | DP | EST, cDNA |
| GRPL (97) | Human, Drosophila, Arabidopsis | Reference point logistic for splice sites, 5th order MM (homogeneous for introns, three-periodic for exons) | GHMM, DP | Protein |
| HMMgene (98) (http://www.cbs.dtu.dk/services/HMMgene/) | Vertebrates, C.elegans | Three-periodic 4th order MM for exons, 3rd order MM for introns | CHMM | |
| MORGAN (48) (http://www.cs.jhu.edu/labs/compbio/morgan.html) | Vertebrates | Decision tree system | DP | |
| MZEF (26) (http://argon.cshl.org/genefinder/) | Human, mouse, Arabidopsis, fission yeast | Quadratic discriminant analysis | No | |
| SORFIND (24) | | Matrix method for start and splice sites, hexamer usage (Fourier measure) | No | |
| Twinscan (100) | Mouse, human | Genscan method; 5th order MM for UTR and intergenic, WAM for acceptor sites | GHMM | Genomic sequence |
| VEIL (47) (http://www.cs.jhu.edu/labs/compbio/veil.html) | Vertebrates | HMM | DP | |
| Xpound (156) (http://bioweb.pasteur.fr/seqanal/interfaces/xpound-simple.html) | Human | Three-periodic 1st order MM for exons, 1st order MM for introns and intergenic | HMM | |

CHMM, class HMM; GHMM, generalized HMM; IMM, interpolated MM; MM, Markov model.
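The parsing of a sequence by a basic HMM, as described above for the signal-based methods, can be sketched with a toy two-state model. Everything numerical here is an assumption made for illustration: the two states, the transition probabilities and the emission probabilities (a zero-order model, i.e. k = 0, in which the hypothetical coding state simply favours G and C) are invented and do not come from any of the cited programs.

```python
import math

# Hypothetical toy model: all probabilities below are invented.
STATES = ("coding", "noncoding")
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {
    "coding": {"coding": math.log(0.9), "noncoding": math.log(0.1)},
    "noncoding": {"coding": math.log(0.1), "noncoding": math.log(0.9)},
}
LOG_EMIT = {
    "coding": {b: math.log(p) for b, p in
               {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}.items()},
    "noncoding": {b: math.log(p) for b, p in
                  {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}.items()},
}


def viterbi(seq, states=STATES, log_start=LOG_START,
            log_trans=LOG_TRANS, log_emit=LOG_EMIT):
    """Return the most likely state path for seq (time linear in len(seq))."""
    if not seq:
        return []
    score = {s: log_start[s] + log_emit[s][seq[0]] for s in states}
    pointers = []
    for base in seq[1:]:
        new_score, ptr = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[s][base]
            ptr[s] = prev
        score = new_score
        pointers.append(ptr)
    state = max(states, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(pointers):  # trace back from the last position
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


print(viterbi("AAAAGGGGGGAAAA"))
```

The content score of each state is accumulated as a sum of log-probabilities, which is what makes the dynamic programming recursion valid. A generalized HMM would in addition score the length of each region, at the cost of the higher running time mentioned above.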

Integrated approaches

Aware of the added value provided by database similarities, authors are now combining both intrinsic and extrinsic approaches in recent gene predictors and, in older software, updates are made to add information from homology. A pioneer in the area was the GSA program (Gene Structure Assembly), born from the fusion between AAT and Genscan, and whose results on the Burset and Guigó dataset are better than those obtained with the two programs separately (X.Huang, personal communication). GenomeScan (99) is Burge's own extension of Genscan to incorporate similarity with a protein retrieved by BLASTX or BLASTP. Genes predicted by GenomeScan have a maximum probability conditional on such similarity information. GenomeScan is thus able to accurately predict coding regions missed by both Genscan and BLASTX used alone. Although no paper exists describing them, FGENESH+ and FGENESH_C are other extensions of an existing algorithm, FGENESH, that use similarity to a protein or a cDNA sequence, respectively, to improve gene prediction. In fact, most intrinsic prediction programs now offer (or will soon offer) the possibility of integrating similarities with expressed sequences to confirm their prediction (see Table 3). The integration of similarity between genomic sequences has so far been considered only in FGENES-H2 and Twinscan (100), which use the FGENESH and Genscan programs, respectively. Such integration represents a very promising approach that will undoubtedly be much further developed in the future.

A highly integrative approach is used in the EuGène program: it combines NetGene2 and SplicePredictor for splice site prediction, NetStart (54) for translation initiation prediction, IMM-based content sensors and similarity information from protein, EST and cDNA matches. All these sensors are weighted, with the weights being optimized in order to maximize the number of successes (as is done in the GeneParser and DAGGER programs). This approach produces reliable software, as attested by the results obtained on the A.thaliana genes that were presented at the Georgia Tech In silico Biology International Conference in November 1999. Its good performance probably relies to a large part on the maximum of success criterion and on the fact that the program is using a specific model for intergenic regions. It is also worth mentioning GAZE (http://ismb01.cbs.dtu.dk/GeneFinding.html#A304), which should allow the integration of arbitrary prediction information from multiple sources supplied by the user.

In the same order of ideas, it is possible to directly combine the predictions of several programs in order to obtain a sort of consensus. Such an approach was tested by several authors and involved different software (17,101,102). DIGIT (http://ismb01.cbs.dtu.dk/GeneFinding.html#A303) is one such program. It integrates FGENESH, Genscan and HMMgene. It is quite obvious that only keeping exons shared by two or more predictions has the advantage of significantly decreasing the number of over-predictions but may lead to a poor sensitivity and possible inconsistencies at the gene level. However, promising results were recently obtained with a new flexible method for combining any number of gene prediction systems, as a combination of experts, inside a Bayesian framework (103).
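The simplest form of consensus mentioned above, keeping only the exons shared by two or more predictions, can be sketched as follows. The program names, the exon representation as (start, end, strand) and the requirement of identical coordinates are assumptions for illustration; this is not the method of DIGIT or of the Bayesian framework cited.

```python
from collections import Counter


def consensus_exons(predictions, min_support=2):
    """Keep exons reported with identical coordinates by enough programs.

    predictions: dict mapping a program name to its list of
    (start, end, strand) exons.
    """
    support = Counter()
    for exons in predictions.values():
        support.update(set(exons))  # each program votes once per exon
    return sorted(e for e, n in support.items() if n >= min_support)


# Hypothetical outputs of three gene finders on the same sequence.
predictions = {
    "program_A": [(100, 200, "+"), (300, 400, "+")],
    "program_B": [(100, 200, "+"), (305, 400, "+")],
    "program_C": [(100, 200, "+"), (300, 400, "+"), (900, 950, "-")],
}
print(consensus_exons(predictions))
```

The exon predicted only by program_C is discarded, which reduces over-prediction, but so is the variant (305, 400) of program_B: the resulting exon set is not guaranteed to form a consistent gene structure, which is the inconsistency at the gene level noted above.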

Finally, it must be mentioned that there exist annotation platforms that allow the running of several selected gene prediction programs and/or which graphically display their results. Among the more widely used are Genotator (104), MagPie (105) and Ensembl (106). Although such platforms are not prediction programs themselves, they gather evidence obtained from ab initio or homology-based prediction programs and thus are complementary and useful tools that facilitate either human driven or automated annotations.

QUALITY OF THE PREDICTIONS AND PRINCIPAL REMAINING PROBLEMS

Prediction accuracy

For numerical data on the quality of the predictions obtained by most ab initio programs, the reader is invited to see Burset and Guigó (15), Rogic et al. (16) and Pavy et al. (17) or to refer to the homepage of each program, which sometimes provides such data.

Although the statistical properties of coding regions allow for a good discrimination between large coding and non-coding regions, the exact identification of the limits of the coding parts of the exons or of gene boundaries remains difficult. In the first case, predicted coding region limits are often incorrect. In the second case, the predicted structure frequently splits a single true gene into several or, alternatively, merges several genes into one. To address these problems, several teams are currently working on the terminal parts of genes. Already, a program especially dedicated to predicting 3′-terminal exons (107) and another for identifying 5′-terminal exons (108) have been made available from M. Zhang's laboratory. Such problems are, however, very complex, as intergenic and intronic sequences do not differ much. Furthermore, specific signals in the 5′- or 3′-ends of genes (e.g. the TATA box and the polyadenylation signal), which could be expected to be useful for predicting gene boundaries, are often too variable. Once more, improvements can be achieved by looking for a combination of several signals together with compositional biases (109). Sometimes, however, such signals are not even present (53,110).

It is clear, and has recently been underlined (111), that gene prediction accuracy drops significantly with large DNA sequences. This is due to a decrease in gene density and to the presence of larger introns.

No exhaustive evaluation of the homology-based prediction programs has been published so far, but some general comments can be made that are based on both experience and the results discussed in the papers describing each algorithm. As expected, the programs that use expression data, such as protein or mRNA sequences, have the advantage of generating fewer false predictions, i.e. they tend to have good specificity. The obvious counterpart is of course that when no gene is predicted, this does not imply that no gene is present. Moreover, determination of the complete gene structure by such methods is also difficult or impossible when either the target expressed sequence is partial or when the evolutionary distance between the compared sequences is too great. However, even partial hits can give useful clues to the presence of a gene and, therefore, the homology-based methods are important in all annotation processes.
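The accuracy measures used in this discussion can be made concrete at the nucleotide level. The sketch below follows the convention common in the gene prediction literature, where sensitivity is the proportion of true coding nucleotides that are predicted as coding and specificity is the proportion of predicted coding nucleotides that are truly coding. The function names and the coordinate convention (1-based, inclusive) are assumptions.

```python
def covered_positions(exons):
    """Set of nucleotide positions covered by (start, end) exons."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_sn_sp(true_exons, predicted_exons):
    """Return (sensitivity, specificity); None when a measure is undefined."""
    truth = covered_positions(true_exons)
    predicted = covered_positions(predicted_exons)
    true_positives = len(truth & predicted)
    sensitivity = true_positives / len(truth) if truth else None
    specificity = true_positives / len(predicted) if predicted else None
    return sensitivity, specificity


# A partial but exact hit: nothing false is predicted, yet half is missed.
print(nucleotide_sn_sp([(1, 100)], [(1, 50)]))
```

The example mirrors the behaviour described for methods based on expressed sequences: a partial match yields few false positives (high specificity) while part of the gene remains undetected (lower sensitivity).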

Whatever the approach, gene prediction depends, to a large extent, on the current biological knowledge, especially knowledge at the molecular level of gene expression which keeps evolving and accumulating. Besides mainstream mechanisms, some unexpected non-canonical cases are regularly reported in the literature, which again increase the complexity of the problem.

Pitfalls and issues to be addressed

Several issues make the problem of eukaryotic gene finding extremely difficult. (i) Very long genes: for example, the largest human gene, the dystrophin gene, is composed of 79 exons spanning nearly 2.3 Mb (112). (ii) Very long introns: again, in the human dystrophin gene, some introns are >100 kb long and >99% of the gene is composed of introns. (iii) Very conserved introns or 3′-UTRs (113), either between different species or within gene families: this is particularly a problem when gene prediction is addressed through similarity searches. A comparative analysis of 77 orthologous mouse and human gene pairs concluded that 56% of 3′-UTRs are covered by alignable blocks (114). (iv) Very short exons: some exons are only 3 bp long in Arabidopsis genes and probably even 1 bp for the coding part of exons at either end of the coding sequence, meaning that start or stop codons can be interrupted by an intron (S.Aubourg, K.Vandepoele, P.Déhais, C.Mathé and P.Rouzé, personal communication). Such small exons are easily missed by all content sensors, especially if bordered by large introns. The more difficult cases are those where the length of a coding exon is a multiple of three (typically 3, 6 or 9 bp long), because missing such exons will not cause a problem in the exon assembly as they do not introduce any change in the frame.
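The last remark of point (iv) is a matter of simple arithmetic, which the following sketch makes explicit. The function names and the exon lengths are hypothetical; only coding lengths are considered.

```python
def downstream_phase(coding_exon_lengths):
    """Reading-frame phase (0, 1 or 2) reached after the given coding exons."""
    return sum(coding_exon_lengths) % 3


def skipping_is_frame_neutral(coding_exon_lengths, skipped_index):
    """True if omitting one exon leaves the downstream frame unchanged."""
    kept = (coding_exon_lengths[:skipped_index]
            + coding_exon_lengths[skipped_index + 1:])
    return downstream_phase(kept) == downstream_phase(coding_exon_lengths)


print(skipping_is_frame_neutral([120, 6, 95], 1))  # 6 bp exon omitted
print(skipping_is_frame_neutral([120, 7, 95], 1))  # 7 bp exon omitted
```

Omitting a 6 bp exon leaves the frame of every downstream exon intact, so the assembly step receives no signal that something is missing, whereas omitting a 7 bp exon shifts the frame and is likely to expose an in-frame stop codon.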

Other issues could be addressed by making the gene model currently adopted in most gene finders more sophisticated.

(i) Overlapping genes: though very rare in eukaryotic genomes, there are some documented cases in animals as well as in plants (115). The number of such cases will probably increase as soon as we consider them as probable. Overlapping generally involves the 3′-UTR part of genes, but it can also happen that there is a gene within an intron of another gene [the first example was reported by Henikoff et al. (116)].

(ii) Polycistronic gene arrangement: although this situation was initially thought to occur only in prokaryotes, polycistronic genes have been found in eukaryotes. Many such cases are known for snoRNA genes in plants (117), as well as for protein encoding genes in nematodes, and so far isolated examples have been observed in mammals (118).

(iii) Frameshifts: some sequences stored in databases may contain errors (either sequencing errors or simply errors made when editing the sequence) resulting in the introduction of artificial frameshifts (deletion or insertion of one base). Such frameshifts greatly increase the difficulty of the computational gene finding problem by producing erroneous statistics and masking true solutions. Some programs are said to be able to detect frameshifts, like GeneMark and AAT. Some homology-based programs, such as Utopia and Pro-Frame (119), can handle such cases. Some programs specifically dedicated to this problem exist, mainly for prokaryotic genomes, but also for coding eukaryotic sequences such as FSED (120). More recent programs such as FrameD (121) and ESTScan (122) are able to deal with noisy sequences and are typically aimed at detecting and correcting frameshifts in cDNA and more specifically in EST sequences.

(iv) Introns in non-coding regions: there are genes for which the genomic region corresponding to the 5′- and/or 3′-UTR in the mature mRNA is interrupted by one or more intron(s). The extremity of the gene is then composed of non-coding exons and intron(s). It was shown that such non-coding exons might also support alternative splicing (123). Some of the programs that exploit similarities with ESTs or cDNAs, such as SplicePredictor, are actually able to predict such introns. Without such evidence, and since the base composition of UTRs is more intron-like than coding exon-like (42), most programs simply ignore this problem.

(v) Non-canonical splice sites (reviewed in 124): they are not really handled by any program yet. As long as examples of non-canonical introns are provided in the training set, programs such as GeneParser or NetPlantGene/NetGene2 (and therefore EuGène) should be able to identify some splice sites that do not use a GT-AG rule. The new FGENESH program (A.Salamov and V.Solovyev, unpublished) specifically allows for the existence of a GC donor site instead of a GT. Recently, mammalian annotated genes were checked against matching EST sequences, producing new sets of EST-supported non-canonical splice sites (125). It would be interesting to do the same work on other genomes, as it provides very useful data that could be integrated into gene prediction programs.

(vi) Cases of an alternative biological processing. (a) Alternative transcription start: e.g. three alternative promoters regulate the transcription of the 14 kb full-length dystrophin mRNAs and four 'intragenic' promoters control that of smaller isoforms (112). Another problem related to promoters concerns the fact that there is not one single class of core promoter. Instead, many combinations of small elements are possible (53). (b) Alternative splicing: EST-based methods are widely used to identify more systematically such events and to understand their roles (126,127). A recent review on this topic (128) reports an estimated 35–59% of human genes showing evidence for at least one alternative splice site form. Some gene prediction programs try to handle this through the identification of sub-optimal exons (Genscan and MZEF) and sub-optimal gene structures [GeneParser, HMMgene, GeneGenerator and FGENES-M (V.Solovyev, unpublished data)]. Nevertheless, a more relevant approach would consist of improving the identification of the intronic and/or exonic signals that dictate the choice of alternative sites (129). (c) Alternative polyadenylation: here again, EST data are very informative and lead to an estimated 20% of human transcripts showing evidence of alternative polyadenylation (130). (d) Alternative initiation of translation: finding the right AUG initiator is still a major concern for gene prediction methods. The main reason is that experimental data on native proteins remain scarce. The biological process itself is far from simple and is therefore not yet fully elucidated [for a recent review on the initiation of translation, see Kozak (131)]. Briefly, the rule stating that the first AUG in the mRNA is the initiator codon can be escaped through three mechanisms: context-dependent leaky scanning, re-initiation and direct internal initiation. Furthermore, it has been observed that a non-AUG triplet can sometimes act as the functional codon for translation initiation, as ACG in Arabidopsis (132) or CUG in human sequences (133). However, no current program considers such alternative forms.
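Returning to point (v) of the list above, the kind of check needed to build a set of EST-supported non-canonical splice sites can be sketched as follows. The function name, the coordinate convention (0-based start, exclusive end) and the three category labels are assumptions; only the GT-AG rule and the GC donor variant are taken from the text, and every other boundary pair is simply reported as 'other'.

```python
def classify_intron(genomic, start, end):
    """Classify an intron by its terminal dinucleotides.

    genomic: genomic sequence; the intron is genomic[start:end].
    """
    if end - start < 4:
        raise ValueError("intron too short to carry both dinucleotides")
    intron = genomic[start:end].upper()
    donor, acceptor = intron[:2], intron[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical GT-AG"
    if (donor, acceptor) == ("GC", "AG"):
        return "GC donor"
    return "other"


# Hypothetical intron coordinates deduced from a spliced EST alignment.
print(classify_intron("AAGGTAAGTTTCAGGA", 3, 14))
print(classify_intron("AAGGCAAGTTTCAGGA", 3, 14))
```

Applied to every intron implied by spliced alignments of ESTs on a genome, such a tally would give the frequencies of non-canonical boundaries that a gene finder could then be trained on.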

Some of these cases remain rare enough so that taking them into account in the gene prediction algorithms would lower the overall accuracy of the programs while greatly increasing the level of difficulty. Current programs thus aim at optimizing performance on the majority of data. This is laudable, but one must be aware that this is at the expense of obtaining a good performance on 'outliers' of the population. This may be a problem, particularly if one considers that such outlier cases appear marginal only because nobody (including 'wet' biologists) is looking for them and that they may in fact be more frequent than suspected. As mentioned, this may be the case for overlapping genes.

Problems with databases and/or training sets

A major concern, already stressed by Claverie (11), is that existing sensors are very conservative since they nearly always rely on already known sequences, either in the form of training sets or of databases against which homology searches are performed. Attempts to tackle this problem have been proposed (134,135), which were dedicated to prokaryotic genomes and were anyway not fully satisfactory. However, it seems to be a natural and reasonable procedure to first look for what is already known and then to try to extrapolate from our current knowledge. This approach should probably be considered only as a first step. A first parsing of the data could be done using one of these 'conservative' methods and, either in a second run or when unusual situations occur, the possibility of non-canonical cases should be examined, allowing a reconsideration of the first prediction proposed.

Another remark about training sets is that several datasets may be better than a single big one. Indeed, an important fact considered by only a few programs is that, from a biological point of view, there is not a unique type of gene. Besides the fact that genes code for various types of proteins, they also exhibit differences in their level of expression, condition and cellular location of expression. There is therefore no reason to assume that only a unique gene model exists (136,137). Taking this into account has already proved to be relevant in the case of E.coli (138). As a consequence, an interesting problem would be to create such sub-data sets, with enough data to derive statistical models in a biologically relevant way. This problem was addressed in the case of prokaryotic genomes by GeneMark-Genesis, which automatically clusters ORFs on the basis of their codon usage and derives Markov models for each cluster obtained (139), and more recently by GeneMarkS (140). In eukaryotes, the problem is again more complex since, within a gene, some exons may better fit one model and others another model. There are nevertheless some clues indicating that the same kind of procedure would be relevant (141). This remark also applies to signal sensors, e.g. for splice sites. Several consensus sequences could thus be searched for.
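Clustering ORFs on the basis of codon usage first requires each ORF to be summarized by its codon frequencies and a way to compare two such summaries. The sketch below shows one possible representation; the function names and the choice of a sum of absolute differences as distance are assumptions, and the actual procedure of GeneMark-Genesis is not described here.

```python
from collections import Counter


def codon_usage(orf):
    """Relative frequency of each codon in an in-frame ORF sequence."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    counts = Counter(orf[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def usage_distance(usage_a, usage_b):
    """Sum of absolute frequency differences (0 = identical, 2 = disjoint)."""
    codons = set(usage_a) | set(usage_b)
    return sum(abs(usage_a.get(c, 0.0) - usage_b.get(c, 0.0)) for c in codons)


# Two hypothetical ORF fragments encoding alanine with different codons.
print(usage_distance(codon_usage("GCCGCC"), codon_usage("GCAGCA")))
```

Any standard clustering method could then be applied to these distances, with a separate Markov model trained on each resulting group of ORFs.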

Last but not least, whatever method is adopted (intrinsic or extrinsic), there is a great need for 'clean' sequence databases, i.e. for databases that are not redundant, contain reliable and relevant annotations and provide all necessary links to further data (142,143). Without such conditions, much false information can be (and currently is) propagated by the predictions made. While in extrinsic gene prediction the use of erroneous data has consequences only at the level of the analyzed sequence itself, in intrinsic prediction corrupted training sets can dramatically affect the whole performance of the program. In both situations, it is therefore very important to filter the data retrieved from the databanks in order to use only experimentally documented sequences as references and not data coming from predictions, such as those generated by automatic genome annotations.

CONCLUSIONS

The prediction of protein-encoding genes is obviously still in need of improvement, as discussed in the previous section, especially for larger genomes. It must also better tackle the problem of alternative gene models and take into account non-canonical, but nevertheless biologically significant, cases. Since gene prediction leads to a structural annotation of the genomes which is then used for experimentation, it would be wise to weight the predictions by giving a confidence value for each predicted gene, from high for a gene whose full structure has been obtained in a non-ambiguous way using cognate cDNA data to low for a gene whose prediction totally depends on intrinsic approaches.
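One possible way of attaching such a confidence value is sketched below in Python. Only the two extremes (cognate cDNA as high, intrinsic-only as low) come from the text; the intermediate level, the evidence labels and the names `PredictedGene` and `confidence` are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical evidence labels and ranks. Only the extremes are taken from
# the text; the intermediate level is an assumption of this sketch.
EVIDENCE_RANK = {
    "cognate_cdna": 3,     # full structure confirmed by a cognate cDNA
    "est_or_protein": 2,   # assumed: partial support from similarity
    "intrinsic": 1,        # prediction from intrinsic methods only
}
LABELS = {3: "high", 2: "medium", 1: "low"}

@dataclass
class PredictedGene:
    name: str
    evidence: set = field(default_factory=set)

def confidence(gene):
    """Return the label of the strongest evidence supporting the gene."""
    best = max((EVIDENCE_RANK.get(e, 0) for e in gene.evidence), default=0)
    return LABELS.get(best, "unsupported")

genes = [
    PredictedGene("g1", {"intrinsic", "cognate_cdna"}),
    PredictedGene("g2", {"intrinsic"}),
]
report = {g.name: confidence(g) for g in genes}
```

Taking the strongest evidence is only one design choice; a scheme that also penalizes conflicting evidence would be equally compatible with the proposal.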

Moreover, algorithms and software descriptions provided by the authors are too often somewhat superficial; in particular, most papers do not give enough details on precise parameter tuning. Until recently, there was also no common vocabulary adopted among developers, with no common definitions used between different software packages for the phase or frames, etc. It is worth noting an interesting effort to design a general feature format (GFF) (http://www.sanger.ac.uk/Software/formats/GFF/). This was the result of an agreement between many developers. The format aims at standardizing gene predictor outputs and vocabulary. A standard output for all gene predictors allows for the development of common tools that can be used for downstream analysis: evaluation, graphical representation and combination of predictions. A few programs already produce GFF output, such as HMMgene, SGP-1 and the exon assembly program GenAmic, which uses GFF files for both input and output. At a larger scale, the Gene Ontology consortium (http://www.geneontology.org/) aims at providing a structured vocabulary that would allow the description of gene products in any organism (144). Model organism groups have started joining this consortium, largely contributing to the unification of biological information, whose importance was recently emphasized (145).
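To show what a common output format makes possible, the following Python sketch writes and reads one GFF line. It assumes the tab-separated layout of the early GFF versions (seqname, source, feature, start, end, score, strand, frame, optional attributes); the record type `GffRecord`, the helper names and the example values (including the source name `mypredictor`) are hypothetical.

```python
from typing import NamedTuple, Optional

class GffRecord(NamedTuple):
    seqname: str
    source: str            # program that produced the feature
    feature: str           # e.g. "exon"
    start: int             # 1-based, inclusive
    end: int
    score: Optional[float] # None is written as "."
    strand: str            # "+", "-" or "."
    frame: str             # "0", "1", "2" or "."
    attributes: str = ""

def format_gff(rec):
    """Serialize a record as one tab-separated GFF line."""
    score = "." if rec.score is None else f"{rec.score:g}"
    fields = [rec.seqname, rec.source, rec.feature, str(rec.start),
              str(rec.end), score, rec.strand, rec.frame]
    if rec.attributes:
        fields.append(rec.attributes)
    return "\t".join(fields)

def parse_gff(line):
    """Parse one GFF line back into a record."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("a GFF line needs at least 8 tab-separated fields")
    score = None if cols[5] == "." else float(cols[5])
    attributes = cols[8] if len(cols) > 8 else ""
    return GffRecord(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                     score, cols[6], cols[7], attributes)

rec = GffRecord("chr1", "mypredictor", "exon", 100, 250, 0.87, "+", "0",
                'gene "g1"')
line = format_gff(rec)
```

Because every predictor emits the same columns, a single parser of this kind is enough for evaluation or for combining the outputs of several programs.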

With the huge amount of EST and cDNA sequences now available, programs based on the existence of homology with such expressed sequences are playing an increasingly crucial role in current genome annotations, at least for genes for which expression can be shown. Even in the case of genes that are scarcely or tissue-specifically expressed, complementary information can be provided by similarity between genomic sequences. Consequently, a great deal of effort is now expended on trying to gather information from genome comparisons. This is particularly true in the case of the human genome annotation process, where the availability of other complete vertebrate genomes, such as those of mouse and fish, is a great advantage. With the many genome sequencing projects currently under way, and although there are still problems to be solved (146), the comparative genome approach seems very promising not only in the field of gene prediction but also for the identification of regulatory sequences and the deciphering of the so-called junk DNA. The latter has been largely ignored until now, yet much may be expected to be learned from its analysis (147,148). Even if in this review we have discussed only programs that detect protein-coding genes, there is also an undetermined (but probably high) number of genes producing functional non-coding RNAs (reviewed in 149), which may be identified by genomic comparison. More interest is now devoted to such non-coding RNAs (150,151), and this probably stands among the main future directions in computational approaches for genome analysis.

Finally, we wish to again warn the users of gene prediction software that the results produced should be taken with caution: although such results are becoming increasingly reliable, they remain only predictions. These are very useful for speeding up gene discovery and the knowledge mining that follows, but biological expertise remains necessary in order to confirm the existence of a virtual protein and to find or prove its biological function and its condition of expression in the organism.

ACKNOWLEDGEMENTS

We wish to thank the anonymous referees for their very interesting and constructive comments on the manuscript. We are also very grateful to Jim Middleton and to Alain Vignal for revising the English text.

REFERENCES

1. The International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

2. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.

3. Goff,S.A., Ricke,D., Lan,T.H., Presting,G., Wang,R., Dunn,M., Glazebrook,J., Sessions,A., Oeller,P., Varma,H. et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science, 296, 92–100.

4. Myers,E., Sutton,G., Delcher,A., Dew,I., Fasulo,D., Flanigan,M., Kravitz,S., Mobarry,C., Reinert,K., Remington,K. et al. (2000) A whole-genome assembly of Drosophila. Science, 287, 2196–2204.

5. Claverie,J.M., Poirot,O. and Lopez,F. (1997) The difficulty of identifying genes in anonymous vertebrate sequences. Comput. Chem., 21, 203–214.

6. Cho,Y. and Walbot,V. (2001) Computational methods for gene annotation: the Arabidopsis genome. Curr. Opin. Biotechnol., 12, 126–130.

7. Borodovsky,M., Rudd,K.E. and Koonin,E.V. (1994) Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res., 22, 4756–4767.

8. Fickett,J.W. (1996) The gene identification problem: an overview for developers. Comput. Chem., 20, 103–118.

9. Rouzé,P., Pavy,N. and Rombauts,S. (1999) Genome annotation: which tools do we have for it? Curr. Opin. Plant Biol., 2, 90–95.

10. Fickett,J.W. (1996) Finding genes by computer: the state of the art. Trends Genet., 12, 316–320.

11. Claverie,J.M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet., 6, 1735–1744.

12. Guigó,R. (1997) Computational gene identification: an open problem. Comput. Chem., 21, 215–222.

13. Haussler,D. (1998) Computational genefinding. Trends Biochem. Sci., 12–15.

14. Burge,C. and Karlin,S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol., 8, 346–354.

15. Burset,M. and Guigó,R. (1996) Evaluation of gene structure prediction programs. Genomics, 34, 353–367.

16. Rogic,S., Mackworth,A. and Ouellette,F. (2001) Evaluation of gene-finding programs on mammalian sequences. Genome Res., 11, 817–832.

17. Pavy,N., Rombauts,S., Déhais,P., Mathé,C., Ramana,D.V.V., Leroy,P. and Rouzé,P. (1999) Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics, 15, 887–899.

18. Mignone,F., Gissi,C., Liuni,S. and Pesole,G. (2002) Untranslated regions of mRNAs. Genome Biol., 3, reviews0004.1–0004.10.

19. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

20. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

21. Bailey,L.C., Searls,D.B. and Overton,G.C. (1998) Analysis of EST-driven gene annotation in human genomic sequence. Genome Res., 8, 362–376.

22. Fickett,J.W. (1995) ORFs and genes: how strong a connection? J. Comput. Biol., 2, 117–123.

23. Fickett,J.W. and Tung,C.S. (1992) Assessment of protein coding measures. Nucleic Acids Res., 20, 6441–6450.

24. Hutchinson,G.B. and Hayden,M.R. (1992) The prediction of exons through an analysis of spliceable open reading frames. Nucleic Acids Res., 20, 3453–3462.

25. Milanesi,L., Kolchanov,N.A., Rogozin,I.B., Ischenko,I.V., Kel,A.E., Orlov,Y.L., Ponomarenko,M.P. and Vezzoni,P. (1993) GenView: a computing tool for protein-coding regions prediction in nucleotide sequences. In Lim,H.A., Fickett,J.W., Cantor,C.R. and Robbins,R.J. (eds), Proceedings of the Second International Conference on Bioinformatics, Supercomputing and Complex Genome Analysis. World Scientific Publishing, Singapore, pp. 573–588.

26. Zhang,M.Q. (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl Acad. Sci. USA, 94, 565–568. [Erratum (1997) Proc. Natl Acad. Sci. USA, 94, 5495]

27. Snyder,E.E. and Stormo,G.D. (1995) Identification of protein coding regions in genomic DNA. J. Mol. Biol., 248, 1–18.

28. Solovyev,V. and Salamov,A. (1997) The Gene-Finder computer tools for analysis of human and model organisms genome sequences. In Gaasterland,T., Karp,P., Karplus,K., Ouzounis,C., Sander,C. and Valencia,A. (eds), The Fifth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 294–302.

29. Borodovsky,M. and McIninch,J. (1993) GENMARK: parallel gene recognition for both DNA strands. Comput. Chem., 17, 123–133.

30. Burge,C. and Karlin,S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.

31. Schiex,T., Moisan,A. and Rouzé,P. (2001) EuGène: an eukaryotic gene finder that combines several sources of evidence. In Gascuel,O. and Sagot,M.-F. (eds), Lecture Notes in Computer Science, Vol. 2006, First International Conference on Biology, Informatics, and Mathematics, JOBIM 2000. Springer-Verlag, Germany, pp. 111–125.

32. Salzberg,S., Delcher,A., Kasif,S. and White,O. (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res., 26, 544–548.

33. Salzberg,S.L., Pertea,M., Delcher,A.L., Gardner,M.J. and Tettelin,H. (1999) Interpolated Markov models for eukaryotic gene finding. Genomics, 59, 24–31.

34. Delcher,A.L., Harmon,D., Kasif,S., White,O. and Salzberg,S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27, 4636–4641.

35. Lukashin,A.V. and Borodovsky,M. (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res., 26, 1107–1115.

36. Bernardi,G. (1989) The isochore organization of the human genome. Annu. Rev. Genet., 23, 637–661.




37. Montero,L.M., Salinas,J., Matassi,G. and Bernardi,G. (1990) Gene distribution and isochore organization in the nuclear genome of plants. Nucleic Acids Res., 18, 1859–1867.

38. Duret,L., Mouchiroud,D. and Gautier,C. (1995) Statistical analysis of vertebrate sequences reveals that long genes are scarce in GC-rich isochores. J. Mol. Evol., 40, 308–317.

39. Rogozin,I.B. and Milanesi,L. (1997) Analysis of donor splice signals in different organisms. J. Mol. Evol., 45, 50–59.

40. Kleffe,J., Hermann,K., Vahrson,W., Wittig,B. and Brendel,V. (1996) Logitlinear models for the prediction of splice sites in plant pre-mRNA sequences. Nucleic Acids Res., 24, 4709–4718.

41. Brunak,S., Engelbrecht,J. and Knudsen,S. (1991) Prediction of human mRNA donor and acceptor sites from the DNA sequence. J. Mol. Biol., 220, 49–65.

42. Hebsgaard,S.M., Korning,P.G., Tolstrup,N., Engelbrecht,J., Rouzé,P. and Brunak,S. (1996) Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Res., 24, 3439–3452.

43. Tolstrup,N., Rouzé,P. and Brunak,S. (1997) A branch point consensus from Arabidopsis found by non-circular analysis allows for better prediction of acceptor sites. Nucleic Acids Res., 25, 3159–3163.

44. Reese,M.G., Eeckman,F.H., Kulp,D. and Haussler,D. (1997) Improved splice site detection in Genie. In Istrail,S., Pevzner,P. and Waterman,M. (eds), First Annual International Conference on Computational Molecular Biology (RECOMB). ACM Press, New York, NY, pp. 232–240.

45. Zhang,M.Q. and Marr,T.G. (1993) A weight array method for splicing signal analysis. Comput. Appl. Biosci., 9, 499–509.

46. Salzberg,S.L. (1997) A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput. Appl. Biosci., 13, 365–376.

47. Henderson,J., Salzberg,S. and Fasman,K. (1997) Finding genes in human DNA with a hidden Markov model. J. Comput. Biol., 4, 127–141.

48. Salzberg,S., Delcher,A., Fasman,K. and Henderson,J. (1998) A decision tree system for finding genes in DNA. J. Comput. Biol., 5, 667–680.

49. Rabiner,L.R. (1989) A tutorial on hidden Markov models and selected applications for speech recognition. Proc. IEEE, 77, 257–285.

50. Krogh,A. (1998) An introduction to hidden Markov models for biological sequences. In Salzberg,S.L., Searls,D.B. and Kasif,S. (eds), Computational Methods in Molecular Biology. Elsevier, Amsterdam, The Netherlands, pp. 46–63.

51. Patterson,D.J., Yasuhara,K. and Ruzzo,W.L. (2002) Pre-mRNA secondary structure prediction aids splice site prediction. In Altman,R.B., Dunker,A.K., Hunter,L., Lauderdale,K. and Klein,T.E. (eds), Pacific Symposium on Biocomputing, Vol. 7, World Scientific, Singapore, pp. 223–234.

52. Ohler,U. and Niemann,H. (2001) Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet., 17, 56–60.

53. Pedersen,A.G., Baldi,P., Chauvin,Y. and Brunak,S. (1999) The biology of eukaryotic promoter prediction – a review. Comput. Chem., 23, 191–207.

54. Pedersen,A.G. and Nielsen,H. (1997) Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and genome analysis. In Gaasterland,T., Karp,P., Karplus,K., Ouzounis,C., Sander,C. and Valencia,A. (eds), The Fifth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 226–233.

55. Zien,A., Ratsch,G., Mika,S., Scholkopf,B., Lengauer,T. and Muller,K. (2000) Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16, 799–807.

56. Nishikawa,T., Ota,T. and Isogai,T. (2000) Prediction whether a human cDNA sequence contains initiation codon by combining statistical information and similarity with protein sequences. Bioinformatics, 16, 960–967.

57. Fields,C.A. and Soderlund,C.A. (1990) gm: a practical tool for automating DNA sequence analysis. Comput. Appl. Biosci., 6, 263–270.

58. Gelfand,M.S. (1990) Computer prediction of the exon–intron structure of mammalian pre-mRNAs. Nucleic Acids Res., 18, 5865–5869.

59. Gelfand,M.S., Mironov,A.A. and Pevzner,P.A. (1996) Gene recognition via spliced sequence alignment. Proc. Natl Acad. Sci. USA, 93, 9061–9066.

60. Birney,E. and Durbin,R. (1997) Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. Proc. Int. Conf. Intell. Syst. Mol. Biol., 5, 56–64.

61. Rogozin,I.B., Milanesi,L. and Kolchanov,N.A. (1996) Gene structure prediction using information on homologous protein sequence. Comput. Appl. Biosci., 12, 161–170.

62. Gotoh,O. (2000) Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps. Bioinformatics, 16, 190–202.

63. Laub,M.T. and Smith,D.W. (1998) Finding intron/exon splice junctions using INFO, INterruption Finder and Organizer. J. Comput. Biol., 5, 307–321.

64. Pachter,L., Batzoglou,S., Spitkovsky,V.I., Banks,E., Lander,E.S., Kleitman,D.J. and Berger,B. (1999) A dictionary-based approach for gene annotation. J. Comput. Biol., 6, 419–430.

65. Thayer,E., Bystroff,C. and Baker,D. (2000) Detection of protein coding sequences using a mixture model for local protein amino acid sequence. J. Comput. Biol., 7, 317–327.

66. Huang,X., Adams,M.D., Zhou,H. and Kerlavage,A.R. (1997) A tool for analyzing and annotating genomic sequences. Genomics, 46, 37–45.

67. Usuka,J. and Brendel,V. (2000) Gene structure prediction by spliced alignment of genomic DNA with protein sequences: increased accuracy by differential splice site scoring. J. Mol. Biol., 297, 1075–1085.

68. Usuka,J., Zhu,W. and Brendel,V. (2000) Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics, 16, 203–211.

69. Florea,L., Hartzell,G., Zhang,Z., Rubin,G.M. and Miller,W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res., 8, 967–974.

70. Wheelan,S.J., Church,D.M. and Ostell,J.M. (2001) Spidey: a tool for mRNA-to-genomic alignments. Genome Res., 11, 1952–1957.

71. Fukunishi,Y., Suzuki,H., Yoshino,M., Konno,H. and Hayashizaki,Y. (1999) Prediction of human cDNA from its homologous mouse full-length cDNA and human shotgun database. FEBS Lett., 464, 129–132.

72. Rogozin,I.B., D'Angelo,D. and Milanesi,L. (1999) Protein-coding regions prediction combining similarity searches and conservative evolutionary properties of protein-coding sequences. Gene, 226, 129–137.

73. Jiang,J. and Jacob,H.J. (1998) EbEST: an automated tool using expressed sequence tags to delineate gene structure. Genome Res., 8, 268–275.

74. Mott,R. (1997) EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci., 13, 477–478.

75. Kan,Z., Rouchka,E.C., Gish,W.R. and States,D.J. (2001) Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res., 11, 889–900.

76. Delcher,A.L., Kasif,S., Fleischmann,R.D., Peterson,J., White,O. and Salzberg,S.L. (1999) Alignment of whole genomes. Nucleic Acids Res., 27, 2369–2376.

77. Kent,W.J. and Zahler,A.M. (2000) Conservation, regulation, synteny, and introns in a large-scale C. briggsae–C. elegans genomic alignment. Genome Res., 10, 1115–1125.

78. Schwartz,S., Zhang,Z., Frazer,K., Smit,A., Riemer,C., Bouck,J., Gibbs,R., Hardison,R. and Miller,W. (2000) PipMaker – a web server for aligning two genomic DNA sequences. Genome Res., 10, 577–586.

79. Morgenstern,B. (2000) A space-efficient algorithm for aligning large genomic sequences. Bioinformatics, 16, 948–949.

80. Batzoglou,S., Pachter,L., Mesirov,J., Berger,B. and Lander,E.S. (2000) Human and mouse gene structure: comparative analysis and application to exon prediction. Genome Res., 10, 950–958.

81. Bafna,V. and Huson,D. (2000) The conserved exon method for gene finding. In Bourne,P., Gribskov,M., Altman,R., Jensen,N., Hope,D., Lengauer,T., Mitchell,J., Scheeff,E., Smith,C., Strande,S. and Weissig,H. (eds), Eighth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 3–12.

82. Wiehe,T., Gebauer-Jung,S., Mitchell-Olds,T. and Guigó,R. (2001) SGP-1: prediction and validation of homologous genes based on sequence alignments. Genome Res., 11, 1574–1583.

83. Novichkov,P.S., Gelfand,M.S. and Mironov,A.A. (2001) Gene recognition in eukaryotic DNA by comparison of genomic sequences. Bioinformatics, 17, 1011–1018.

84. Blayo,P., Rouzé,P. and Sagot,M.-F. (2002) Orphan gene finding – an exon assembly approach. Theor. Comput. Sci., in press.




85. Pachter,L., Alexandersson,M. and Cawley,S. (2002) Applications of generalized pair hidden Markov models to alignment and gene finding problems. J. Comput. Biol., 9, 389–399.

86. Jurka,J., Klonowski,P., Dagman,V. and Pelton,P. (1996) CENSOR – a program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem., 20, 119–112.

87. Roytberg,M.A., Astakhova,T.V. and Gelfand,M.S. (1997) Combinatorial approaches to gene recognition. Comput. Chem., 21, 229–235.

88. Guigó,R. (1998) Assembling genes from predicted exons in linear time with dynamic programming. J. Comput. Biol., 5, 681–702.

89. Guigó,R., Knudsen,S., Drake,N. and Smith,T. (1992) Prediction of gene structure. J. Mol. Biol., 226, 141–157.

90. Xu,Y., Mural,R.J. and Uberbacher,E.C. (1994) Constructing gene models from accurately predicted exons: an application of dynamic programming. Comput. Appl. Biosci., 10, 613–623.

91. Chuang,J.S. and Roth,D. (2001) Gene recognition based on DAG shortest paths. Bioinformatics, 1, 1–9.

92. Kleffe,J., Hermann,K., Vahrson,W., Wittig,B. and Brendel,V. (1998) GeneGenerator – a flexible algorithm for gene prediction and its application to maize sequences. Bioinformatics, 14, 232–243.

93. Viterbi,A. (1967) Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Informat. Theor., IT-13, 260–269.

94. Bellman,R.E. (1957) Dynamic Programming. Princeton University Press, Princeton, NJ.

95. Krogh,A., Mian,I.S. and Haussler,D. (1994) A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res., 22, 4768–4778.

96. Kulp,D., Haussler,D., Reese,M.G. and Eeckman,F.H. (1996) A generalized Hidden Markov Model for the recognition of human genes in DNA. In States,D.J., Agarwal,P., Gaasterland,T., Hunter,L. and Smith,R.F. (eds), Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 134–142.

97. Hooper,P., Zhang,H. and Wishart,D. (2000) Prediction of genetic structure in eukaryotic DNA using reference point logistic regression and sequence alignment. Bioinformatics, 16, 425–438.

98. Krogh,A. (1997) Two methods for improving performance of a HMM and their application for gene finding. In Gaasterland,T., Karp,P., Karplus,K., Ouzounis,C., Sander,C. and Valencia,A. (eds), The Fifth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp. 179–186.

99. Yeh,R.-F., Lim,L.P. and Burge,C.B. (2001) Computational inference of homologous gene structures in the human genome. Genome Res., 11, 803–816.

100. Korf,I., Flicek,P., Duan,D. and Brent,M.R. (2001) Integrating genomic homology into gene structure prediction. Bioinformatics, 17, S140–S148.

101. Murakami,K. and Takagi,T. (1998) Gene recognition by combination of several gene-finding programs. Bioinformatics, 14, 665–675.

102. Solovyev,V.V. and Salamov,A.A. (1999) INFOGENE: a database of known gene structures and predicted genes and proteins in sequences of genome sequencing projects. Nucleic Acids Res., 27, 248–250.

103. Pavlovic,V., Garg,A. and Kasif,S. (2002) A Bayesian framework for combining gene predictions. Bioinformatics, 18, 19–27.

104. Harris,N.L. (1997) Genotator: a workbench for sequence annotation. Genome Res., 7, 754–762.

105. Gaasterland,T. and Sensen,C.W. (1996) Fully automated genome analysis that reflects user needs and preferences. A detailed introduction to the MAGPIE system architecture. Biochimie, 78, 302–310.

106. Hubbard,T., Barker,D., Birney,E., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V., Down,T. et al. (2002) The Ensembl genome database project. Nucleic Acids Res., 30, 38–41.

107. Tabaska,J., Davuluri,R. and Zhang,M. (2001) Identifying the 3′-terminal exon in human DNA. Bioinformatics, 17, 602–607.

108. Davuluri,R.V., Grosse,I. and Zhang,M.Q. (2001) Computational identification of promoters and first exons in the human genome. Nature Genet., 29, 412–417.

109. Down,T.A. and Hubbard,T.J. (2002) Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res., 12, 458–461.

110. Graber,J.H., Cantor,C.R., Mohr,S.C. and Smith,T.F. (1999) In silico detection of control signals: mRNA 3′-end-processing sequences in diverse species. Proc. Natl Acad. Sci. USA, 96, 14055–14060.

111. Guigó,R., Agarwal,P., Abril,J., Burset,M. and Fickett,J. (2000) An assessment of gene prediction accuracy in large DNA sequences. Genome Res., 10, 1631–1642.

112. Nobile,C., Marchi,J., Nigro,V., Roberts,R.G. and Danieli,G.A. (1997) Exon-intron organization of the human dystrophin gene. Genomics, 45, 421–424.

113. Duret,L., Dorkeld,F. and Gautier,C. (1993) Strong conservation of vertebrate non-coding sequences during vertebrate evolution: potential involvement in post-transcriptional regulation of gene expression. Nucleic Acids Res., 21, 2315–2322.

114. Jareborg,N., Birney,E. and Durbin,R. (1999) Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res., 9, 815–824.

115. Quesada,V., Ponce,M.R. and Micol,J.L. (1999) OTC and AUL1, two convergent and overlapping genes in the nuclear genome of Arabidopsis thaliana. FEBS Lett., 461, 101–106.

116. Henikoff,S., Keene,M.A., Fechtel,K. and Fristrom,J.W. (1986) Gene within a gene: nested Drosophila genes encode unrelated proteins on opposite DNA strands. Cell, 44, 33–42.

117. Leader,D.J., Clark,G.P., Watters,J., Beven,A.F., Shaw,P.J. and Brown,J.W. (1997) Clusters of multiple different small nucleolar RNA genes in plants are expressed as and processed from polycistronic pre-snoRNAs. EMBO J., 16, 5742–5751.

118. Blumenthal,T. (1998) Gene clusters and polycistronic transcription in eukaryotes. Bioessays, 20, 480–487.

119. Mironov,A.A., Novichkov,P.S. and Gelfand,M.S. (2001) Pro-Frame: similarity-based gene recognition in eukaryotic DNA sequences with errors. Bioinformatics, 17, 13–15.

120. Fichant,G.A. and Quentin,Y. (1995) A frameshift error detection algorithm for DNA sequencing projects. Nucleic Acids Res., 23, 2900–2908.

121. Salanoubat,M., Genin,S., Artiguenave,F., Gouzy,J., Mangenot,S., Arlat,M., Billault,A., Brottier,P., Camus,J.C., Cattolico,L. et al. (2002) Genome sequence of the plant pathogen Ralstonia solanacearum. Nature, 415, 497–502.

122. Iseli,C., Jongeneel,C.V. and Bucher,P. (1999) ESTScan: a program for detecting, evaluating, and reconstructing potential coding regions in EST sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol., 138–148.

123. Klein,M., Pieri,I., Uhlmann,F., Pfizenmaier,K. and Eisel,U. (1998) Cloning and characterization of promoter and 5′-UTR of the NMDA receptor subunit epsilon 2: evidence for alternative splicing of 5′-non-coding exon. Gene, 208, 259–269.

124. Sharp,P.A. and Burge,C.B. (1997) Classification of introns: U2-type or U12-type. Cell, 91, 875–879.

125. Burset,M., Seledtsov,I. and Solovyev,V. (2000) Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res., 28, 4364–4375.

126. Croft,L., Schandorff,S., Clark,F., Burrage,K., Arctander,P. and Mattick,J. (2000) ISIS, the intron information system, reveals the high frequency of alternative splicing in the human genome. Nature Genet., 24, 340–341.

127. Modrek,B., Resch,A., Grasso,C. and Lee,C. (2001) Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res., 29, 2850–2859.

128. Modrek,B. and Lee,C. (2002) A genomic view of alternative splicing. Nature Genet., 30, 13–19.

129. Hastings,M.L. and Krainer,A.R. (2001) Pre-mRNA splicing in the new millennium. Curr. Opin. Cell Biol., 13, 302–309.

130. Gautheret,D., Poirot,O., Lopez,F., Audic,S. and Claverie,J.-M. (1998) Alternate polyadenylation in human mRNAs: a large-scale analysis by EST clustering. Genome Res., 8, 524–530.

131. Kozak,M. (1999) Initiation of translation in prokaryotes and eukaryotes. Gene, 234, 187–208.

132. Riechmann,J.L., Ito,T. and Meyerowitz,E. (1999) Non-AUG initiation of AGAMOUS mRNA translation in Arabidopsis thaliana. Mol. Cell. Biol., 19, 8505–8512.

133. Vagner,S., Touriol,C., Galy,B., Audigier,S., Gensac,M., Amalric,F., Bayard,F., Prats,H. and Prats,A. (1996) Translation of CUG- but not AUG-initiated forms of human fibroblast growth factor 2 is activated in transformed and stressed cells. J. Cell Biol., 135, 1391–1402.

134. Audic,S. and Claverie,J.-M. (1998) Self-identification of protein-coding regions in microbial genomes. Proc. Natl Acad. Sci. USA, 95, 10026–10031.




135. Besemer,J. and Borodovsky,M. (1999) Heuristic approach to deriving models for gene finding. Nucleic Acids Res., 27, 3911–3920.

136. Médigue,C., Rouxel,T., Vigier,P., Hénaut,A. and Danchin,A. (1991) Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol., 222, 851–856.

137. Mathé,C., Peresetsky,A., Déhais,P., Van Montagu,M. and Rouzé,P. (1999) Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction. J. Mol. Biol., 285, 1977–1991.

138. Borodovsky,M., McIninch,J.D., Koonin,E.V., Rudd,K.E., Médigue,C. and Danchin,A. (1995) Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Res., 23, 3554–3562.

139. Hayes,W.S. and Borodovsky,M. (1998) How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Res., 8, 1154–1171.

140. Besemer,J., Lomsadze,A. and Borodovsky,M. (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res., 29, 2607–2618.

141. Mathé,C., Déhais,P., Pavy,N., Rombauts,S., Van Montagu,M. and Rouzé,P. (2000) Gene prediction and gene classes in Arabidopsis thaliana. J. Biotechnol., 78, 293–299.

142. Pennisi,E. (1999) Keeping genome databases clean and up to date. Science, 286, 447–450.

143. Smith,T.F. (1998) Functional genomics – bioinformatics is ready for the challenge. Trends Genet., 14, 291–293.

144. The Gene Ontology Consortium (2001) Creating the gene ontology resource: design and implementation. Genome Res., 11, 1425–1433.

145. Brazma,A. (2001) On the importance of standardisation in life sciences. Bioinformatics, 17, 113–114.

146. Miller,W. (2001) Comparison of genomic DNA sequences: solved and unsolved problems. Bioinformatics, 17, 391–397.

147. Makalowski,W. (2000) Genomic scrap yard: how genomes utilize all that junk. Gene, 259, 61–67.

148. Bergman,C. and Kreitman,M. (2001) Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res., 11, 1335–1345.

149. Eddy,S.R. (1999) Noncoding RNA genes. Curr. Opin. Genet. Dev., 9, 695–699.

150. Erdmann,V., Szymanski,M., Hochberg,A., Groot,N. and Barciszewski,J. (2000) Non-coding, mRNA-like RNAs database Y2K. Nucleic Acids Res., 28, 197–200.

151. Rivas,E. and Eddy,S. (2000) Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics, 16, 583–605.

152. Pertea,M., Lin,X. and Salzberg,S. (2001) GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res., 29, 1185–1190.

153. Brendel,V., Kleffe,J., Carle-Urioste,J.C. and Walbot,V. (1998) Prediction of splice sites in plant pre-mRNA from sequence properties. J. Mol. Biol., 276, 85–104.

154. Dong,S. and Searls,D.B. (1994) Gene structure prediction by linguistic methods. Genomics, 23, 540–551.

155. Xu,Y.X. and Uberbacher,E.C. (1997) Automated gene identification in large-scale genomic sequences. J. Comput. Biol., 4, 325–338.

156. Thomas,A. and Skolnick,M.H. (1994) A probabilistic model for detecting coding regions in DNA sequences. IMA J. Math. Appl. Med. Biol., 11, 149–160.

