Rogozin et al. Biology Direct 2012, 7:11
http://www.biology-direct.com/content/7/1/11

REVIEW Open Access

Origin and evolution of spliceosomal introns

Igor B Rogozin 1, Liran Carmel 2, Miklos Csuros 3 and Eugene V Koonin 1*

* Correspondence: [email protected]
1 National Center for Biotechnology Information NLM/NIH, 8600 Rockville Pike, Bldg. 38A, Bethesda, MD 20894, USA
Full list of author information is available at the end of the article

© 2012 Rogozin et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Evolution of exon-intron structure of eukaryotic genes has been a matter of long-standing, intensive debate. The introns-early concept, later rebranded ‘introns first’, held that protein-coding genes were interrupted by numerous introns even at the earliest stages of life's evolution and that introns played a major role in the origin of proteins by facilitating recombination of sequences coding for small protein/peptide modules. The introns-late concept held that introns emerged only in eukaryotes and new introns have been accumulating continuously throughout eukaryotic evolution. Analysis of orthologous genes from completely sequenced eukaryotic genomes revealed numerous shared intron positions in orthologous genes from animals and plants and even between animals, plants and protists, suggesting that many ancestral introns have persisted since the last eukaryotic common ancestor (LECA). Reconstructions of intron gain and loss using the growing collection of genomes of diverse eukaryotes and increasingly advanced probabilistic models convincingly show that the LECA and the ancestors of each eukaryotic supergroup had intron-rich genes, with intron densities comparable to those in the most intron-rich modern genomes such as those of vertebrates. The subsequent evolution in most lineages of eukaryotes involved primarily loss of introns, with only a few episodes of substantial intron gain that might have accompanied major evolutionary innovations such as the origin of metazoa. The original invasion of self-splicing Group II introns, presumably originating from the mitochondrial endosymbiont, into the genome of the emerging eukaryote might have been a key factor of eukaryogenesis that in particular triggered the origin of endomembranes and the nucleus. Conversely, splicing errors gave rise to alternative splicing, a major contribution to the biological complexity of multicellular eukaryotes. There is no indication that any prokaryote has ever possessed a spliceosome or introns in protein-coding genes, other than relatively rare mobile self-splicing introns. Thus, the introns-first scenario is not supported by any evidence but exon-intron structure of protein-coding genes appears to have evolved concomitantly with the eukaryotic cell, and introns were a major factor of evolution throughout the history of eukaryotes. This article was reviewed by I. King Jordan, Manuel Irimia (nominated by Anthony Poole), Tobias Mourier (nominated by Anthony Poole), and Fyodor Kondrashov. For the complete reports, see the Reviewers’ Reports section.

Keywords: Intron sliding, Intron gain, Intron loss, Spliceosome, Splicing signals, Evolution of exon/intron structure, Alternative splicing, Phylogenetic trees, Mobile domains, Eukaryotic ancestor

Genes in pieces: exon-intron structure of eukaryotic genes and the two spliceosomes

In a memorable phrase of Walter Gilbert, eukaryotes possess “genes in pieces” in which protein-coding sequences are interrupted by non-coding sequences denoted introns [1]. The introns are excised at the donor and acceptor splice sites such that the flanking coding regions, exons, are spliced by an extremely complex ribonucleoprotein molecular machine, the spliceosome [2,3]. Multiple introns interrupt the coding sequences in the great majority of genes in animals and plants, whereas intron densities in fungi and unicellular eukaryotes are highly variable: many of the unicellular forms contain only a few introns in the entire genome whereas in others the intron density approaches that in animals and plants [4-6]. Remarkably, however, there is no sequenced genome of a full-fledged eukaryote without introns at all; only one intronless genome of a highly degraded remnant of a eukaryotic organism, a nucleomorph that has also lost the genes for the spliceosome subunits, has been reported [7].




The ubiquity of introns in eukaryotes is complemented by the conservation of the spliceosome. The spliceosome consists of five snRNPs (small nuclear ribonucleoprotein particles), together with numerous less stably associated proteins; the core of the spliceosome is conserved in all well-characterized eukaryotes [2,3,8]. The spliceosome interacts with specific sites in the intron and the flanking exons to ensure accurate and efficient splicing. The nucleotides at the intron termini and the adjacent nucleotides in the exons are involved in these interactions and comprise the splicing signals. The (A/C)AG|GU(A/G)AGU sequence (the vertical bar marks the splice site; the first two nucleotides of the intron, GU, immediately follow it) at the donor splice signal is complementary to the 5’ end of the U1 snRNA, and this interaction appears to be the major requirement for splicing [9-11]. The (C,U)AG|G sequence (the last two nucleotides of the intron, AG, immediately precede the vertical bar) preceded by a polypyrimidine tract is typical of the acceptor splice signal (Figure 1) and is recognized by the U5 snRNA [12,13]. A short branch point signal is located in the intron sequence upstream of the acceptor splice signals and contains the reactive adenosine that is involved in the formation of the lariat-like structure in the splicing intermediate [12,13]. The functionally important (A/C)AG||G exon sequences flanking introns have been dubbed protosplice sites with the implication that new introns insert into sites of this structure [14,15]. Some lineage-specific deviations from the canonical variants of splice signals are known to exist. For example, some unicellular eukaryotes lack recognizable polyT tracts between the branch point signal and the 3’ splice signal [16,17]. Some extremely intron-poor species such as yeast possess an unusual, strictly constrained donor splice signal |GTA(T,A,C)G(T,A,C) with a substantial excess of T at position +4 [16-18].

Figure 1 Consensus motifs for donor and acceptor splicing signals. The Y axis indicates the strength of splicing signals (base composition bias based on information content). The data is from [19].

The vast majority of spliceosomal introns contain |GT at the donor splice site and AG| at the acceptor splice site. However, a distinct class of rare introns has been recognized on the basis of their unusual terminal dinucleotides: these introns contain |AT at the donor splice site and AC| at the acceptor splice site [20,21]. A closer examination of the sequences of these atypical introns revealed several properties that distinguish them from the majority of the introns including conservation of unusual signals at the donor splice signal (|ATATCCTT) and immediately upstream of the acceptor splice signal (TCCTTAAC 10-15 bases from the splice junction) [20,21]. Introns of this class are excised by a distinct, so-called minor or U12 spliceosome, which contains several specific, low-abundance snRNPs. It has been subsequently shown that some |GT-AG| introns are also removed by the U12 spliceosome [22]. Unlike the major U2 spliceosome, the U12 introns and the associated minor spliceosome are not universally conserved, but they are nevertheless widespread in eukaryotes, being represented in vertebrates, insects, plants, and some protists [23-26].

Phylogenomic reconstructions for the small RNA and protein subunits of the U2 and U12 spliceosomes suggest that both spliceosomes were already present in the last common ancestor of the extant eukaryotes (LECA, Last Eukaryotic Common Ancestor) as a result of ancient duplication of the genes for the respective components [24]. Taking into account a potentially important role of U12 introns in regulation of gene expression [27-29], it might be tempting to speculate that the ancestral introns were of the U12 type (for example, see discussion by the reviewer #3 below) but have been subsequently converted to U2 introns. However, comparison of protosplice sites (exonic sequences surrounding introns) of ancient U2 and U12 introns in human and Arabidopsis revealed close similarity of ancestral introns to U2 but not to U12. Thus, the primordial spliceosomal introns were most likely of the U2-type [30].

The two principal mechanisms of splicing signal recognition are known as exon definition and intron definition [31-34]. Evidence of these two mechanisms has come from analyses of interactions between pre-mRNAs and various splicing factors [32,33,35]. The exon definition mechanism involves SR proteins binding to exonic splicing enhancers (ESE) and recruiting U1 to the downstream donor splicing signal and the splicing factor U2AF to the upstream acceptor splicing signal. The U2AF factor then recruits U2 to the branch site. Therefore, when the SR proteins bind the ESEs, they promote formation of a “cross-exon” recognition complex by placing the basal splicing machinery at the splice sites flanking the same exon. The intron definition mechanism requires binding of U1 to the upstream donor splice site and binding of U2AF/U2 to the downstream acceptor splice signal and branch site, respectively, of the same intron. Therefore, intron definition selects pairs of splice sites located on both ends of the same intron, and SR proteins can also mediate this process [32,36]. The efficiency of splicing under the exon definition depends on the length of exons but is not affected by the length of introns; conversely, under the intron definition, the efficiency of splicing depends on the length of introns, but not that of exons [31-35,37].
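The donor consensus and the terminal dinucleotide classes described in this section can be made concrete with a small sketch. It counts how many of the nine positions of a donor site agree with the (A/C)AG|GU(A/G)AGU consensus and sorts an intron by its terminal dinucleotides. The function names and the DNA-alphabet convention (U written as T) are illustrative assumptions, not part of the original article, and a simple match count is only a crude stand-in for the information-content scores shown in Figure 1.

```python
# Donor consensus from the text, (A/C)AG|GU(A/G)AGU, written in DNA letters (U -> T).
# The first three entries are exonic positions; the last six open the intron.
# Each entry lists the bases allowed at that position.
DONOR_CONSENSUS = ["AC", "A", "G", "G", "T", "AG", "A", "G", "T"]


def donor_matches(exon_tail: str, intron_head: str) -> int:
    """Count the positions (0 to 9) at which a donor site agrees with the consensus.

    exon_tail: exon sequence ending at the splice site (last 3 bases are used).
    intron_head: intron sequence starting at the splice site (first 6 bases are used).
    Hypothetical helper for illustration only.
    """
    site = (exon_tail[-3:] + intron_head[:6]).upper().replace("U", "T")
    if len(site) != len(DONOR_CONSENSUS):
        raise ValueError("need at least 3 exonic and 6 intronic bases")
    return sum(base in allowed for base, allowed in zip(site, DONOR_CONSENSUS))


def terminal_class(intron: str) -> str:
    """Classify an intron by its first and last two nucleotides.

    Note that the terminal dinucleotides alone do not identify the spliceosome:
    according to the text, some GT-AG introns are also removed by the U12 spliceosome.
    """
    seq = intron.upper().replace("U", "T")
    ends = (seq[:2], seq[-2:])
    if ends == ("GT", "AG"):
        return "GT-AG"
    if ends == ("AT", "AC"):
        return "AT-AC"
    return "other"
```

A site such as exon `CAG` followed by intron `GTAAGT...` agrees with the consensus at every position, whereas most real donor sites match only partially.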

Introns-early, introns-late, introns-first: the competing scenarios of intron origin and evolution

Evolution of exon-intron structure of eukaryotic genes and evolutionary properties of introns had been long considered in the context of the “introns-early” vs. “introns-late” debate [38-42]. The original, “strong” introns-early hypothesis held that eukaryotic genes inherited (nearly) all introns from prokaryotic ancestors and that the differences in gene structure among homologous eukaryotic genes were due mostly to differential intron loss [39]. Under this scenario, the extant prokaryotes have lost all the primordial introns and the spliceosome in the process of ‘genome streamlining’. The later adaptations of the introns-early hypothesis assumed an intermediate position by allowing emergence of some new introns, in addition to the ancient ones [40]. The introns-late concept countered that introns were a eukaryotic novelty and new introns have been emerging continuously throughout eukaryotic evolution; in this scenario, bacteria and archaea never possessed introns or the spliceosome [41,43,44]. These hypotheses were later merged into a synthetic concept denoted ‘many introns early in eukaryotic evolution’ [45,46], which we discuss in greater detail below. In addition, there has been an attempt to revitalize the introns-early idea in the ‘introns first’ scenario according to which exons of protein-coding genes emerged from the primordial introns, i.e. non-coding regions that are presumed to have been interspersed between functional RNA sequences in the genes that existed in the RNA world and antedated proteins [47,48].

Intron density, size and distribution in protein-coding genes across the eukaryote domain

Genes of eukaryotes from different groups dramatically differ in intron density and size distribution, from only a few introns in the entire genome (that is, near zero density per gene or per kilobase) in many unicellular organisms to approximately 6 introns per kilobase (kb) of coding sequence in mammals (Figure 2). With respect to intron content, eukaryotic genomes are often crudely classified into intron-poor ones (most unicellular forms) and intron-rich ones including animals, plants, some fungi, and a few unicellular organisms such as Chlamydomonas or some diatoms (Figure 2) [42,49-52]. Although this division is appealing in its simplicity and may be convenient for the purpose of various comparative analyses, examination of intron densities in 100 sequenced eukaryotic genomes does not present an obvious bimodal distribution (Figure 2). In fact, it appears that all intron densities between 0 and 6 introns per kilobase are observed in some eukaryote genomes. However, when intron density is plotted against intron length, partitioning of eukaryote genomes into two classes becomes apparent. Up to a density of approximately 3 introns per kilobase, all introns are short, with no significant correlation between the density and length of introns, whereas for more intron-rich genomes a strong positive correlation is observed (linear correlation coefficient = 0.16, P = 0.003, Figure 2). Even among intron-rich organisms, vertebrates are outstanding in having a substantial fraction of extremely long introns (Figure 2). This strong correlation notwithstanding, there are exceptions to the general trend: intron-rich basidiomycete fungi (3-4 introns/kbp) have only short introns whereas some insects show broad intron length distributions with multiple long introns despite relatively low intron density (2-3 introns/kbp) (Figure 2). We return to the dependencies between intron density, intron length and structure of splice signals later, in the discussion of the selection pressures affecting the evolution of eukaryote gene architecture and the underlying population-genetic factors.

Figure 2 Intron density and intron length in 100 eukaryotes. The data is from [53]. [Plot: X axis, introns per kbp coding sequence; Y axis, intron length (ticks at 100 bp and 1 kbp); legend: Fungi, Holozoa, Amoebozoa, Green plants & algae, Naegleria, Emiliana, Alveolates & heterokonts; annotation in the plot: ln y=0.558x+2.7089; individual points are labeled with species abbreviations.]

As pointed out above, despite the existence of numerous, diverse intron-poor genomes, eukaryotes do not lose the “last” intron or the spliceosome although degradation of the spliceosome including loss of many components does occur, e.g. in yeast. The only firmly established exception is the tiny genome of a nucleomorph (an extremely degraded intracellular symbiont of algae) that has lost both all the introns and the spliceosome [7]; preliminary genomic data indicate that all introns might have been lost also in a microsporidium, a highly degraded intracellular parasite distantly related to Fungi [54]. In general, it remains unclear whether there are any selective factors or functional constraints underpinning this surprising preservation of at least a few introns in eukaryote genomes [55]. However, in many cases, the few introns that are retained in highly reduced genomes are present in 5’-portions of genes encoding ribosomal proteins [16,56]. The introns in these genes are important for regulation of expression and ribosomal biogenesis, and their deletion leads to significant fitness reduction in yeast [57]. Thus, the extreme rarity of complete loss of introns in eukaryotes is likely to be due, at least in part, to the deleterious effect of the loss of specific, functionally important introns.
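The quantities used in this section, intron density per kilobase of coding sequence and the correlation between density and intron length, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, intron length is log-transformed only because the annotation in Figure 2 is written in terms of ln y, and the cutoff of 3 introns per kilobase is taken from the text as an approximate boundary rather than a precise threshold. The sketch does not reproduce the published analysis or its data.

```python
import math


def intron_density_per_kb(n_introns: int, coding_length_bp: int) -> float:
    """Introns per kilobase of coding sequence."""
    if coding_length_bp <= 0:
        raise ValueError("coding length must be positive")
    return 1000.0 * n_introns / coding_length_bp


def pearson(xs, ys) -> float:
    """Plain Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def density_length_correlation(genomes, min_density: float = 3.0) -> float:
    """Correlate density with ln(intron length) among intron-rich genomes.

    genomes: iterable of (introns_per_kb, typical_intron_length_bp) pairs.
    min_density: assumed cutoff separating intron-rich genomes (about 3 in the text).
    """
    rich = [(d, math.log(length)) for d, length in genomes if d >= min_density]
    return pearson([d for d, _ in rich], [ln_len for _, ln_len in rich])
```

For instance, a gene set with 9 introns in 1,500 bp of coding sequence has a density of 6 introns per kilobase, which is the order of magnitude the text quotes for mammals.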

Evolutionary conservation of intron positions and routes of gene architecture evolution of eukaryotes

The realization that (nearly) all eukaryotes possess ‘genes in pieces’ but that intron densities and sizes vary widely triggered an intense, ongoing discussion of possible evolutionary scenarios behind these patterns. Several mechanisms of intron evolution have been suggested including intron loss, gain, and sliding [44,58-61]. Intron loss and gain are the major phenomena in the evolution of eukaryotic gene architecture. The relative contributions of these two processes have been a matter of considerable debate and controversy. Systematic comparative analyses of exon-intron structures of orthologous genes from animals, fungi and plants have shown that approximately 25% to 30% of the intron positions are shared (that is, located in the exact same position in orthologous genes) by at least two of these three lineages of complex eukaryotes with intron-rich genomes [45,62]. The prevailing interpretation of these fundamental observations is that most, if not all, introns that occupy the same positions in orthologous genes are conserved, i.e. were already present in the equivalent position of the corresponding ancestral gene. However, the alternative view, i.e., that a substantial fraction or even most of the shared introns have been independently inserted in the same position in orthologous genes in different lines of descent, cannot be automatically dismissed (see discussion below).

The apparent conservation of many intron positions in distant eukaryotes notwithstanding, intron densities in eukaryotic genomes differ widely (see above), and the location of introns in orthologous genes does not always coincide even in closely related species [63-65]. Likely cases of intron insertion and the more common intron loss have been described (e.g., [59,63,66-82]), and indications of high intron turnover rate in some eukaryotic lineages have been obtained [42]. Furthermore, parsimony and maximum likelihood (ML) reconstructions of evolution of exon-intron structure of highly conserved eukaryotic genes (see details below) suggest that both loss and gain of introns have been extensive during evolution of eukaryotic genes [45,83-88]. Together the results of these analyses indicate that the rates of intron gain and loss differ widely between eukaryotic lineages. Some eukaryotes, such as yeast Saccharomyces cerevisiae, seem to have lost nearly all of the ancestral introns, whereas others, e.g., nematodes, have experienced highly dynamic evolution, with both loss and acquisition of numerous introns [45,83-88]. However, intron gain is not easy to detect: comparative analysis of intron positions in orthologous genes from vertebrates revealed only a few losses but no apparent gain of introns in mammalian genes [89,90], and similar results have been obtained in an analysis of evolution of exon-intron structure of paralogous genes in several eukaryotic lineages [91]. These findings imply that intron loss dominates at short evolutionary distances, whereas bursts of intron insertion might accompany major evolutionary transitions. However, intron gain could be an ongoing process in nematodes: Coghlan and Wolfe [66] identified 81 introns in Caenorhabditis elegans and 41 introns in C. briggsae that appear to have been recently inserted.
However, the validity of these recent intron gains has been questioned as it has been shown that after including additional genomes in the analysis many of the reported intron gains could be parsimoniously explained by intron loss [92]. A high rate of recent intron gain has been reported for paralogous gene pairs in Arabidopsis thaliana that were duplicated simultaneously 20-60 million years ago via tetraploidization [93]. A low rate of recent intron gains has been identified in plastid-derived genes in plants [94]. Similarly, spliceosomal introns have been detected in some genes horizontally transferred from bacteria to bdelloid rotifers [95]. Probably, the most striking known case of apparent recent intron gains has been found in populations of Daphnia pulex endemic to Oregon where two polymorphic introns have been identified [70]. These new introns do not have an obvious source and are not represented in any studied D. pulex populations outside Oregon, other species of Daphnia or any other organism for which sequence data are available. Furthermore, the new introns are both found in the same gene that encodes a Rab GTPase (rab4), and are inserted within one base pair from each other. These findings cast doubt on the rarity of intron gain considering that two intron gain events have been discovered in an initial survey of only 6 nuclear loci in 36 Daphnia individuals [70]. This result was further supported by the analysis of 24 discordant intron/exon boundaries between the whole-genome sequences of two Daphnia pulex isolates. Sequencing of intron presence/absence loci across a collection of D. pulex isolates and outgroup Daphnia species has shown that most polymorphisms result from recent gains, with parallel gains often occurring at the same location in independent allelic lineages [96].

The great majority of studies aimed at reconstruction of evolution of gene architecture in eukaryotes have focused on introns in conserved portions of protein-coding regions. For example, the conclusion that there was no appreciable intron gain in mammals [89] is based on this type of data. However, evolution of poorly conserved segments of protein-coding sequences, untranslated regions of protein-coding genes, alternatively spliced regions, and genes originated from transposable elements appears to be much faster and more dynamic, with numerous intron gains in mammals [97-101]. A case of such intron acquisition has been reported for the RNF113B retrogene that encodes a RING finger protein (a predicted E3 subunit of ubiquitin ligase of unknown specificity) and is present in the genomes of all primates (Figure 3) [101]. This primate-specific gene underwent rapid evolution that included an intron gain. The presence of the intron is supported by several human mRNA sequences and comparisons with multiple primate genomes (marmoset, macaque, orangutan, and chimpanzee). Sequence alignment analysis shows that the intron of RNF113B is not a de novo insertion but rather a derivative of an exonic sequence (a tandem mutation AG>GT generated the donor site). The new intron contains 59 nucleotides from former coding sequence and 46 nucleotides from the 3’ UTR. This finding was further supported by sequencing of the human RNF113B cDNAs which revealed two alternative RNF113B isoforms (Figure 3) [101]. In general, due to the lack of evolutionary conservation in such genes and gene regions, reconstruction of intron gain and loss events in their evolution is difficult and sometimes inaccurate (especially without experimental verification). Accordingly, evolutionary studies tend to concentrate on highly conserved genes. Thus, the conclusions on intron stasis in some groups of eukaryotes, such as mammals, in part appear to stem from sampling biases whereas the overall intron turnover might be much more extensive than is currently appreciated.

The same problem pertains to non-coding RNA genes.

For example, mammalian genomes contain numerous (>10,000) genes for long non-coding RNAs (lncRNAs) that encompass numerous introns [102]. In a recent detailed study, over 8,000 lncRNA genes have been identified, with a mean intron density of ~1.9 per kilobase, and extensive alternative splicing of these non-coding RNAs has been detected, with ~2.3 RNA isoforms per gene [103]. One of the best studied lncRNAs is Xist which is involved in X-chromosome inactivation in females of eutherian mammals [104]. The Xist RNA appears to have evolved as a result of pseudogenization of the Lnx3 protein-coding gene in early eutherians followed by integration of mobile elements [105]. Analysis of Xist in several mammalian species revealed an overall conservation of the Xist gene structure (Figure 4). Four of the 10 Xist exons found in eutherians show significant sequence similarity to exons of the Lnx3 gene (Figure 4) whereas the remaining 6 Xist exons are similar to different transposable elements. Thus, some Xist introns were inherited from the Lnx3 gene but some appear to have been gained in the course of evolution of the Xist gene [105]. Comparative analysis of >3,000 mouse lncRNA genes suggested that conservation of the exon/intron structure might be a general lncRNA property [106]. It was found that 65% and 40% of mouse lncRNA |GT-AG| splice sites are conserved in human and rat, respectively. These numbers are significantly greater than the number of conserved intronic GT and AG dinucleotides that are not involved in splicing, indicating evolutionary conservation of splice signals in lncRNAs [106]. Detailed reconstruction of the origin and evolution of introns in lncRNAs awaits further comparative genomic studies.

Figure 3 An example of a recent intron acquisition in a retrotransposon-derived gene: structure of two splice variants of RNF113B. The new intron of RNF113B is not a de novo insertion but rather a derivative of exonic sequences (this intron contains 59 nucleotides from the former coding sequence and 46 nucleotides from the 3’ UTR). A partial alignment of three RNF113B sequences and three RNF113A sequences is shown above the spliced RNF113B isoform. The donor splice site is marked in yellow, the predicted branch point signal is marked in blue, and the acceptor splice site is marked in gray. The data is from [101]. [Panel labels: RNF113B mRNA splice variant 1 (intron spliced); RNF113B mRNA splice variant 2 (intron retained). Legend: protein coding sequence, untranslated region, new intron. Scale marks: 500 bp, 1000 bp.]

The distributions of intron positions over the length of coding regions differ substantially between eukaryotic taxa. In intron-poor genes of single-cell eukaryotes, introns are strongly over-represented in the 5’-portions whereas in intron-rich multicellular organisms, the distribution is closer to uniformity [64,65]. A mechanistic explanation for these patterns suggests that introns are preferentially lost from the 3’-portion of a gene, conceivably due to the over-representation of the respective sequences among the cDNAs that are produced by reverse transcription and are thought to mediate intron loss via homologous recombination [65,107-109]. However, a complementary, selectionist interpretation of the observed distributions of introns, to the effect that introns located in the 5’-portion of a gene are more often involved in one or more intron-mediated functions (see below), has been proposed as well [65]. Analysis of distributions of intron positions over the length of the coding region suggested that both loss and insertion of introns occurred preferentially in the 3’-regions of genes, which suggested reverse-transcription-mediated mechanisms for both processes [110]. This hypothesis appears to be compatible with the positive association that has been shown to exist between the rates of intron gain and loss in individual genes [111]. However, a more recent probabilistic analysis of intron gain and loss indicates that the mechanisms of loss and gain are most likely to be different, with reverse transcription involved only in intron loss [112].
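The positional bias discussed above can be quantified in a simple way by rescaling each intron position to the length of its coding region and asking what share of introns falls in the 5’ half. The sketch below is an illustrative assumption about how such a summary could be computed; the function names and the choice of the midpoint as the boundary are hypothetical and are not taken from the cited studies.

```python
def relative_positions(intron_positions, cds_length: int):
    """Rescale intron positions (nucleotide offsets within the coding region) to [0, 1].

    0 corresponds to the 5' end of the coding region and 1 to the 3' end.
    """
    if cds_length <= 0:
        raise ValueError("coding region length must be positive")
    for pos in intron_positions:
        if not 0 <= pos <= cds_length:
            raise ValueError(f"intron position {pos} lies outside the coding region")
    return [pos / cds_length for pos in intron_positions]


def five_prime_fraction(genes) -> float:
    """Fraction of all introns located in the 5' half of their coding region.

    genes: iterable of (intron_positions, cds_length) pairs, one per gene.
    A value near 0.5 is what a uniform distribution would give; values well
    above 0.5 indicate over-representation of introns in the 5' portion.
    """
    rel = [r for positions, length in genes
           for r in relative_positions(positions, length)]
    if not rel:
        raise ValueError("no introns supplied")
    return sum(r < 0.5 for r in rel) / len(rel)
```

Pooling introns across genes, as done here, weights intron-rich genes more heavily; a per-gene average would be an equally defensible design choice.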

    [Figure 4: schematic; labels: exon # (1, 2), Lnx3, primates, rodents, dog, cow.]

    Figure 4 The Xist gene evolved from a protein-coding gene and a set [...] from Lnx3; red boxes indicate exons originating from transposable elements [...] is from [105].

    Intron sliding (also called slippage or migration; hereinafter IS) can be defined as relocation of intron/exon boundaries over short distances (1-60 bases) in the course of evolution. Intron sliding has been postulated by advocates of the introns-early hypothesis to explain the surprising finding that the positions of apparently orthologous introns sometimes varied among lineages [60]. However, the introns-late camp maintained that IS, if it occurs at all, has contributed little to the diversity of intron positions [44,59]. The reality of IS had been debated for a long time. A Monte Carlo statistical analysis of broadly sampled data on intron positions has strongly suggested that one-base-pair IS, although a relatively rare event occurring in


    the emergence and fixation of IS during evolution [114]. Given the near ubiquity of alternative splicing (AS) in many groups of animals and possibly plants [48], Tarrio et al. proposed that AS could be an intermediate stage in the evolution of IS. Under this scenario, emergence of a new splicing signal adjacent to a pre-existing one results in AS but, if and when the original splicing signal subsequently deteriorates, the net result is IS [114]. The proposed route of IS evolution via AS is likely to be common in poorly conserved regions of protein-coding genes with frequent AS events (e.g. 5’- and 3’-regions of many genes) but rare in conserved portions of protein-coding genes. Comparative analysis of closely located introns among 12 Drosophila genomes has suggested that IS is a relatively frequent cause of novel intron positions in Drosophila [115]. All things considered, there is currently no doubt that IS is real and can yield new intron positions but the actual impact of IS in the evolution of eukaryotic genes will be accurately determined only when multiple sets of closely related genomes become available and rigorous methods for statistical analysis are developed.
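The working definition of IS given above (relocation of an intron/exon boundary by 1-60 bases) suggests a simple first-pass screen for candidate sliding events. The sketch below is only an illustration under stated assumptions: intron positions are given as nucleotide offsets in a shared alignment of two orthologous coding sequences, and the function name `sliding_candidates` and the toy positions are hypothetical. It is not the Monte Carlo procedure used in the cited analyses, and a hit is merely a candidate that would still need statistical validation.

```python
from typing import List, Tuple

MAX_SLIDE = 60  # upper bound on the sliding distance (bases), taken from the definition in the text


def sliding_candidates(pos_a: List[int], pos_b: List[int],
                       max_slide: int = MAX_SLIDE) -> List[Tuple[int, int, int]]:
    """Return (position_a, position_b, offset) for intron pairs that are
    close to each other, but not identical, in a shared alignment coordinate system."""
    shared = set(pos_a) & set(pos_b)      # identical positions count as conserved, not slid
    only_a = sorted(set(pos_a) - shared)
    only_b = sorted(set(pos_b) - shared)
    hits = []
    for a in only_a:
        for b in only_b:
            offset = b - a
            if 1 <= abs(offset) <= max_slide:
                hits.append((a, b, offset))
    return hits


# Toy data (assumed, not from the paper): intron offsets in two aligned orthologs.
species_a = [120, 345, 600, 912]
species_b = [120, 346, 700, 930]
candidates = sliding_candidates(species_a, species_b)
print(candidates)
```

In this toy input, position 120 is shared and therefore treated as conserved, whereas the pairs 345/346 and 912/930 fall within the 60-base window and are reported as candidates.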

    Evolution of splicing signals, protosplice sites, and intron phase distribution

    As pointed out above, the densities of spliceosomal introns vary dramatically among eukaryotes (Figure 2), and so does the strength of splicing signals [18,45,51,116]. There is a striking correspondence between low intron density and high information content of donor splice signals across eukaryotic genomes [51]. Intron-poor genes (genomes) with strong donor sites are found in several groups of eukaryotes (e.g. fungi) that also include intron-rich genomes with weaker donor sites. Evolutionary reconstruction suggests that ancestral donor signals had low information content but that many lineages have independently undergone concomitant major intron loss and donor signal strengthening [51]. This evolutionary trend receives a straightforward explanation within the framework of the population-genetic concept of evolution of gene architecture (see below).

    However, the acceptor splice signal shows a different trend: it is weak in most fungi, intermediate in plants and some unicellular eukaryotes, and strongest in metazoans where it gradually strengthens from nematodes to mammals [116]. This observation can be interpreted in the light of the results of a large-scale analysis of splicing signals in 61 eukaryotic species which revealed a significant negative correlation between the strength of the branch point signal and the strength of the acceptor splice site (Figure 5; R = -0.52, P = 0.000025) [117]. Although the correlation between the strength of the donor splice signal and the combined strength of the branch point signal and the acceptor splice signal was not significant (R = 0.19, P = 0.15), the positive sign of this correlation still could reflect congruent evolution of splicing signals. In general, a complex interplay exists between intron density, intron size, the strength of splice signals and the strength of splicing enhancers/silencers. For example, splice signals in long and short introns in Drosophila show only minor differences [118]. Several weak but statistically significant correlations have been observed between vertebrate intron length, splice site strength, and potential exonic splicing signals that attest to a compensatory relationship between splice sites and exonic splicing signals, depending on vertebrate intron length [119].

    It has been proposed that the functionally important (A/C)AG||G exon sequences flanking introns are relics of recognition signals for the insertion of introns that have been dubbed protosplice sites [14,15]. Protosplice sites became an important staple of the introns-late hypothesis of intron evolution because, if intron insertion was limited to strictly defined protosplice sites, parallel gain of introns would be likely and could account for the large number of shared introns among orthologs from distant eukaryotic lineages [41,63,83]. Support for the protosplice site hypothesis has been drawn from experiments demonstrating that elimination of the regular splice sites in actin genes resulted in activation of cryptic splice sites, most of which coincided with exon junctions in orthologous genes from other species [120]. Nevertheless, it remained unclear whether the consensus nucleotides flanking the splice junctions were remnants of the original protosplice sites or evolved convergently after intron insertion. The existence of protosplice sites was directly addressed by examining the context of introns inserted within codons which encode amino acids conserved in all eukaryotes and, accordingly, are not subject to selection for splicing efficiency. According to the parsimony principle, these codons (e.g., GGN for conserved glycines or CCN for conserved prolines) can be inferred to have been present already in the common ancestor of all extant eukaryotes, so the ancient protosplice sites (if such existed) should have survived and could be examined directly. This analysis has shown that introns, indeed, predominantly insert into and/or are preferentially fixed in specific (protosplice) sites with the consensus sequence (A/C)AG||Gt [121].

    Recently, a correlation between the positions of cryptic splicing signals (sequences that are similar to splicing signals but normally do not function in splicing) and introns has been found: cryptic splicing signals within exons of one species frequently match the exact position of introns in orthologous genes from another species. This observation suggests that in the course of evolution many introns were inserted into cryptic splicing signals that had been in place prior to intron insertion [122]. However, this conclusion is contradicted by another

    [Figure 5: scatter plot; x-axis: branch point signal (4.00 to 8.00); y-axis: acceptor signal (4.00 to 10.00).]

    Figure 5 Correlation between the strength of the branch point signal and the strength of the acceptor splice site. The linear correlation coefficient is R = -0.52 (P = 0.000025) after exclusion of the obvious outlier Aureococcus anophagefferens [117]. The information content of splicing signals in 61 eukaryotic species is from [117]. Species names: B. taurus, C. familiaris, E. caballus, H. sapiens, M. domestica, M. musculus, O. anatinus, R. norvegicus, S. scrofa, B. floridae, C. intestinalis, C. savignyi, D. rerio, G. gallus, O. latipes, P. marinus, T. guttata, X. tropicalis, A. gambiae, A. mellifera, C. elegans, D. pulex, D. melanogaster, H. magnipapillata, L. gigantea, M. brevicollis, N. vectensis, S. purpuratus, T. castaneum, B. dendrobatidis, C. heterostrophus, C. neoformans, M. grisea, N. haematococca, P. chrysosporium, P. blakesleeanus, P. infestans, P. placenta, S. cerevisiae, S. commune, T. virens, A. anophagefferens, D. discoideum, D. purpureum, N. gruberi, O. lucimarinus, P. tricornutum, T. pseudonana, T. adhaerens, A. thaliana, Chlorella NC64A, C. reinhardtii, M. pusilla, Micromonas RCC299, O. sativa, P. patens, P. trichocarpa, S. moellendorffii, S. bicolor, V. vinifera, V. carteri.


    observation, that recently gained introns in animal alpha-amylase genes were not associated with specific sequence motifs (protosplice sites) [123]. In the same gene family, old introns were embedded within strong protosplice motifs which were found to be much weaker in homologous genes lacking the intron in the given position [123]. These findings are consistent with the hypothesis that sites of de novo intron insertion are effectively random and that selection drives the emergence of protosplice-like sequences following intron insertion. The presence of much stronger protosplice sites around old introns compared to young introns [123] seems to suggest that evolution of protosplice sites subsequent to intron insertion is a slow process [123,124].

    The hypothesis that selection acts on the exonic parts of splice signals was supported by comparison of the nucleotide sequences around the splice junctions that flank old (shared by two or more major lineages of eukaryotes) compared with new (lineage-specific) introns in eukaryotic genes. The distributions of information content between the intronic and exonic parts of the splice signals have been found to be substantially different in old and new introns [125]. Old introns have lower information content in the exonic regions adjacent to the splice sites than new introns but, conversely, have higher information content in the intron itself. These findings imply that introns insert into protosplice sites but during the evolution of an intron after insertion, the splice signal shifts from the flanking exonic regions to the ends of the intron itself. Accumulation of information inside the intron during evolution is best compatible with the view that new introns, largely, emerge de novo and not via propagation and migration of pre-existing introns [125].

    The contradictory findings on the protosplice site involvement versus the evolution of these motifs after intron gain (in which case ‘protosplice site’ becomes a misnomer) might reflect genuine differences in the evolution of the gene architectures among gene families, in particular between highly conserved and more dynamic families. The definitive assessment of the validity of the protosplice site hypothesis requires further, comprehensive comparative genomic studies.

    Introns occur in three phases (0, 1, and 2) which are defined by the position of the intron within or between codons: introns of phase 0, 1, and 2 are located, respectively, between two codons, after the first position in a codon, and after the second position. In (nearly) all analyzed genomes, there is a significant excess of phase 0 introns over those in the other two phases [125-130]. The only known remarkable exception is the rapidly evolving tunicate Oikopleura that shows a uniform distribution of introns among the three phases [131].
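The phase definition above reduces to simple modular arithmetic: if an intron interrupts the coding sequence after a given number of coding nucleotides, its phase is that number modulo 3. The following is a minimal sketch of this bookkeeping; the function names and the example offsets are illustrative assumptions and are not taken from the cited studies.

```python
from typing import Iterable, List


def intron_phase(cds_offset: int) -> int:
    """Phase of an intron that follows `cds_offset` coding nucleotides.

    0: between two codons; 1: after the first codon position;
    2: after the second codon position.
    """
    if cds_offset < 0:
        raise ValueError("cds_offset must be non-negative")
    return cds_offset % 3


def phase_distribution(offsets: Iterable[int]) -> List[float]:
    """Fraction of introns in phases 0, 1 and 2."""
    counts = [0, 0, 0]
    for off in offsets:
        counts[intron_phase(off)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0, 0.0, 0.0]


# Toy offsets (assumed): number of coding nucleotides upstream of each intron.
example_offsets = [99, 150, 200, 301, 450]
print(phase_distribution(example_offsets))
```

For these toy offsets, three introns fall between codons (phase 0) and one each falls in phases 1 and 2, which mimics, on a very small scale, the kind of phase 0 excess described in the text.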


    An excess of protosplice sites in phase 0 was noticeable in some species (Figure 6) [132]; however, the protosplice site model cannot fully explain the observed over-representation of phase 0 introns under the assumption that introns randomly insert into protosplice sites (Figure 6) [125,127,128]. Furthermore, it has been shown that phase 0 introns were, on average, located in more highly conserved portions of genes than phase 1 and 2 introns [45]. This observation suggests that phase 1 and phase 2 introns experience a greater deleterious-mutation-driven loss and could reconcile the observed phase distribution with the protosplice site hypothesis [125,127,128,130].
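The comparison between protosplice sites and actual introns by phase can be illustrated by tallying where the motif (A/C)AG||G occurs in a coding sequence and which phase an intron inserted at the `||` junction would have. This is a minimal sketch under assumptions: the sequence is a made-up toy string rather than a real gene, the function name is hypothetical, and only the four-nucleotide exonic core of the motif is matched. It is not the procedure used in [132].

```python
import re
from collections import Counter

# Zero-width lookahead so that overlapping occurrences of (A/C)AG|G are all found.
PROTOSPLICE = re.compile(r"(?=[AC]AGG)")


def protosplice_phase_counts(cds: str) -> Counter:
    """Count protosplice sites (A/C)AG||G in a coding sequence by the phase
    an intron inserted at the '||' junction would have."""
    counts = Counter({0: 0, 1: 0, 2: 0})
    for match in PROTOSPLICE.finditer(cds.upper()):
        junction = match.start() + 3   # coding nucleotides upstream of the junction
        counts[junction % 3] += 1
    return counts


# Toy sequence (assumed, in frame from its first nucleotide).
toy_cds = "ATCAGGCAGGTTAAGGA"
site_counts = protosplice_phase_counts(toy_cds)
print({phase: site_counts[phase] for phase in (0, 1, 2)})
```

Dividing such counts by their total gives per-phase fractions of protosplice sites that can be set against the observed per-phase fractions of introns, which is the kind of comparison summarized in Figure 6.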

    Conservation versus parallel gains of introns

    As discussed above, comparative analyses revealed numerous introns that occupy the same position in orthologous genes from distant species [45,62]. In particular, orthologous genes from humans and the green plant A. thaliana share ~25% of intron positions [45]. The straightforward interpretation of these observations is that the shared introns were inherited from the common ancestor of the respective species whereas lineage-specific introns were inserted into genes at later stages of evolution [45,62]. Under this premise, parsimonious reconstructions indicate that even early eukaryotes already had a relatively high intron density, perhaps, comparable (at least within an order of magnitude) to that in modern plant and animal genes. However, the inference that shared intron positions reflect evolutionary conservation is challenged by the potential non-randomness of intron insertion: introns are inserted or fixed mostly in distinct protosplice sites as discussed in the preceding section. In principle, if there were few protosplice sites in each gene, the presence of introns in the same positions of orthologous genes in distantly related species could be completely or at least to a large extent explained by parallel gains. At least two cases of apparent parallel gain of introns in orthologous genes from plants and animals have been reported [133,134]. Moreover, probabilistic modeling of intron evolution discussed above suggested that many, if not most, introns shared by phylogenetically distant species were likely to originate by parallel gain of introns in protosplice sites [83]. The implication is that intron distribution in extant organisms is largely determined by relatively recent insertions and cannot be used to infer exon-intron structure of ancestral genes. However, the dataset of 10 gene superfamilies by Qiu and co-workers [83] contained numerous ancient duplications combined with frequent lineage-specific losses of genes, because of which analysis of intron conservation and intron gains is likely to be confounded by problems of phylogenetic reconstructions.

    [Figure 6: bar chart; x-axis: Phase 0, Phase 1, Phase 2; y-axis: fraction (0 to 0.7); series: At_introns, At_protosplice sites, Hs_introns, Hs_protosplice sites.]

    Figure 6 Fractions of protosplice sites and actual introns in the three phases. Species abbreviations: (At) green plant Arabidopsis thaliana, (Hs) human Homo sapiens. An excess of protosplice sites in phase 0 is noticeable, however the ‘protosplice site’ hypothesis, which posits that introns are randomly inserted into protosplice sites, is unable to fully explain the observed over-representation of phase 0 introns. The data is from [125,132].

    The extent of independent insertion of introns in the same sites (parallel gain) in orthologous genes from phylogenetically distant eukaryotes was assessed within the framework of the protosplice site model [132]. It was shown that protosplice sites are no more conserved during evolution of eukaryotic gene sequences than random sites. Simulation of intron insertion into protosplice sites with the observed protosplice site frequencies and intron densities has shown that parallel gain could account for only 5 to 10% of shared intron positions in distantly related species [132]. The results of this simulation suggest that the presence of numerous introns in the same positions in orthologous genes from diverse eukaryotes, such as animals, fungi, and plants, reflects mostly bona fide evolutionary conservation [132].

    Analysis of intron gain and loss rates over branches of the phylogenetic tree for 19 eukaryotic species allowed


    for the estimation of the probability of parallel gain for each intron position that is shared by more than one species [111]. The resulting estimates indicated that parallel gain, on average, accounts for only ~8% of the shared intron positions, in agreement with the simulation results discussed above [111]. However, the distribution of parallel gains over the phylogenetic tree of eukaryotes is highly non-uniform. There are almost no parallel gains in closely related lineages, whereas for distant lineages, such as animals and plants, parallel gains could contribute up to 20% of the shared intron positions. Taken together, the results of these analyses indicate that, although parallel gain of introns is non-negligible, the substantial majority of introns that occupy the same positions in orthologous genes from distantly related eukaryotes are ancestral, including many inherited from LECA [111].
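The logic of the parallel-gain argument can be made concrete with a small Monte Carlo experiment: if two lineages independently insert introns at random into the same pool of protosplice sites, how many positions coincide by chance alone? The sketch below is purely illustrative. The site and intron counts are invented, the function name is hypothetical, and the model ignores site conservation and rate heterogeneity, so it should not be read as the simulation of [132] or the probabilistic analysis of [111].

```python
import random


def simulate_parallel_gain(n_sites: int, n_introns_a: int, n_introns_b: int,
                           n_trials: int = 2000, seed: int = 0) -> float:
    """Mean number of intron positions shared purely by chance when two
    lineages independently insert introns into the same protosplice sites."""
    rng = random.Random(seed)
    sites = range(n_sites)
    total_shared = 0
    for _ in range(n_trials):
        a = set(rng.sample(sites, n_introns_a))   # insertions in lineage A
        b = set(rng.sample(sites, n_introns_b))   # independent insertions in lineage B
        total_shared += len(a & b)
    return total_shared / n_trials


# Assumed toy parameters: 200 protosplice sites, 20 introns gained per lineage.
n_sites, n_a, n_b = 200, 20, 20
simulated = simulate_parallel_gain(n_sites, n_a, n_b)
analytic = n_a * n_b / n_sites   # expectation of the hypergeometric overlap
print(f"simulated mean shared: {simulated:.2f}, analytic expectation: {analytic:.2f}")
```

Comparing the chance expectation with the number of positions actually shared between two genomes gives a rough sense of the fraction of shared positions that parallel gain could explain; the fewer the available sites relative to the number of introns, the larger that fraction becomes.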

    Reconstruction of evolution of exon-intron structure of eukaryote genes

    The patterns of conservation and variation of intron positions in orthologous and paralogous genes can be employed to reconstruct evolutionary scenarios for the exon-intron structure of eukaryotic genes using evolutionary parsimony or maximum likelihood approaches. Now that multiple eukaryotic genomes have been sequenced, such genome-wide evolutionary reconstruction has become realistic. The comparative data on intron positions can be naturally represented as a matrix of intron presence/absence (usually encoded as 1/0), and to these matrices, various reconstruction methods can be applied. The first of such studies employed orthologous gene sets from 8 eukaryotic species and took the most straightforward approach to evolutionary reconstruction by applying the parsimony principle in the specific form of Dollo parsimony [45]. Given a species tree topology and a pattern of intron presence/absence, the Dollo algorithm constructs the most parsimonious (simplest) scenario for the evolution of gene structure, i.e. the distribution of inferred intron gain and loss events over the tree branches. The main underlying assumption is that a character (intron in a given position) once lost cannot be regained whereas as many parallel intron losses in different branches of the tree are allowed as needed to account for the observed pattern of states. By analyzing more than 7,000 intron positions in highly conserved genes of eukaryotes, the Dollo parsimony approach produced an evolutionary scenario under which the common ancestor of modern eukaryotes possessed intron-rich genes, with intron density only a few fold lower than that in the most intron-rich modern forms (vertebrates and plants). Massive intron losses were inferred for several groups, especially yeasts, nematodes and insects, whereas in vertebrates and plants intron gain was inferred to be the main evolutionary trend [45].

    The parsimony approach is obviously oversimplified given that all lineage-specific introns are automatically treated as newly gained, which disregards the possibility that some of these introns are ancestral and have been lost in multiple lines of descent. Furthermore, parsimony does not provide confidence estimates for the inferred ancestral states. These limitations of parsimony potentially could grossly distort the results of evolutionary reconstruction, especially if the number of analyzed species is small. Probabilistic approaches such as maximum likelihood models can address these problems, at least in principle. However, the first two statistical studies into intron evolution produced opposite results: Qiu et al. concluded that ancient introns (if they ever existed) have not survived in extant genes [83] whereas Roy and Gilbert came to the conclusion that the great majority of introns, even lineage-specific ones, were ancient [84]. The first conclusion implies that intron gain was dominant over intron loss in the evolution of eukaryotic genes, whereas the second one suggests that intron loss is the principal evolutionary process. This major discrepancy between the results of the two studies has indicated that optimal parameters for maximum likelihood modeling of intron evolution remained to be determined [135].

    The next generation of increasingly sophisticated maximum likelihood (ML) reconstructions of intron gain and loss during eukaryotic evolution suggested that the protein-coding genes of ancient eukaryotic ancestors, including the Last Eukaryotic Common Ancestor (LECA), already possessed intron density comparable to that found in modern, moderately intron-rich genomes [85-88,136]. Accordingly, the history of eukaryotic genes, with respect to the dynamics of introns, appears to be to a large extent dominated by losses, perhaps punctuated by a few episodes of major gain [87,88,91,137]. This conclusion is based on analyses of considerably larger data sets than those used in earlier studies; for example, Carmel and co-workers [87] analyzed 391 sets of orthologous genes from 19 eukaryotic species. This extended data set not only allowed for a more definitive reconstruction of the gene structure evolution, but also permitted zooming in on specific portions of the eukaryotic tree [87]. A comprehensive probabilistic model of intron evolution was developed that incorporated heterogeneity of intron gain and intron loss rates with respect to both lineages and genes as well as variability among sites within a gene [87]. It was demonstrated that ancestral eukaryotic forms were intron-rich and that evolution of eukaryotic genes involved numerous gains and losses of introns, with losses being somewhat more common. Three distinct modalities of intron gain and loss during eukaryotic evolution were identified. The ‘balanced’ mode appears


    to operate in all eukaryotic lineages, and is characterized by proportional intron gain and loss rates [87]. In addition to this apparently universal process, many lineages exhibit an elevated loss rate, whereas a few others exhibit an elevated gain rate. Moreover, the rates of intron gain and loss are highly non-uniform over the time course of the evolution of eukaryotes: both rates seem to have been decreasing with time over the last several hundred million years. The decrease in gains was faster than the decrease in losses, resulting in many lineages with very limited intron gain over the last several hundred million years [87].

    Figure 7 illustrates the latest reconstruction of intron gain and loss for 99 eukaryotic species that was performed using a Markov Chain Monte Carlo (MCMC) approach [53]. In this, so far the most extensive study, the Malin software package [138] was used to infer ancestral states from a matrix of shared introns which comprised 8403 intron presence-absence profiles from 245 sets of orthologous genes. The MCMC method infers ancestral intron density by using a probabilistic intron gain-loss model, taking into account rate heterogeneity across lineages and across sites within genes. This reconstruction provides a thorough view of the evolution of gene structure in three eukaryotic supergroups and reveals several general trends (Figure 7) [53]. Most lineages show net intron loss that can be substantial as in alveolates, some lineages of fungi, green algae or insects, or offset by concomitant intron gains as in land plants, most animal lineages, and some fungi. Massive intron gains were inferred only for several deep branches, most conspicuously the stem of the Metazoa, and to a lesser extent, the stems of Mamiellales (a branch of green algae), Viridiplantae, Opisthokonta, and Metazoa together with Choanoflagellata (Figure 7). These findings vindicate, on a much larger data set and with greater confidence, the previous conclusions that compared to the common and substantial intron loss, extensive intron gain was rare during the evolution of eukaryotes. Episodes of substantial intron gain seem to coincide with the emergence of major new groups of organisms with novel biological characteristics such as the Metazoa (Figure 7) [53]. Several previous studies, performed on much smaller data sets and with less robust reconstruction methods, have suggested that at least some eukaryotic ancestral forms could have possessed intron-rich genes [84,85,136], and observations on gene structures in primitive animals such as the sea anemone Nematostella [139] and the annelid Platynereis [140] were compatible with these inferences. A particularly striking conclusion has been reached in the reconstruction of the evolution of gene architecture in Chromalveolata: although all sequenced genomes in this supergroup of eukaryotes are intron-poor, intron-rich last common ancestors have been inferred for Chromalveolata and particularly Alveolata [141]. Clearly, the reconstruction led to this conclusion only because, although very few intron positions are conserved among the intron-poor orthologous genes of different chromalveolates, many of these genes share a large fraction of intron positions with intron-rich orthologs from plants and/or animals.

    The latest MCMC reconstruction reinforced these conclusions by inferring high intron densities for the ancestors of each major group of eukaryotes within each of the three supergroups (Figure 7) [53]. The implication is that, whenever an extant eukaryotic genome shows a low intron density, this intron-poor state is a result of extensive, lineage-specific intron loss. Remarkably, so many intron positions are shared between eukaryotes that, with the large and apparently representative set of compared genomes, Dollo parsimony reconstruction infers ancestral genomes that are as intron-rich as those inferred by the MCMC and maximum likelihood methods [53]. The results of this reconstruction indicate in particular that the entire line of descent from LECA to mammals was a continuous intron-rich state (Figure 7) that would provide for uninterrupted evolution of the growing repertoire of functional alternative spliced forms (see below). The unprecedented intron gain at the onset of animal evolution (Figure 7) could further contribute to the expansion of alternative forms. This spurt of intron gain conceivably resulted from a combination of a population bottleneck that led to weak purifying selection with increased transposon activity (see below).
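The Dollo parsimony reconstruction referred to in this section can be sketched for a single intron position as follows. Under the Dollo assumption the intron is gained once, on the branch leading to the last common ancestor of all species that carry it, and every absence below that ancestor is explained by loss. This is a minimal illustration under stated assumptions: the tree is a nested tuple of made-up species labels, the function name is hypothetical, and the code is neither the implementation used in [45] nor the Malin software.

```python
def dollo_reconstruct(tree, present):
    """Dollo parsimony for one intron position.

    tree    : nested tuples of species names, e.g. (("a", "b"), "c")
    present : set of species that carry the intron
    Returns (states, gains, losses); states maps each node to 1 or 0.
    """
    counts = {}

    def fill(node):
        # Number of intron-bearing leaves below each node.
        if isinstance(node, str):
            counts[node] = int(node in present)
        else:
            counts[node] = sum(fill(child) for child in node)
        return counts[node]

    total = fill(tree)
    states = {node: 0 for node in counts}
    if total == 0:
        return states, 0, 0

    # Descend to the last common ancestor of all intron-bearing leaves.
    lca = tree
    while not isinstance(lca, str):
        deeper = [child for child in lca if counts[child] == total]
        if not deeper:
            break
        lca = deeper[0]

    # Below the LCA: present wherever an intron-bearing leaf descends, lost otherwise.
    losses = 0
    stack = [lca]
    while stack:
        node = stack.pop()
        states[node] = 1
        if isinstance(node, str):
            continue
        for child in node:
            if counts[child] > 0:
                stack.append(child)
            else:
                losses += 1   # one loss on the branch leading to this child
    return states, 1, losses


# Toy species tree with assumed labels.
toy_tree = ((("human", "mouse"), "fly"), ("yeast", "plant"))
states, gains, losses = dollo_reconstruct(toy_tree, {"human", "plant"})
print(states[toy_tree], gains, losses)
```

With the intron present only in "human" and "plant", the root is inferred to carry the intron and three independent losses are required. If the intron is present only in "human" and "mouse", the same procedure places a single gain on the branch leading to their common ancestor and leaves the root without the intron, which illustrates why lineage-specific introns are automatically treated as newly gained under parsimony.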

    Evolution of exon-intron structure in paralogous gene families

    The reconstructions of the evolution of gene architecture in eukaryotes were performed on sets of orthologous genes with a single representative (or a single most conserved representative) in each of the analyzed genomes. Obviously, this type of reconstruction reflects only one facet of evolution of gene structure given that all eukaryotic genomes encompass numerous families of paralogous genes with broad distributions of the number of members. Reconstruction of parsimonious scenarios of gene structure evolution in paralogous gene families in animals, plants and malaria parasites revealed numerous apparent gains and losses of introns [91,142]. In all analyzed lineages, the number of acquired new introns was substantially greater than the number of lost ancestral introns. This trend held even for lineages in which vertical evolution of genes involved many more intron losses than gains, suggesting that gene duplication boosts intron insertion. However, dating gene duplications and the associated intron gains and losses based on the molecular clock assumption showed that very few, if any, introns were gained during the last approximately 100 million years of animal and plant evolution, in agreement with previous conclusions reached through

    [Figure 7: phylogenetic tree of the analyzed eukaryotic species with intron densities shown next to terminal taxa and selected ancestors. Labeled ancestral intron densities: LECA 4.3, Bilateria 7.7, Opisthokonts 5.1, Metazoa 8.8, Apicomplexa 4.2, Embryophyta 6.4, Dikarya 4.2. Clade labels: Heterokonts, Alveolates, Chromalveolata, Fungi, Ascomycetes, Basidiomycetes, Filozoa, Spiralia, Nematodes, Insects, Vertebrates, Amoebozoa, Archaeplastida, Green algae, Land plants.]

    Figure 7 Reconstruction of intron gains and losses in the evolution of eukaryotes and intron density in ancestral eukaryote forms. The data is from [53]. Branch widths are proportional to intron density which is shown next to terminal taxa and some deep ancestors, in units of intron count per 1 kbp of coding sequence. Human (Hsap) is marked by a blue dot. Horizontal bars show ancestral (top) and current (bottom) intron content; gain and loss (in the lineage from the respective ancestor) are shown by red and green, respectively. The bars are aligned so that the pale yellow part shows the retained introns from the ancestor. Species names and abbreviations: Aureococcus anophagefferens (Aano), Aedes aegypti (Aaeg), Agaricus bisporus (Abis), Anopheles gambiae (Agam), Allomyces macrogynus (Amac), Apis mellifera (Amel), Aspergillus nidulans (Anid), Acyrthosiphon pisum (Apis), Arabidopsis thaliana (Atha), Babesia bovis (Bbov), Batrachochytrium dendrobatidis (Bden), Branchiostoma floridae (Bflo), Botryotinia fuckeliana (Bfuc), Brugia malayi (Bmal), Bombyx mori (Bmor), Coccomyxa sp. C-169 (C169), Chlorella sp. NC64a (C64a), Caenorhabditis briggsae (Cbri), Caenorhabditis elegans (Cele), Coprinopsis cinerea okayama (Ccin), Cochliobolus heterostrophus C5 (Chet), Coccidioides immitis (Cimm), Ciona intestinalis (Cint), Cryptococcus neoformans var. neoformans (Cneo), Chlamydomonas reinhardtii (Crei), Capitella teleta (Ctel), Capsaspora owczarzaki (Cowc), Dictyostelium discoideum (Ddis), Dictyostelium purpureum (Dpur), Drosophila melanogaster (Dmel), Drosophila mojavensis (Dmoj), Daphnia pulex (Dpul), Danio rerio (Drer), Entamoeba dispar (Edis), Entamoeba histolytica (Ehis), Emiliania huxleyi (Ehux), Fragilariopsis cylindrus (Fcyl), Phanerochaete chrysosporium (Fchr), Phaeodactylum tricornutum (Ftri), Gallus gallus (Ggal), Gibberella zeae (Gzea), Hydra magnipapillata (Hmag), Helobdella robusta (Hrob), Homo sapiens (Hsap), Ixodes scapularis (Isca), Laccaria bicolor (Lbic), Lottia gigantea (Lgig), Micromonas sp. RCC299 (M299), Monosiga brevicollis (Mbre), Mucor circinelloides (Mcir), Mycosphaerella fijiensis (Mfij), Mycosphaerella graminicola (Mgra), Magnaporthe grisea (Mgri), Melampsora laricis-populina (Mlar), Micromonas pusilla (Mpus), Neurospora crassa (Ncra), Nematostella vectensis (Nvec), Nasonia vitripennis (Nvit), Ostreococcus sp. RCC809 (O809), Ostreococcus lucimarinus (Oluc), Oryza sativa japonica (Osat), Ostreococcus tauri (Otau), Phytophthora capsici (Pcap), Plasmodium falciparum (Pfal), Puccinia graminis (Pgra), Pediculus humanus (Phum), Phaeosphaeria nodorum (Pnod), Physcomitrella patens subsp. patens (Ppat), Phytophthora ramorum (Pram), Pyrenophora tritici-repentis (Prep), Proterospongia sp. (Prsp), Phytophthora sojae (Psoj), Paramecium tetraurelia (Ptet), Plasmodium vivax (Pviv), Plasmodium yoelii yoelii (Pyoe), Rhizopus oryzae (Rory), Sorghum bicolor (Sbic), Saccharomyces cerevisiae (Scer), Schizosaccharomyces japonicus (Sjap), Schistosoma mansoni (Sman), Selaginella moellendorffii (Smoe), Schizosaccharomyces pombe (Spom), Spizellomyces punctatus (Spun), Strongylocentrotus purpuratus (Spur), Sporobolomyces roseus (Sros), Sclerotinia sclerotiorum (Sscl), Trichoplax adhaerens (Tadh), Theileria annulata (Tann), Tribolium castaneum (Tcas), Toxoplasma gondii (Tgon), Taeniopygia guttata (Tgut), Theileria parva (Tpar), Thalassiosira pseudonana (Tpse), Tetrahymena thermophila (Tthe), Ustilago maydis (Umay), Uncinocarpus reesii (Uree), Volvox carteri (Vcar), Vitis vinifera (Vvin).

    Rogozin et al. Biology Direct 2012, 7:11 Page 12 of 28http://www.biology-direct.com/content/7/1/11

    analysis of orthologous gene sets. These results are generally compatible with the emerging notion of intensive insertion and loss of introns during transitional epochs in contrast to the relative quiet (stasis) of the intervening evolutionary spans [91,137,143]. The prevalence of intron gain over intron loss in evolving families of paralogs remains a somewhat controversial issue. It has been suggested that the Dollo parsimony approach used by Babenko and co-workers [91] could significantly underestimate the rate of intron losses [144]. However, even should that be the case, the independently estimated number of intron gains in the same data set that was used in the original work of Babenko and co-workers [91] still exceeded the number of intron losses [144]. Furthermore, numerous anecdotal observations (e.g., [93,145-147]) have suggested that at least some paralogous gene families have gained more introns than they have lost.

    In contrast, comparison of the exon–intron structures of ancient eukaryotic paralogs reveals the absence of conserved intron positions in these genes (Figure 8) [148]. This is in contrast to the conservation of intron positions in orthologous genes from even the most evolutionarily distant eukaryotes and in more recent paralogs (Figure 8) [91]. The lack of conserved intron positions in ancient eukaryotic paralogs probably reflects the origin of these genes during the earliest phase of eukaryotic evolution that was characterized by concomitant invasion of genes by group II self-splicing elements (which were to become spliceosomal introns subsequently; see below) (Figure 9) and extensive duplication of genes [148,149]. Similar estimates were obtained for parallel intron gains in ‘pseudo-paralogous’ genes encoding cytosolic and mitochondrial ribosomal proteins that by definition have acquired their introns independently: approximately 2.3% of the intron positions were found in homologous positions [150]. The lack of conserved introns in ancient eukaryotic paralogs [148,150] is consistent with the results of an earlier analysis of intron distribution in the 20 most ancient (duplicated before the divergence of bacteria and archaea) paralogous families which appear to have accumulated introns independently [151]. Along with other lines of evidence, these observations do not seem to be compatible with the introns-early hypothesis.
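The Dollo parsimony principle mentioned above treats a character (here, an intron at a given position) as something that can be gained only once but lost any number of times. The following is a minimal sketch of that counting rule for a single intron position; the tree, the taxon names and the function names are hypothetical, and the code does not attempt to reproduce the actual procedure or data set of [91].

```python
# Toy Dollo parsimony for one intron position (illustrative only).
# A tree is a nested tuple of leaf names; all names here are hypothetical.

def leaves(node):
    """Return the set of leaf names under a node."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def dollo_events(tree, present):
    """Return (gains, losses) for a presence/absence pattern.

    Dollo assumption: the intron is gained at most once, on the branch
    leading to the last common ancestor (LCA) of all carriers, and every
    absence inside that LCA's clade is explained by loss.
    """
    present = set(present) & leaves(tree)
    if not present:
        return 0, 0

    # Walk down from the root to the LCA of all intron-bearing leaves.
    node = tree
    while not isinstance(node, str):
        carriers = [c for c in node if leaves(c) & present]
        if len(carriers) != 1:
            break
        node = carriers[0]

    def count_losses(n):
        # One loss per maximal intron-free clade hanging off the carrier subtree.
        if isinstance(n, str):
            return 0
        total = 0
        for child in n:
            if leaves(child) & present:
                total += count_losses(child)
            else:
                total += 1
        return total

    return 1, count_losses(node)


toy_tree = ((("human", "mouse"), "fly"), ("plant", "yeast"))
print(dollo_events(toy_tree, {"human", "plant"}))   # position shared by distant taxa
print(dollo_events(toy_tree, {"human", "mouse"}))   # lineage-specific position
```

In this toy setting, a position shared by distant taxa forces the single gain back to the root and converts every intron-free clade into a loss, whereas a lineage-specific position is scored as one gain and no losses.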

    Evolution of exon-intron structure in connection with other features of eukaryote genes

    The combined advances of comparative genomics and systems biology provide means to characterize genes by many features, for example expression level and connectivity in protein-protein interaction or regulatory networks. Various significant correlations have been demonstrated to exist between these variables; in particular, one of the most prominent, recurrent links is that the sequence of highly expressed genes tends, on average, to be more conserved [152-154]. Connections between various features of introns and other characteristics of genes also have emerged. Here, we discuss the links between two key features of introns, the rates of gain and loss and intron length, and other aspects of gene evolution, expression and function.

    Probabilistic evolutionary reconstruction of gene structure yields gene-specific rates of intron gain and loss and thus provides for analysis of possible relationships between these rates and other characteristics of the respective genes [87]. It has been shown that intron gain rate was negatively and significantly correlated with the sequence evolutionary rate; conversely, intron loss rate was positively and significantly correlated with the rate of sequence evolution. Thus, perhaps somewhat counter-intuitively, highly conserved genes appear to accumulate introns in the course of evolution, even if slowly. Also significant, although of a lesser magnitude, was the positive correlation between gene expression level and intron gain rate and the converse negative correlation of expression with intron loss rate. This finding suggests that introns might contribute to optimal gene expression [155] although this effect is confounded by the stronger connection between expression and evolution rate.

    Figure 8 Conservation of intron positions in ancient and recent eukaryotic paralogs. Conservation of introns was assessed by analysis of multiple alignments of paralogous sequences from 6 species (H. sapiens, C. elegans, D. melanogaster, S. pombe, S. cerevisiae, A. thaliana). An intron position was considered to be conserved if it was shared by any pair of paralogs [148]. [Bar chart values: recent paralogs, 3631 total, 2282 shared; ancient paralogs, 7584 total, 126 shared.]

    Although expression may be enhanced by the mere presence of multiple introns in a gene, highly expressed genes in human and Drosophila have, on average, shorter introns than genes expressed at a lower level [156]. This finding has been subsequently validated and expanded by several independent research groups on other model eukaryotes, for exon lengths as well, and for a variety of methods used to measure expression level [157-165]. Two competing (although not necessarily mutually exclusive) hypotheses have been proposed to explain the apparent link between gene compactness and expression. The selection hypothesis holds that evolution of highly expressed genes is driven by selection for minimization of the time of transcription and/or energy expenditure resulting in shrinking of these genes, especially introns [156]. The alternative view, known as the genomic design hypothesis, holds that genes that are expressed under tight tissue- and developmental stage-dependent control require complex regulation and therefore need long introns to accommodate additional regulatory elements. Under the genomic design view, due to the positive association between the breadth and rate of gene expression, genes that are constitutively expressed at a high level and do not require complex regulation possess shorter introns [160].

    Figure 9 A hypothetical scenario of early history of spliceosomal introns. The scheme shows the inferred sequence of events from putative ancestors of eukaryotes to the origin of spliceosomal introns from group II introns invading the host genome upon mitochondrial endosymbiosis [46]. [Diagram labels: α-proteobacterial ancestor of mitochondria; Archaea-like ancestor of eukaryotes; Group II introns/retroelements; Group II introns/retroelements, restricted spread; α-proteobacterial endosymbiont; Emerging proto-eukaryotic cell; Massive invasion of Group II introns into host genome; Origin of spliceosomal introns/genes in pieces.]

    Surprisingly, however, the opposite trend has been reported to exist in plants, with highly expressed genes containing longer introns [166]. This discrepancy was resolved by examining the relationship between gene length and expression level at a finer resolution: the relationship between intron length (as well as other measures of gene compactness such as the length of exons or entire genes) and expression level is universal across all eukaryotes (for which a sufficient amount of data on expression was available) but is non-monotonic [167]. Genes that are highly expressed indeed tend to have shorter introns but genes expressed at low to medium levels show a positive correlation between intron length and expression; hence a roughly bell-shaped dependency between expression level and intron length (Figure 10) [167]. The phenomena that underlie this non-monotonic dependency are not quite clear but might involve competition between two opposing trends. Selective pressure to maximize expression rate and minimize energy expenditure could be dominant in highly expressed genes as originally suggested [156]. In contrast, requirement for more complex regulation in moderately expressed genes that gain additional functions with increased expression might result in the positive correlation between intron length and expression [167].
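A non-monotonic dependency of this kind is easy to miss with a single correlation coefficient but becomes visible once genes are grouped into expression categories and the mean intron length is plotted per category, as in Figure 10. The sketch below illustrates that binning step in Python. The equal-size rank binning, the function name and the synthetic data are all assumptions made for illustration; the actual procedure of [167] may differ.

```python
import math


def binned_profile(expression, intron_length, n_bins=30):
    """Mean intron length and standard error per expression category.

    Genes are ranked by expression and split into n_bins categories of
    (nearly) equal size; this binning rule is an assumption of the sketch.
    """
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    size = len(order) / n_bins
    profile = []
    for b in range(n_bins):
        members = order[int(b * size):int((b + 1) * size)]
        values = [intron_length[i] for i in members]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        profile.append((mean, math.sqrt(var / len(values))))
    return profile


# Synthetic, deterministic toy data: a bell-shaped trend plus a small wobble.
n_genes = 3000
expr = [i / (n_genes - 1) for i in range(n_genes)]
length = [1000 + 4000 * math.exp(-((x - 0.42) / 0.25) ** 2) + 200 * math.sin(i)
          for i, x in enumerate(expr)]

profile = binned_profile(expr, length)
means = [m for m, _ in profile]
peak = means.index(max(means))
print("category with longest introns:", peak + 1, "of", len(means))
```

With real data, a peak in an interior category rather than at either end would be the signature of the bell-shaped relationship described above.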

    A population-genetic perspective on evolution of introns and eukaryotic gene architecture

    The question famously posed by Walter Gilbert in the seminal note on the origin of splicing [1] - Why Genes in Pieces? - certainly remains pertinent to this day. To further sharpen the question: Why are some genomes, in particular those of multicellular eukaryotes (plants and animals), intron-rich whereas others, i.e. those of the great majority of unicellular eukaryotes, are intron-poor? In principle, accumulation of introns in genes of multicellular organisms could be considered as an adaptation that ensures evolution of organismal complexity, especially via AS. This is indeed the position taken by the proponents of the genomic design hypothesis discussed in the preceding section. However, a simpler explanation that appears to be better compatible with the data has been proposed by Lynch as part of the non-adaptive theory for the evolution of complexity [42,49,50,168,169]. A simple estimate based on the number of nucleotide sites required for accurate intron excision during splicing (that is, the donor and acceptor sites and the branching point motif) shows that the power of purifying selection is sufficient to eliminate the majority of introns only in populations with a large effective population size (Ne) such as found in many unicellular eukaryotes (Ne ~ 10^7-10^8) [50,170] but not in the relatively small populations of vascular plants and vertebrates (Ne ~ 10^5-10^6 and 10^4-10^5, respectively) [50,170,171]. Numerical simulations based on this estimate reveal a phase transition-like shift from intron-rich to intron-poor genomes [50,168,169] which roughly matches the observed distribution of intron densities (see Figure 2).

    This non-adaptive population genetic perspective on the evolution of introns and eukaryotic gene architecture is compatible with the results of empirical reconstruction according to which the general (perhaps counter-intuitive) trend is evolution of intron-poor genomes in multiple lineages from intron-rich ancestors (see Figure 2). This evolutionary trend appears to be a form of ‘genomic streamlining’ occurring in evolutionarily successful lineages that reach high effective population sizes, which limit the effects of genetic drift and allow selection to eliminate even slightly deleterious features such as introns. Conversely, the apparent bursts of intron gain linked to the origin of major groups of eukaryotes such as the Metazoa would coincide with population bottlenecks which are typical of such transitional epochs [42,49,50,172]. The non-adaptive population genetic concept is also compatible with the finding that intron-rich organisms possess much weaker donor splice signals than intron-poor organisms: the pressure of purifying selection in intron-rich lineages is insufficient to strictly maintain the consensus nucleotides at the donor sites [51]. A more direct analysis that compared the rates of consensus-to-variant and variant-to-consensus substitutions in the donor sites of three intron-rich lineages supported the existence of purifying selection against variants that, however, is too weak to maintain the consensus in most of the introns [52].

    Figure 10 Total intron length as a function of expression level category. Intron length is measured in nucleotides. Expression levels are binned into 30 categories, with higher categories matching higher expression levels, as described previously [167]. Each point is the mean value for all genes in the given expression category, and the error bar indicates the standard deviation of the mean. [Plot axes: expression level category (x, 0 to 30); intron length (y, 0 to 8 x 10^4).]

    A major consequence of the inability of purifying selection in small populations to eliminate introns or to maintain strong donor splice signals is the accumulation of aberrant splice variants. Such error-prone splicing could eventually give rise to functional alternative splicing. Notably, the latest scenario of intron gain and loss in widespread eukaryotic genes includes only intron-rich intermediates on the path of evolution from the LECA to mammals (see above; Figure 7), with the implication that this line of descent never went through a stage of strong purifying selection, thus allowing continuous evolution of alternative splice variants [53].

    Although the non-adaptive population genetic theory appears to be the best available conceptual framework for the evolution of eukaryotic gene architecture, splicing and introns, at least two notable problems remain outstanding. First, it is unclear why the acceptor splice signal does not follow the same trend as the donor site and is stronger in intron-rich multicellular eukaryotes than it is in intron-poor unicellular forms, although the observed positive correlation between the strength of the donor splicing signal and the combined strength of the branch point signal + acceptor splice signal [117] might explain this incongruence. Second, the preservation of at least a few introns even in the most intron-poor organisms remains enigmatic because at face value the non-adaptive scenario would predict complete loss of introns and accordingly the spliceosome in multiple lineages.
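The population-size argument made earlier in this section can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that the selective cost of carrying an intron is roughly s = n x u, where n is the number of nucleotides that must remain intact for correct splicing and u is the per-site mutation rate, and applies the common rule of thumb that selection overcomes drift only when 2*Ne*s exceeds about 1. The values n = 30 and u = 2e-9 are illustrative assumptions, not figures taken from [50,168,169], and the function name is hypothetical; only the Ne ranges come from the text.

```python
def scaled_selection(ne, n_sites=30, mu=2e-9):
    """Return 2*Ne*s, taking s = n_sites * mu as the selective cost of an intron.

    n_sites: nucleotides that must stay intact for correct splicing (assumed).
    mu: mutation rate per site per generation (assumed).
    """
    return 2 * ne * n_sites * mu


# Ne ranges as quoted in the text (low end, high end).
populations = {
    "unicellular eukaryotes": (1e7, 1e8),
    "vascular plants": (1e5, 1e6),
    "vertebrates": (1e4, 1e5),
}

for name, (low, high) in populations.items():
    lo_val, hi_val = scaled_selection(low), scaled_selection(high)
    verdict = "selection can act" if hi_val > 1 else "drift dominates"
    print(f"{name}: 2*Ne*s from {lo_val:.3g} to {hi_val:.3g} ({verdict})")
```

Under these assumed parameters, only the largest populations cross the threshold, which is the qualitative pattern the non-adaptive theory invokes; different choices of n and u would shift the boundary.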

    Evolution of alternative splicing in coding and non-coding regions of eukaryote genes

    In multicellular organisms, particularly animals, AS is a major mechanism for regulating gene expression and function [173-176]. Large-scale studies based on mapping of expressed sequence data on genomic sequences and RNAseq surveys have shown that more than 90% of human and over 40% of Arabidopsis thaliana and rice genes are capable of producing multiple diverse mRNA molecules through alternative splicing [177-183].

    Alternative splicing has been identified in many eukaryotic groups; however, it remains unclear whether frequent alternative splicing emerged early in eukaryotic evolution [176,184] because ancestral splice signals were weak and failed to provide for highly accurate splicing, or has evolved more recently and independently in multiple lineages via mutation of strong ancestral splice signals in multi-intron genes [33]. As pointed out in the previous section, the non-adaptive population genetic model that is in excellent agreement with the empirical reconstructions of eukaryote gene architecture evolution implies that AS evolved already at the earliest stages of eukaryote evolution through accumulation of aberrant splice variants under conditions of weak purifying selection. A further implication of this scenario is that initially all alternative transcripts were non-functional whereas functional AS evolved gradually and independently in multiple lineages, primarily those that have never gone through population bottlenecks leading to extensive loss of introns and tightening of splice signals.

    The impact of alternative splicing on protein function has been studied in great detail, and alternative splicing is generally recognized as a major source of protein diversity that greatly expands the repertoire of protein function [173-175]. A systematic comparison of 9 animal genomes from nematodes to mammals revealed that intron-flanking domains expanded faster than other protein domains [185]. Intriguingly, such mobile domains exhibited a strong preference for phase 1 introns [185-188] in contrast to the general excess of phase 0 among introns (Figure 6). This finding suggests that evolution of introns flanking mobile domains is fundamentally different from the evolution of introns in conserved portions of genes but the nature of these differences remains to be elucidated [185,187,188].

    Evolutionary conservation of alternative splicing is a controversial matter. Only limited conservation of alternatively spliced (cassette) exons within mammals and within dipterans has been detected [189-193]. However, a strikingly different pattern has been reported for Caenorhabditis nematodes: more than 92% of cassette exons found in C. elegans are conserved in C. briggsae and/or C. remanei [194]. The differences in conservation between lineages might reflect differences in the fractions of functional alternative transcripts but possibly also differences in intron length and the strength of splicing signals [194].

    Evolution of alternative splicing has been also analyzed in the context of splicing signals [195]. The GT dinucleotide in the first two intron positions is the most conserved element of the U2 donor splice signal. However, in a small fraction of donor signals (


    nested within introns. In addition, the possibility has been discussed that introns might act as ‘catalysts’ of evolution by facilitating intergenic recombination (this could be considered a variation on the theme of generic non-coding DNA functions). Experimentally demonstrated and potential functions of introns have been reviewed in detail [214,215]. Here we do not attempt a comprehensive coverage of this subject but rather briefly discuss several aspects that appear directly relevant for understanding evolution of introns and eukaryote gene structure.

    Functions of introns associated with splicing

    Splicing occurs before mature mRNAs are transported from the nucleus to the cytosol by the nuclear export system. Numerous studies indicate that splicing and mRNA export are directly coupled (see reviews [32,35]). Evidence of such coupling initially came from the observation that mRNAs generated by splicing are more efficiently exported than their identical counterparts transcribed from a complementary DNA [216]. This effect of splicing on export was explained by the finding that spliced mRNAs (but not cDNA transcripts) are assembled into a distinct mRNP complex that promotes efficient export [32,35,216]. This complex, or at least some of its components, has been subsequently shown to assemble adjacent to newly formed exon–exon junctions [217]. The increased export efficiency of the spliced mRNP is thought to be due to recruitment of the mRNA export factor ALY to the mRNA during the splicing reaction [218,219]. The splicing factor UAP56, which interacts directly with ALY, plays a role in recruiting ALY to the spliced mRNA [220-222]. In a subsequent step, a hand-off occurs in which the ALY/TAP interaction is established, thus delivering the mRNP to the nuclear pore for export [221]. The numerous eukaryotes that possess only a few introns in the entire genome nevertheless retain a full-fledged or partially degraded spliceosome machinery [8,65,223], suggesting the possibility that the spliceosome might have functions other than splicing as such, perhaps primarily nucleocytoplasmic trafficking. However, the transport mechanisms for numerous intronless transcripts are not well characterized, and the possibility remains that intronless RNAs are recruited to the export machinery via a spliceosome-independent route [32,35]. Compatible with this hypothesis, UAP56 is required for export of both spliced and intronless mRNAs [220-222,224]. In metazoan intronless mRNAs, specific mRNA sequence elements are required for export, and some of these elements associate with members of the SR family of splicing factors which are thought to mediate export of the intronless mRNA [225,226]. The SR proteins could either recruit the conserved export machinery or play a direct role in export [226]. In both yeast and metazoans the export of intronless mRNAs also could be coupled to polyadenylation [32,35,226,227]. It has been shown that in mammalian neurons some retained introns are coupled with targeting of mRNA sequences to dendrites, apparently via so-called ID sequences that represent a distinct class of SINE retrotransposons resident in the retained introns [228]. Thus, functionally relevant retention of intronic sequence might be a more general phenomenon than previously suspected.

    The speed of splicing could be another important mechanism of gene expression regulation [27,28]. Analysis of minor, U12 introns (see above) suggested that their positions are conserved in orthologous genes from human and Arabidopsis to an even greater extent than the positions of the major, U2 introns [29]. The U12 introns, especially conserved ones, are concentrated in 5'-portions of plant and animal genes, whereas the U12 to U2 conversion occurs preferentially in the 3'-portions of genes. These results are compatible with the hypothesis that the high level of conservation of U12 intron positions and their persistence in genomes, despite the unidirectional U12 to U2 conversion, have to do with the role of the slowly excised U12 introns in down-regulation of gene expression [27-29,229].

    As already pointed out above, introns in yeast ribosomal protein genes substantially affect the expression of these genes and contribute to the organismal fitness and stress response via mechanisms that are not yet well understood [57]. These seminal findings indicate that in many cases the regulatory functions of introns could be specific to a class of genes or even an individual gene. This conclusion is compatible with the results of an earlier study which has shown that the yeast spliceosome can distinguish between different transcripts including related ones, such as paralogous ribosomal protein genes, thus providing a distinct regulation mode for expression of specific proteins [230].

    Introns as functionally important non-coding DNA sequences

    Compared to prokaryotes, eukaryotes possess a much greater number of multidomain proteins that substantially contribute to the functional complexity of the eukaryotic cell [187,188,231-234]. Moreover, a striking feature of eukaryotic protein architectures is the wide spread of the so-called promiscuous domains that combine with other domains much more often than expected by chance [234,235]. The ‘exon theory’ posits that exon shuffling via recombination within introns is an important route of evolution that in particular is responsible for the diversity of the domain architectures of multidomain proteins [39,40,236]. In the specific case of vertebrate membrane receptor proteins, this hypothesis seems to be compatible with empirical observations: these proteins consist of multiple modules each of which typically is encoded by an individual exon [185,187,188]. However, in other classes of proteins, there is no strong preference for intron location between domains, so exon shuffling is unlikely to be a major, general mechanism of multidomain protein evolution [43,135,185,187,188,234].

    Introns have the potential to serve as “enhancers” of meiotic crossing-over occurring within protein-coding genes because the probability of crossing over between segments of a coding sequence (exons) separated by long introns greatly increases compared to the same coding sequences in the absence of an intron [214,237]. This meiotic recombination between exons of two alleles of the same gene is likely to be a major factor of protein evolution through combining mutations from different alleles, “trying out” different combinations and avoiding accumulation of deleterious mutations within the same allele [1,214,237].

    Trans-splicing is a special form of RNA processing whereby exons from two different primary RNA transcripts are joined end-to-end and ligated. The most common form of trans-splicing is spliced-leader (SL) trans-splicing where the leader is donated by a short SL RNA. The SL trans-splicing is widespread among some unicellular eukaryotes, in particular trypanosomes [238]. Other than trypanosomes, the only organisms known to heavily rely on SL trans-splicing for gene expression are nematodes. More than half of the pre-mRNAs in the Caenorhabditis nematodes are trans-spliced to one of two short leader RNAs, SL1 or SL2. This process occurs at the 5' ends of pre-mRNAs, and it is essential for the efficient processing of polycistronic pre-mRNAs [35,239-242]. The patchy distribution of trans-splicing suggests that SL trans-splicing has evolved repeatedly among eukaryotic lineages and SL precursor RNAs have readily evolved from ubiquitous small nuclear RNAs that are involved in conventional splicing [243]. Several cases of trans-splicing between different pre-mRNAs (no SL RNAs are involved) have been identified in tunicates, mammals, flies and plants (reviewed in [214,242,244,245]).

    Functional elements and genes within introns

    Some introns contain various regulatory elements as well as sequences involved in chromatin structure formation such as scaffold-matrix attachment regions, although it remains uncertain whether intron sequences show any substantial enrichment for regulatory and structural elements compared to other non-coding DNA [214,246]. Some long introns, especially those in 5'-terminal regions of coding sequences, might be enriched for various regulatory elements, and consequently, could be subject to purifying selection [160,247-253]. Long introns in several genes of Oikopleura have been shown to contain key developmental regulators [131], and similar observations have been reported for genes involved in development of diverse metazoans [254-257] as well as associated “bystander” genes that are not known to be directly involved in development [258-261].

    Many introns contain within their sequences various non-coding RNA genes, especially numerous genes for snoRNAs [262,263] and precursors of microRNAs [264,265]. Specifically, some short animal introns with hairpin formation potential, known as mirtrons, can be spliced and debranched into pre-miRNAs [266-268]. These pre-miRNAs are then cleaved by the RNase III enzyme Dicer and incorporated into typical miRNA silencing complexes [268,269].

    A small fraction of introns contain nested protein-coding genes [270]. Comparative analysis of these nested genes in vertebrates, fruit flies and nematodes revealed substantially higher rates of gain of intron-embedded genes compared to loss [271]. However, the accumulation of nested gene structures is likely to represent an increase of organizational complexity of animal genomes via a neutral process given that there seem to be no functional links between the intron-contained genes and the ‘host’ genes [271]. Effectively, it seems that introns serve as neutral substrate that can be randomly colonized by various genes.

    Molecular mechanisms of intron loss and gain

    Mechanisms of intron loss and gain remain poorly understood. A plausible, common mechanism for intron loss could be homologous recombination between cDNAs that are produced by reverse transcription and the genomic copies of the respective genes [65,67,107-110,112]. Intron gain/loss events must be associated with a transient phase of segregating alleles either carrying or lacking the intron within natural populations [49]. Until now, only 25 transient intraspecific intron presence-absence polymorphisms have been reported, one in Drosophila teissieri [272] and 24 in Daphnia pulex [70,96]. In Daphnia, recently gained intron sequences were frequently associated with short repeats, suggesting a role for double-strand break repair in intron gain [96]. Analysis of several closely-related fungi revealed 74 presence-absence polymorphisms of introns [273]. Examination of the positions of these introns has suggested that extensive intron trans