Article
Evolution of the U2 Spliceosome for Processing Numerous and Highly Diverse Non-canonical Introns in the Chordate Fritillaria borealis
Highlights
• F. borealis has lost most of its old introns, and it gained new ones by transposition
• New introns do not conform to the GT/AG rule, and they display various splice sites
• The U2 spliceosome is conserved and responsible for removing the new introns
• Larvacean tunicates have evolved specific mechanisms to remove non-GT/AG introns
Henriet et al., 2019, Current Biology 29, 3193–3199. October 7, 2019 © 2019 Elsevier Ltd. https://doi.org/10.1016/j.cub.2019.07.092
Authors
Simon Henriet, Berta Colom Sanmartí,
Sara Sumic, Daniel Chourrout
In Brief
The origin of introns in eukaryotes is a
mystery, and the majority have nothing in
common except GT/AG ends that play a
crucial role during splicing. Henriet et al.
show that transposition creates a large
number of new introns with non-GT/AG
splice sites. A spliceosome with
conserved components, but with a new
selectivity, removes these introns.
Simon Henriet,1 Berta Colom Sanmartí,1 Sara Sumic,1 and Daniel Chourrout1,2,3,*
1Sars International Centre for Marine Molecular Biology, University of Bergen, 5006 Bergen, Norway
2Key Laboratory of Marine Genetics and Breeding, Ocean University of China, Ministry of Education, Qingdao 266003, China
3Lead Contact
*Correspondence: [email protected]
SUMMARY
An overwhelming majority of eukaryotic introns have GT/AG ends, whose identities play a critical role in their recognition and removal by the U2 spliceosome, a well-conserved complex of proteins and RNAs. Introns with other splice sites exist at very low frequencies in various genomes, and some of them are processed by the U12 spliceosome. Here, we show that, in the chordate Fritillaria borealis, the majority of old introns have been lost and replaced by introns with highly divergent splice sites. The new introns of F. borealis are exceptionally diverse, though more frequently AG/AC or AG/AT, and features of thousands of them support an origin from transposons. They cannot be processed in human cells, but their splicing is rescued by mutating terminal dinucleotides to GT/AG. With lariat sequencing and splicing inhibitor assays, we show that F. borealis introns are spliced by the U2 spliceosome, which thus evolved to a different selectivity, with neither novel U1 small nuclear RNA (snRNA) types nor major remodeling of its protein and snRNA complements. This genome-wide recolonization by non-canonical introns emphasizes the importance of transposons as a resource of novel introns in a context of massive intron loss. An evolution of the spliceosome may also permit the neutralization of harmful transposons through their conversion into introns.
INTRODUCTION
Spliceosomal introns are generally stable over long evolutionary
times [1], but in some rapidly evolving lineages, many old introns
have been lost and new introns with canonical splice sites have
been gained. In the tunicate larvacean Oikopleura dioica, an
important turnover of introns was revealed and possibly ex-
plained by at least two distinct mechanisms of intron gains [2].
Notably, a significant minority of introns had slightly modified
G(non-T)/AG splice sites, together with characteristic size distri-
butions and an unusual A-rich tail. More drastic changes of
intron-exon organization were observed outside metazoans,
such as in two species of microalgae, for which thousands of in-
trons show features of transposable elements, such as repeated
and palindromic sequences [3]. How such new introns arose
from intragenic insertions was not addressed, presumably
because their splice sites were either canonical (GT/AG) or nearly
canonical (GC/AG). In a limited sample of introns from the unicellular
alga Euglena gracilis, introns with more divergent splice
sites were found, and they also seem to originate from transposons
[4, 5]. They were named “nonconventional” as opposed to spliceosomal
GT/AG introns, even though no experiment has shown,
for either type, whether or not the spliceosome is involved
[5]. This issue is central for the case that we report
here, as most introns of fully sequenced tunicate genomes,
also acquired from transposable elements, have highly divergent
splice sites. Our results support that an evolution of the splicing
machinery was required for transforming transposons into introns
on a genome scale.
RESULTS
Most Introns of Fritillaria borealis Are Non-Canonical and Were Acquired from Transposons
When studying the newly sequenced genomes of eight larvacean
species [6], we discovered that the vast majority of
F. borealis introns have non-canonical splice sites (Figures 1A,
1B, and S1). For their precise delimitation, we developed and im-
plemented an ad hoc approach, which also detects canonical in-
trons and the rare non-canonical introns of other genomes (see
STAR Methods and Figure 7A). In F. borealis, the most abundant
intron types are AG/AC and AG/AT, but a broad diversity of other
splice site combinations is also found. Overall, F. borealis genes
have fewer introns than in other animal species, and most of
them have one or more non-canonical introns (Figure S2). Half
of the minority GT/AG introns have positions conserved beyond
tunicates, supporting a relatively ancient origin (Figure S2). Non-
canonical introns have species-specific positions in the coding
sequences, and their origin from transposable elements is sup-
ported by observing that they have (1) several to many copies
in the genome (Figures 1A, 1B, S3, and S4), (2) a distinctive
and narrow size distribution (Figure 1A), (3) terminal inverted re-
peats (TIRs) that can be identical (Figures 1B, S3, and S4), and (4)
short flanking direct repeats evoking target site duplications (Fig-
ures 1A, 1B, S1, S4B, and S4C). Most non-canonical introns are
indeed preceded by an exonic triplet TAC (or less frequently TAT;
Figure 1. Main Features of F. borealis
Introns
(A) Left side: classification of 5,645 introns based
on transcriptome and genome alignment and
annotation after longest open reading frame (ORF)
orientation (large pie). Small pies represent for
each intron type the relative proportions of TAC or
TAT exonic triplets preceding the intron, which are
in majority for all non-canonical types and in small
minority for canonical introns. Right side, top:
distribution of intron sizes for each main category
of introns is shown. The very large majority of AG/
AN introns are 100–250 bp long, an important part
of GT/AG introns are longer, and GC/AN introns
have an intermediate size distribution. Right side,
bottom: relative repetition level in each intron
category based on pairwise alignments of all in-
trons is shown. AG/AN introns are by far those
having more homologs, and GT/AG introns have
virtually none.
(B) Logos of intron sequences, most of which are
non-canonical, with the characteristic TAC or TAT
pre-intron triplet. Canonical introns have a T-rich
tail, in contrast to non-canonical ones. The third
highly conservative logo for 107 repeated introns
of one subgroup displays palindromic ends.
See also Figures S1, S2, S3, and S4 and Table S2.
Figure S4D), compatible with the duplication of a TA target site,
which generated the intron 3′ end. These features altogether
support an origin from MITEs (miniature inverted-repeat trans-
posable elements) integrating after a TA site (Figure S4E), as
Tc1/Mariner transposons do [7]. An unbiased survey
retrieved 10,295 non-redundant copies of MITEs in the genome
of F. borealis [6]. Homology search showed sequence similarities
of 6,015 of them with 3,609 annotated introns. Among these intron-related
MITEs, 939 are part of a collection of 19,214 introns identified
by genome to transcriptome alignment. By comparing the
protein-coding potential of their flanking regions with those of
annotated introns (BLASTX on UniProtKB/SwissProt), we estimate
that another 15% of the MITEs not validated by alignments
with transcripts may also be introns. The mechanisms of intron or
MITE mobilization in F. borealis are still unclear, as DNA transpo-
sons present in the genome did not show conspicuous similarity
with TIRs present at intron and MITE ends.
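Two of the features used here to argue for a transposon origin, terminal inverted repeats and a TAC or TAT exonic triplet preceding the intron, can be checked on a single intron with a few lines of Python. This is only a minimal sketch and not the procedure used in the study: the function names, the fixed 8-nt repeat length, the mismatch tolerance, and the toy sequences are all assumptions made for illustration.

```python
# Illustrative only: hypothetical helpers, not the pipeline used in the paper.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def has_terminal_inverted_repeat(intron: str, length: int = 8,
                                 max_mismatch: int = 0) -> bool:
    """True if the first `length` bases match the reverse complement
    of the last `length` bases (assumed TIR length: 8 nt)."""
    if len(intron) < 2 * length:
        return False
    left = intron[:length]
    right_rc = reverse_complement(intron[-length:])
    mismatches = sum(a != b for a, b in zip(left, right_rc))
    return mismatches <= max_mismatch


def pre_intron_triplet(upstream_exon: str) -> str:
    """Return the last three exonic bases before the intron."""
    return upstream_exon[-3:]


def looks_mite_derived(upstream_exon: str, intron: str) -> bool:
    """Combine two features: a TAC/TAT pre-intron triplet and a TIR."""
    triplet_ok = pre_intron_triplet(upstream_exon) in {"TAC", "TAT"}
    return triplet_ok and has_terminal_inverted_repeat(intron)


# Toy sequences (made up): an intron with perfect 8-nt inverted ends.
exon = "ATGGCTTAC"
intron = "AGTTGACC" + "A" * 100 + "GGTCAACT"
print(looks_mite_derived(exon, intron))
```

A genome-scale analysis would also need the copy-number and size-distribution criteria, which depend on alignments across all introns and are not modeled in this sketch.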
F. borealis Non-Canonical Introns Are Spliceosomal
Because a presence/absence polymorphism was revealed for
recently gained introns [3], care should be taken to not mistake
ordinary genomic insertions for introns. The compelling argument
for non-canonical insertions into F. borealis genes to represent
real introns is the identification of splicing products (Figure 2A).
The first step of splicing ligates the branch point to the
5′ end of the intron, thus producing an intron lariat that is
released after the second step of splicing. We were able to
detect lariats by sequencing RNA resistant to RNase R [8] and
aligning the reads with our collection of putative introns. Lariat
production was detected for 111 introns in O. dioica and for 44
introns in F. borealis, including 40 non-canonical ones. Lariat sequences
in every case confirmed the proposed 5′ splice site and
pinpointed a branch point very near the 3′ splice site (<13 nt; Figure 2B).
Lariats may also be produced by self-splicing group II
and group III introns, which were shown to exist in some animal
Figure 2. Identification of Splicing Intermediates with Lariat-Seq
(A) Outline of the procedure. RNase R digests the lariat 3′ tail, leaving circular RNA with a 2′-5′ bond at the branchpoint (BP) (black dot). During cDNA synthesis (dotted lines), the BP is an error-prone position (pie charts show error rate for adenine BPs). Reads are mapped back to the genome, and BP position is inferred from split alignment.
(B) Bar graphs represent the BP identity for lariats identified in F. borealis and O. dioica.
(C) BP distance from the 3′ end of the intron.
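The inference of the BP from a split alignment can be sketched as follows. A read crossing the lariat junction aligns in two pieces: one ending at the BP and one starting at the first base of the intron. The sketch below is a simplified illustration under stated assumptions: the `SplitRead` class, the plus-strand 0-based inclusive coordinates, and the distance convention (number of bases between the BP and the last intron base, counting the latter) are hypothetical, and the error-prone reading of the BP nucleotide during cDNA synthesis is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SplitRead:
    """A lariat-junction read aligned in two pieces (hypothetical layout).

    Coordinates are plus-strand, 0-based genomic positions.
    """
    upstream_end: int      # last base of the piece that ends at the BP
    downstream_start: int  # first base of the piece that restarts at the 5' end


def infer_branch_point(read: SplitRead, intron_start: int,
                       intron_end: int) -> Optional[Tuple[int, int]]:
    """Return (bp_position, distance_to_3prime_end), or None if the read
    does not look like a lariat junction for this intron.

    intron_start and intron_end are the first and last intron bases (inclusive).
    """
    # The second piece must restart exactly at the intron 5' end.
    if read.downstream_start != intron_start:
        return None
    # The first piece must end inside the intron, downstream of its 5' end.
    if not (intron_start < read.upstream_end <= intron_end):
        return None
    bp = read.upstream_end
    return bp, intron_end - bp


# Toy example: a 150-nt intron spanning positions 1000 to 1149.
example = SplitRead(upstream_end=1140, downstream_start=1000)
print(infer_branch_point(example, 1000, 1149))  # (1140, 9)
```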
Figure 3. Splicing Inhibitor Assay
(A) RT-PCR was used to visualize splicing effi-
ciency for seven F. borealis introns in specimens
treated with 10 μM Pladienolide B (PlaB), as well as
non-treated animals (DMSO). Primers matched the
flanking exons. Red arrows show products con-
taining an intron; black arrows show splicing
products (lanes 1 and 2, cDNA; lanes 3 and 4,
control without RT; lane 5, control with water; lane
6, control with genomic DNA).
(B) Genome browser view showing RNA-seq
coverage over two genes, in PlaB treated and non-
treated specimens (log scale). Exon coverage is
comparable in the two samples, whereas intron
coverage (indicating intron retention) is increased
by PlaB treatment.
(C) Genome-wide intron retention rate measured
on RNA-seq reads with or without PlaB treatment
(see details in STAR Methods). Only introns with flanking exons covered by at least 50 sequencing reads in control and in treated samples are considered here.
High retention rates are induced by the PlaB treatment.
genomes [9] and whose activity relies on conserved RNA motifs
and an IEP protein. These features were found neither in
F. borealis introns nor in its genome sequence. In a comple-
mentary experiment (Figure 3), the splicing of both canonical
and non-canonical introns was partially inhibited in F. borealis
specimens treated with Pladienolide B (PlaB), a drug known
to interfere with the recognition of the branchpoint by the SF3B
complex during pre-mRNA splicing [10, 11]. The global analysis
of all RNA-seq reads encompassing exon-intron borders
(381,722 for DMSO control and 266,948 for PlaB treated)
showed an increase of intron retention rate in treated specimens,
up to levels that were similar for canonical (from 2%
up to 11.4%) and non-canonical introns (from 1.4% up to
12.9%). The analysis of 1,067 introns, for which at least 50
RNA-seq reads were exploitable in both control and treated
groups, showed that intron retention rates could reach much
higher levels (Figures 3B and 3C), as confirmed with RT-PCR
assays (Figure 3A). All these results concur to support that
the non-canonical (and canonical) introns of F. borealis are
spliceosomal.
Figure 4. Splicing Assays in HEK293T Cells
(A) Twelve introns from six F. borealis genes were cloned in a mammalian expression vector. After cell transfection, splicing of canonical and non-canonical introns was tested with RT-PCR. Stars indicate the three introns modified with in vitro mutagenesis.
(B) Effect of splice site mutation on intron removal. For three distinct introns, splice site mutants (mut1–5) were produced and splicing was tested using RT-PCR. Arrows show primers, red bars indicate at which position mutations were introduced, and black bars indicate cryptic splice sites. The connecting lines show which splice sites were mapped, and the fractions indicate their frequency (lowercase, exon sequence; uppercase, intron sequence).
(C) Gel showing a representative RT-PCR experiment for detecting the splicing of either the wild-type intron Fb6nc (lane WT) or mutants whose splice sites have been modified (lanes mut1–mut5).
See also Table S1.
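The intron retention rates reported for the PlaB assay can be illustrated with a small calculation. The exact definition used in the study is given in the STAR Methods and is not reproduced here, so the sketch below rests on an assumption: retention is taken as the fraction of border-spanning reads that are unspliced, and an intron is kept only if both samples have at least 50 reads. The function names, the input layout, and the toy counts are hypothetical.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int]  # (unspliced reads, spliced reads) at one intron


def retention_rate(unspliced: int, spliced: int) -> float:
    """Assumed definition: unspliced reads / all reads at the intron border."""
    total = unspliced + spliced
    if total == 0:
        raise ValueError("no reads at this intron")
    return unspliced / total


def compare_treatment(counts: Dict[str, Dict[str, Counts]],
                      min_reads: int = 50) -> Dict[str, Tuple[float, float]]:
    """Return {intron_id: (control_rate, treated_rate)} for introns with
    at least `min_reads` reads in both the control and the treated sample."""
    result = {}
    for intron_id, sample in counts.items():
        control, treated = sample["control"], sample["treated"]
        if sum(control) < min_reads or sum(treated) < min_reads:
            continue  # too few reads for a reliable estimate
        result[intron_id] = (retention_rate(*control), retention_rate(*treated))
    return result


# Toy counts (made up for illustration).
toy = {
    "intron_a": {"control": (2, 98), "treated": (30, 70)},
    "intron_b": {"control": (1, 10), "treated": (5, 60)},  # filtered out
}
print(compare_treatment(toy))
```

With the toy counts, only `intron_a` passes the read threshold, and its retention rises from 2% in the control to 30% in the treated sample.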
Figure 5. Characterization of the Spliceosome Components in F. borealis
(A) Two U1 variants were found (arrows show SNPs), with conserved 5′ end and stem-loop motifs. In stem Ic of snRNA U5, the base pairs proximal to loop I are strictly conserved in chordates but have diverged in fritillarids. In snRNA U6, residues predicted to interact with U2 and U4 are highlighted. Residues colored in red are conserved within chordates.
(B) Cap-dependent RNA-seq. The list shows the most abundant RNAs present in the input, in the 100- to 300-nt range. Note that it does not include the splice leader RNA (SL RNA), whose size is under 100 nt. Cap enrichment is determined as the increase in read amount caused by decapping prior to 5′ adaptor ligation.
(C) RIP-seq experiments with anti-TMG and anti-Sm immunoglobulin Gs (IgGs). We compared each immunoprecipitation (IPP) to a control (Ctl) performed with pre-immune serum. Dots show read abundance for exons (gray), introns (yellow), snRNA (red), U3 (pink), and rRNA (blue). Dark dots represent background.
(D) Orthologs of spliceosomal proteins found with genome mining, grouped either by snRNP or by function during pre-mRNA processing.
(E) The maximum-likelihood phylogeny shows the relationships between four groups of SR proteins (SRSF2/8, SRSF3/7, SRSF1/9, and SRSF4/5/6). Nodes with bootstrap values over 0.7 are shown with red circles; the scale bar indicates the number of amino acid changes per position.
See also Figures S5 and S6.
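Cap enrichment, described in (B) as the increase in read amount caused by decapping, could be expressed as a log2 ratio of normalized read counts. The caption does not give a formula, so the sketch below is one plausible reading and not the authors' calculation: the counts-per-million normalization, the pseudocount, the function name, and the example numbers are all assumptions.

```python
import math


def cap_enrichment(decapped_reads: int, untreated_reads: int,
                   decapped_total: int, untreated_total: int,
                   pseudocount: float = 1.0) -> float:
    """Assumed measure: log2 ratio of counts per million for one RNA,
    decapped library versus library prepared without decapping."""
    if decapped_total <= 0 or untreated_total <= 0:
        raise ValueError("library sizes must be positive")
    # The pseudocount avoids division by zero for RNAs absent from one library.
    decapped_cpm = (decapped_reads + pseudocount) / decapped_total * 1e6
    untreated_cpm = (untreated_reads + pseudocount) / untreated_total * 1e6
    return math.log2(decapped_cpm / untreated_cpm)


# Toy numbers: an RNA with about eight times more reads after decapping
# gives an enrichment close to 3 on the log2 scale.
print(round(cap_enrichment(799, 99, 1_000_000, 1_000_000), 3))
```

A capped RNA such as an snRNA would be expected to score well above zero under this measure, whereas an uncapped RNA would stay near zero.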
F. borealis Has Evolved a Specific Ability to Process Non-Canonical Introns
Whether F. borealis introns can or cannot be processed by a
conserved spliceosome was addressed in splicing assays.
Gene fragments from F. borealis containing canonical or non-ca-
nonical introns, native or in vitro mutated, were introduced into
human HEK293T cells, and their transcripts were characterized
(Figure 4). Although canonical introns of F. borealis were accu-
rately removed, its non-canonical introns remained unspliced
(Figure 4A). For one non-canonical intron construct, splicing
did in fact occur but at a cryptic canonical splice site. We then
tested whether changing 5′ss and 3′ss identity could rescue the
splicing of three F. borealis AG/AT introns (Figure 4B). For each
type of mutation, the results showed variations among introns,
but the overall pattern was as follows: (1) sites left non-canonical
were not used, (2) splicing was most often detected if one of the
sites was rendered canonical, and (3) splicing happened exactly
at the intron borders only when both sites were canonical,
though other alternative canonical sites could also be selected.
Overall, the splicing of F. borealis introns in human cells is depen-
dent on the presence of terminal dinucleotides GT/AG. These re-
sults support that the spliceosome of F. borealis evolved new
properties for accurately processing non-canonical introns.
F. borealis Possesses a Single Spliceosome of the U2 Type
Most GT/AG introns are spliced out by the U2 spliceosome,
which comprises five small nuclear ribonucleoprotein particles
(snRNPs), corresponding to snRNA U1–U6 and several other
splicing factors [12]. All snRNAs detected in the F. borealis
genome were of the U2 type (Figure S5A) and not of the U12
type, as earlier observed for the tunicate larvacean O. dioica
[13]. In the classical model, the 5′ splice site is recognized by
the complementary U1 snRNA 5′ end [14]. In F. borealis, we found
two U1 snRNA genes (Figure 5A) with well-conserved 5′ ends that
match canonical but not non-canonical splice sites. The only
change found in snRNAs is a weakly conserved stem Ic in U5 (Figures
5A and S5B), which might affect the helical conformation and
the contacts between Loop I and the pre-mRNA [15]. However,
the co-transfection of F. borealis U5 snRNA with gene constructs
containing non-canonical introns into HEK293T cells did not result
in their splicing. Because standard genome mining may not reveal
highly divergent snRNAs, we experimentally explored the ncRNA
complement for candidates presenting known features of
snRNAs, i.e., a short size, high expression levels, presence of a 5′
tri-methylguanosine cap (TMG), and a predicted ability to bind
Sm proteins (Figures 5B and 5C). We sequenced 100- to 300-base-long
RNAs, whose 5′ end is resistant to exonuclease but efficiently
ligated after decapping. We also performed RIP-seq using
antibodies against TMG or the Sm antigen. Both approaches efficiently
identified all snRNAs of the U2 type but none that might
recognize non-canonical introns through sequence complementarity.
We also identified F. borealis candidate orthologs of 71 proteins
known to participate in spliceosome function [12] (Figure 5D).
The predicted sequences of proteins involved in splice site recognition
showed strong conservation between F. borealis and other
Figure 6. Frequency of Main Intron Types and Intron Densities
across Deuterostomes
Emphasis is on newly sequenced genomes of larvacean tunicates, all other
species showing a large majority of canonical introns. Scale bar, number of
substitutions per site. Pies indicate the relative frequencies of intron types:
yellow area for canonical introns; blue area for nearly canonical introns G(non-
T)/AG; and green area for non-canonical introns. Nearly canonical introns have
relatively high frequencies in oikopleurids, and other non-canonical introns
dominate in fritillarids. Very low frequencies of non-canonical introns are
confirmed in human (H. sapiens), amphioxus (B. floridae), ascidian
(C. intestinalis), and sea urchin (S. purpuratus). All counts result from align-
ments of transcriptomes with genome sequences, except for O. longicauda,
A. sicula, and F. pellucida, for which introns were localized and annotated after
visual inspection of highly conserved gene sequences. Intron densities were
calculated for conserved coding regions of 109 genes from six species,
including F. borealis. For A. sicula, whose genome sequencing coverage is low
(approx. 4×) and transcriptome is not available, a subset of 40 of these 109
genes was used and introns localized using TBLASTN and query sequence
from human and F. borealis (star).
See also Figure S7.
species. Among proteins of the U1 and U2 snRNP, sequence similarity
with human orthologous proteins ranged from 63.7% to
85.5% in well-conserved domains. Higher divergence was
observed in low-complexity regions, such as the C terminus of
U1 and U2 proteins that are also less conserved in other genomes.
Intriguing exceptions were deviations in highly conserved motifs
of U1C and Prp8 (Figures S6A and S6B) and an expansion of
SRSF2/8 genes with conserved RRMs but a variable distribution
of R(X) repeats (Figures 5E and S6C). Because SRSF2 plays an
important role in splice site selection, this expansion could be
investigated further in relation with the processing of new intron
types [16–18].
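The observation that the U1 snRNA 5′ ends match canonical but not non-canonical splice sites can be illustrated by counting Watson-Crick pairs between a 5′ splice site and the U1 5′ end. The sketch below is an illustration under assumptions, not an analysis from the study: it uses nucleotides 3 to 11 of the human U1 snRNA (ACUUACCUG, with pseudouridines written as U) as a stand-in, because the F. borealis sequence is not given in the text; it considers a single pairing register covering site positions −3 to +6; it ignores G-U wobble pairs; and both example sites are made up.

```python
# Assumption: human U1 snRNA nucleotides 3-11, written 5'->3'.
U1_5PRIME = "ACUUACCUG"
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def count_u1_pairs(splice_site: str, u1: str = U1_5PRIME) -> int:
    """Count Watson-Crick pairs between a 9-nt 5' splice site
    (3 exonic + 6 intronic bases, 5'->3') and the U1 5' end.

    The two strands pair antiparallel: site base i faces U1 base (8 - i).
    """
    site = splice_site.upper().replace("T", "U")
    if len(site) != len(u1):
        raise ValueError(f"expected a {len(u1)}-nt splice site")
    return sum((s, u) in WATSON_CRICK for s, u in zip(site, reversed(u1)))


# Consensus-like GT/AG 5' splice site: CAG | GTAAGT
print(count_u1_pairs("CAGGTAAGT"))  # 9
# Made-up non-canonical site: exonic TAC followed by an intron starting with AG
print(count_u1_pairs("TACAGTTGA"))  # 2
```

The contrast between the two counts is the point of the example. Alternative base-pairing registers, which the Discussion raises as a possibility [19], would require scanning shifted alignments and are not covered here.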
The Burst of Introns with Highly Divergent Splice Sites Dates from the Radiation of Tunicate Larvaceans
An interesting question is when the ability to process
numerous non-canonical introns arose. First, we calculated the
frequencies of intron types in a broad panel of species from
sea urchin to human, based on genome to transcriptome align-
ments. We could confirm for all of them a very small percentage
of non-canonical introns (Figure 6), including in the tunicate
ascidian Ciona intestinalis. The frequency of divergent intron
types becomes notable in tunicate larvaceans, for which we
have sequenced three new genomes of oikopleurids
(O. albicans, O. vanhoeffeni, and O. longicauda) and two additional
genomes of fritillarids (F. pellucida and Appendicularia
sicula). High frequencies and again a broad diversity of non-ca-
nonical introns are observed for fritillarids (Figure 6). Among the
three fritillarid species, there are differences in the types and
sizes of non-canonical introns but also key similarities, like the
prevalence of TAT/TAC pre-intron exonic triplets, which sug-
gests an origin from transposons (Figure S7). Interestingly, all
four genomes of oikopleurids show an important minority of
G(non-T)/AG introns, which were previously revealed for
O. dioica, but not investigated further (Figure 6). After transfec-
tions of HEK293T cells with O. dioica intron constructs (Table
S1), canonical introns were spliced out and nearly canonical
ones were not. This suggests for O. dioica too an adaptation of
the splicing machinery. We conclude that, within tunicates, lar-
vaceans at large have developed specific abilities to process
new intron types.
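The three intron categories compared across species (canonical GT/AG, nearly canonical G(non-T)/AG, and other non-canonical introns) follow directly from the terminal dinucleotides, so their frequencies can be tallied as in the sketch below. It is a minimal illustration, assuming that intron sequences have already been delimited and oriented on the coding strand; the function names and the toy sequences are hypothetical and do not come from the study.

```python
from collections import Counter
from typing import Dict, Iterable

CATEGORIES = ("canonical", "nearly canonical", "non-canonical")


def splice_site_label(intron: str) -> str:
    """Return the terminal dinucleotides in the paper's notation, e.g. 'AG/AC'."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to have two terminal dinucleotides")
    return f"{seq[:2]}/{seq[-2:]}"


def classify_intron(intron: str) -> str:
    """Assign one of the three categories from the intron ends."""
    donor, acceptor = splice_site_label(intron).split("/")
    if acceptor == "AG" and donor == "GT":
        return "canonical"
    if acceptor == "AG" and donor[0] == "G":
        return "nearly canonical"  # G(non-T)/AG
    return "non-canonical"


def intron_type_frequencies(introns: Iterable[str]) -> Dict[str, float]:
    """Return the relative frequency of each category (zeros if no introns)."""
    counts = Counter(classify_intron(i) for i in introns)
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in CATEGORIES}


# Toy introns (made up): one GT/AG, one GC/AG, one AG/AC, one AG/AT.
toy_introns = ["GTAAGTTTTCAG", "GCAAGTTTTCAG", "AGTTGACCAAAC", "AGTTGACCAAAT"]
print(intron_type_frequencies(toy_introns))
```

Applied per species, such a tally yields the proportions that a pie chart per genome would display.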
DISCUSSION
Our results reveal that invasions of host genes by transposons
produced numerous non-canonical introns, whose removal ne-
cessitates an adapted splicing machinery. Larvacean tunicates
have experienced a particularly rapid evolution, reflected by
long branches in phylogenetic trees and important genomic
changes [2, 3, 6]. Among those, highly frequent changes of intron
positions were revealed when analyzing the genome of O. dioica.
Massive losses of old introns were interpreted as the outcome of
mRNA-mediated mechanisms (reverse transcription followed by
homologous or illegitimate genome integration) [2]. For the gain
of new introns, the features of forty introns were found compat-
ible with an origin via at least two mechanisms, reverse splicing
and transposon insertion [2]. The respective contributions of
these mechanisms cannot be assessed, because intron sequence evolution
is too rapid to preserve similarities with intron donor
elements.
Large effective population size was supported for O. dioica by
genome data [2], making genetic changes more likely to be
adaptive. Tuning down the selectivity of the spliceosome for pro-
cessing non-canonical introns is in principle not without risk, as it
could also augment the frequency of incorrect splicing. An evo-
lution of the spliceosome selectivity must have conferred selec-
tive advantages able to compensate for such risks. A spliceo-
some that is less dependent on splice site identity could have
favored the ‘‘neutralization’’ of a broad variety of intragenic inser-
tions via their removal from pre-mRNAs. Because the spliceo-
some selectivity here appears little dependent on the intron
ends, accurate splicing may rely on other intron-associated fea-
tures, such as other sequence motifs, epigenetic marks, or RNA
structure. Our analysis has not recognized new splicing signals in
introns or in the flanking exons, and we found a surprisingly well-
conserved U1 snRNA, including in its region that matches the 5′
Figure 7. Forces That May Drive the
Replacement of Canonical by Non-Canoni-
cal Introns
A genome segment containing four genes is fol-
lowed along the process, with exons in blue color.
Transposable elements insertions are represented
with pairs of opposite arrows, in gray for intergenic
insertions, in black for disruptive insertions into
exons, and in red for insertions that have been
intronized.
splice site in other organisms. This conservation is probably
required for splicing the few GT/AG introns left in the
F. borealis genome. It does not exclude a role of U1 snRNA in
the splicing of non-canonical introns, perhaps using an alterna-
tive base pair register [19]. In human cells that possess well-
conserved U1 snRNA, non-canonical introns of F. borealis
were not spliced. We therefore assume that, if U1 snRNA recog-
nizes non-canonical introns, this occurs with the assistance of
other factors that are specific to F. borealis.
The homology of MITEs and introns is probably eroded with
time, as shown by the degeneration of their palindromic struc-
tures, but this does not mask the strong relationship existing be-
tween the F. borealis complements of MITEs and of introns. As
indicated in Figure 6, the intron density is relatively low in larva-
ceans and very low in fritillarids whose genes received a small
number of non-canonical introns. There, only a very tiny fraction
(around 1%) of ancient introns was retained. In this context of
massive intron loss, MITEs thus became the essential supplier
of introns, perhaps the only possible one. Whatever the reasons
for the settlement of non-canonical introns, its historical dy-
namics is an exciting question. Before non-canonical introns
could be spliced out, MITE insertions into genes must have
been limited in number. The risk of an excessive mutation load
may have fostered the evolution of new intron recognition mech-
anisms, which, when in place, have allowed a secondary coloni-
zation of their coding sequences (Figure 7).
STAR★METHODS
Detailed methods are provided in the online version of this paper
and include the following:
• KEY RESOURCES TABLE
• LEAD CONTACT AND MATERIALS AVAILABILITY
• EXPERIMENTAL MODEL AND SUBJECT DETAILS
• METHOD DETAILS
  ◦ Genome assembly
  ◦ Transcriptome assembly and intron annotation
  ◦ Lariat-seq
  ◦ Splicing inhibitor assay
  ◦ Cap-dependent RNA-seq
  ◦ RNA-immunoprecipitation
  ◦ Splicing assays in mammalian cells
  ◦ Homology searches
• QUANTIFICATION AND STATISTICAL ANALYSIS
  ◦ RNA-immunoprecipitation
  ◦ Splicing inhibitor assay
• DATA AND CODE AVAILABILITY
SUPPLEMENTAL INFORMATION
Supplemental Information can be found online at https://doi.org/10.1016/j.cub.2019.07.092.
ACKNOWLEDGMENTS
We thank Anne Aasjord and Magnus Reeve for excellent technical assistance
in the Sars Centre Oikopleura facility, as well as Don Deibel (Memorial Univer-
sity of Newfoundland), Fabien Lombard and Gaby Gorsky (Observatoire Oce-
anologique de Villefranche sur Mer), and Linda Holland (Scripps Institution of
Oceanography) for organizing the collection of species. We thank Martin
Chourrout for help in statistical analysis and anonymous reviewers for their
helpful suggestions. We thank the Genecore facility of EMBL (Heidelberg) for
most Illumina sequencing. This project has been funded by two major grants
of the Research Council of Norway, of which D.C. is the PI: 250005 accelerated
evolution in chordates and the origin of larvaceans and 234817 Sars Interna-
tional Centre for Marine Molecular Biology Research, 2013–2022.
AUTHOR CONTRIBUTIONS
D.C. conceived the study, D.C. and S.H. designed the experiments, S.H. and
B.C.S. performed the experiments, S.S. assembled the genomes and the tran-
scriptomes, D.C. performed most genomic analysis, S.H. annotated spliceo-
somal RNA and protein genes, and D.C. and S.H. wrote the manuscript.
DECLARATION OF INTERESTS
The authors declare no competing interests.
Received: March 20, 2019
Revised: June 27, 2019
Accepted: July 31, 2019
Published: September 19, 2019
REFERENCES
1. Raible, F., Tessmar-Raible, K., Osoegawa, K., Wincker, P., Jubin, C.,
Balavoine, G., Ferrier, D., Benes, V., de Jong, P., Weissenbach, J., et al.
(2005). Vertebrate-type intron-rich genes in the marine annelid
Platynereis dumerilii. Science 310, 1325–1326.
2. Denoeud, F., Henriet, S., Mungpakdee, S., Aury, J.M., Da Silva, C.,
Brinkmann, H., Mikhaleva, J., Olsen, L.C., Jubin, C., Canestro, C., et al.
(2010). Plasticity of animal genome architecture unmasked by rapid evolu-
tion of a pelagic tunicate. Science 330, 1381–1385.
3. Huff, J.T., Zilberman, D., and Roy, S.W. (2016). Mechanism for DNA trans-
posons to generate introns on genomic scales. Nature 538, 533–536.
4. Canaday, J., Tessier, L.H., Imbault, P., and Paulus, F. (2001). Analysis of
Euglena gracilis Alpha-, Beta- and Gamma-Tubulin Genes: Introns and
Pre-mRNA Maturation. Mol. Genet. Genomics 265, 153–160.
5. Gumińska, N., Płecha, M., Zakryś, B., and Milanowski, R. (2018). Order of
Removal of Conventional and Nonconventional Introns from Nuclear
Transcripts of Euglena gracilis. PLoS Genet. 14, e1007761.
6. Naville, M., Henriet, S., Warren, I., Sumic, S., Reeve, M., Volff, J.N., and
Chourrout, D. (2019). Massive changes of genome size driven by expan-
sions of non-autonomous transposable elements. Curr. Biol. 29, 1161–
1168.e6.
7. Tellier, M., Bouuaert, C.C., and Chalmers, R. (2015). Mariner and the ITm
superfamily of transposons. Microbiol. Spectr. 3, MDNA3-0033-2014.
8. Suzuki, H., Zuo, Y., Wang, J., Zhang, M.Q., Malhotra, A., and Mayeda, A.
(2006). Characterization of RNase R-digested cellular RNA source that
consists of lariat and circular RNAs from pre-mRNA splicing. Nucleic
Acids Res. 34, e63.
9. Valles, Y., Halanych, K.M., and Boore, J.L. (2008). Group II introns break
new boundaries: presence in a bilaterian’s genome. PLoS ONE 3, e1488.
10. Cretu, C., Agrawal, A.A., Cook, A., Will, C.L., Fekkes, P., Smith, P.G.,
Luhrmann, R., Larsen, N., Buonamici, S., and Pena, V. (2018). Structural
basis of splicing modulation by antitumor macrolide compounds. Mol.
Cell 70, 265–273.e8.
11. Kotake, Y., Sagane, K., Owa, T., Mimori-Kiyosue, Y., Shimizu, H., Uesugi,
M., Ishihama, Y., Iwata, M., and Mizui, Y. (2007). Splicing factor SF3b as a
target of the antitumor natural product pladienolide. Nat. Chem. Biol. 3,
570–575.
12. Wahl, M.C., Will, C.L., and Luhrmann, R. (2009). The spliceosome: design
principles of a dynamic RNP machine. Cell 136, 701–718.
13. Marz, M., Kirsten, T., and Stadler, P.F. (2008). Evolution of spliceosomal
snRNA genes in metazoan animals. J. Mol. Evol. 67, 594–607.
14. Zhuang, Y., and Weiner, A.M. (1986). A compensatory base change in U1
snRNA suppresses a 5′ splice site mutation. Cell 46, 827–835.
15. McGrail, J.C., and O’Keefe, R.T. (2008). The U1, U2 and U5 snRNAs cross-link
to the 5′ exon during yeast pre-mRNA splicing. Nucleic Acids Res. 36,
814–825.
16. Pandit, S., Zhou, Y., Shiue, L., Coutinho-Mansfield, G., Li, H., Qiu, J.,
Huang, J., Yeo, G.W., Ares, M., Jr., and Fu, X.D. (2013). Genome-wide
analysis reveals SR protein cooperation and competition in regulated
splicing. Mol. Cell 50, 223–235.
17. Shepard, P.J., and Hertel, K.J. (2009). The SR protein family. Genome Biol.
10, 242.
18. Tarn, W.Y., and Steitz, J.A. (1994). SR proteins can compensate for the
loss of U1 snRNP functions in vitro. Genes Dev. 8, 2704–2717.
19. Roca, X., Krainer, A.R., and Eperon, I.C. (2013). Pick one, but be quick: 5′
splice sites and the problems of too many choices. Genes Dev. 27,
129–144.
20. Brozovic, M., Dantec, C., Dardaillon, J., Dauga, D., Faure, E., Gineste, M.,
Louis, A., Naville, M., Nitta, K.R., Piette, J., et al. (2018). ANISEED 2017:
extending the integrated ascidian database to the exploration and evolu-
tionary comparison of genome-scale datasets. Nucleic Acids Res. 46 (D1),
D718–D725.
21. Finn, R.D., Coggill, P., Eberhardt, R.Y., Eddy, S.R., Mistry, J., Mitchell,
A.L., Potter, S.C., Punta, M., Qureshi, M., Sangrador-Vegas, A., et al.
(2016). The Pfam protein families database: towards a more sustainable
future. Nucleic Acids Res. 44 (D1), D279–D285.
22. Nawrocki, E.P., Burge, S.W., Bateman, A., Daub, J., Eberhardt, R.Y.,
Eddy, S.R., Floden, E.W., Gardner, P.P., Jones, T.A., Tate, J., and Finn,
R.D. (2015). Rfam 12.0: updates to the RNA families database. Nucleic
Acids Res. 43, D130–D137.
23. Kumar, S., Jones, M., Koutsovoulos, G., Clarke, M., and Blaxter, M. (2013).
Blobology: exploring raw genome data for contaminants, symbionts and
parasites using taxon-annotated GC-coverage plots. Front. Genet. 4, 237.
24. Bolger, A.M., Lohse, M., and Usadel, B. (2014). Trimmomatic: a flexible
trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120.
25. Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov,
A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D., et al. (2012).
SPAdes: a new genome assembly algorithm and its applications to sin-
gle-cell sequencing. J. Comput. Biol. 19, 455–477.
26. Rice, P., Longden, I., and Bleasby, A. (2000). EMBOSS: the European mo-
lecular biology open software suite. Trends Genet. 16, 276–277.
27. Lorenz, R., Bernhart, S.H., Honer Zu Siederdissen, C., Tafer, H., Flamm,
C., Stadler, P.F., and Hofacker, I.L. (2011). ViennaRNA package 2.0.
Algorithms Mol. Biol. 6, 26.
28. Dereeper, A., Guignon, V., Blanc, G., Audic, S., Buffet, S., Chevenet, F.,
Dufayard, J.F., Guindon, S., Lefort, V., Lescot, M., et al. (2008).
Phylogeny.fr: robust phylogenetic analysis for the non-specialist.
Nucleic Acids Res. 36, W465–W469.
29. Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S.,
Batut, P., Chaisson, M., and Gingeras, T.R. (2013). STAR: ultrafast univer-
sal RNA-seq aligner. Bioinformatics 29, 15–21.
30. Haas, B.J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P.D.,
Bowden, J., Couger, M.B., Eccles, D., Li, B., Lieber, M., et al. (2013). De
novo transcript sequence reconstruction from RNA-seq using the Trinity
platform for reference generation and analysis. Nat. Protoc. 8, 1494–1512.
31. Kim, D., Langmead, B., and Salzberg, S.L. (2015). HISAT: a fast spliced
aligner with low memory requirements. Nat. Methods 12, 357–360.
32. Nawrocki, E.P., and Eddy, S.R. (2013). Infernal 1.1: 100-fold faster RNA
homology searches. Bioinformatics 29, 2933–2935.
33. Bouquet, J.M., Spriet, E., Troedsson, C., Ottera, H., Chourrout, D., and
Thompson, E.M. (2009). Culture optimization for the emergent zooplank-
tonic model organism Oikopleura dioica. J. Plankton Res. 31, 359–370.
34. Lamble, S., Batty, E., Attar, M., Buck, D., Bowden, R., Lunter, G., Crook,
D., El-Fahmawi, B., and Piazza, P. (2013). Improved workflows for high
throughput library preparation using the transposome-based Nextera sys-
tem. BMC Biotechnol. 13, 104.
35. Brena, C., Cima, F., and Burighel, P. (2003). Alimentary tract of
Kowalevskiidae (Appendicularia, Tunicata) and evolutionary implications.
J. Morphol. 258, 225–238.
36. Henriet, S., Sumic, S., Doufoundou-Guilengui, C., Jensen, M.F.,
Grandmougin, C., Fal, K., Thompson, E., Volff, J.N., and Chourrout, D.
(2015). Embryonic expression of endogenous retroviral RNAs in somatic
tissues adjacent to the Oikopleura germline. Nucleic Acids Res. 43,
3701–3711.
37. Lu, Z., Guan, X., Schmidt, C.A., and Matera, A.G. (2014). RIP-seq analysis
of eukaryotic Sm proteins identifies three major categories of Sm-contain-
ing ribonucleoproteins. Genome Biol. 15, R7.
Current Biology 29, 3193–3199, October 7, 2019 3199
STAR+METHODS
KEY RESOURCES TABLE
REAGENT or RESOURCE SOURCE IDENTIFIER
Antibodies
Anti-2,2,7-Trimethylguanosine Antibody, clone K121 Merck MABE302; RRID:AB_213109
Anti-Smith Antigen antibody [Y12] AbCam ab3138; RRID:AB_303543
Biological Samples
Fritillaria borealis Marine Biological Station, Espegrend and Rosslandspollen, Rossland N/A
Oikopleura dioica Sars Centre, University of Bergen N/A
Fritillaria pellucida Linda Holland, La Jolla N/A
Appendicularia sicula Marine Biological Station, Espegrend N/A
Chemicals, Peptides, and Recombinant Proteins
Trizol reagent Invitrogen Cat#15596026
RNase R Epicenter Cat#RNR07250
Terminator 5′-Phosphate-Dependent Exonuclease Epicenter Cat#TER51020
Tobacco Acid Pyrophosphatase Epicenter Cat#T81050
Pladienolide B Santa Cruz Biotechnology Cat# sc-391691
Isoginkgetin Sigma-Aldrich Cat#416154
Critical Commercial Assays
REPLI-g Single Cell kit QIAGEN Cat#150343
Nextera DNA Library Prep Kit Illumina Cat#FC-121-1030
MiSeq Reagent Kit v3 (600-cycle) Illumina Cat#MS-102-3003
TruSeq Stranded Total RNA library prep Illumina Cat#RS-122-2201
Nucleospin RNA XS Macherey-Nagel Cat#740902.10
SMART-seq v4 Ultra low input RNA kit Takara Cat#634888
Deposited Data
Raw and analyzed data This paper
Fritillaria borealis genome [6] GenBank: SDII00000000
Oikopleura dioica genome [2] http://www.genoscope.cns.fr/externe/Download/Projets/Projet_HG/data/assembly/
Other larvacean genomes [6] GenBank: SCLD01000000 to SCLH01000000
Ascidian genomes [20] https://www.aniseed.cnrs.fr/
RFAM [22] http://rfam.xfam.org/
PFAM [21] https://pfam.xfam.org/
Experimental Models: Cell Lines
HEK293T ATCC Cat#CRL-3216
Software and Algorithms
Blobology [23] https://github.com/blaxterlab/blobology
Trimmomatic [24] http://www.usadellab.org/cms/index.php?page=trimmomatic
SPAdes [25] https://github.com/ablab/spades
EMBOSS package [26] http://emboss.sourceforge.net/
BLAST package https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download
The viennaRNA package [27] https://www.tbi.univie.ac.at/RNA/#download
MUSCLE [28] http://www.phylogeny.fr/simple_phylogeny.cgi
Gblocks [28] http://www.phylogeny.fr/simple_phylogeny.cgi
PhyML [28] http://www.phylogeny.fr/simple_phylogeny.cgi
STAR [29] https://github.com/alexdobin/STAR
Trinity [30] https://github.com/trinityrnaseq/trinityrnaseq/releases
HISAT2 [31] https://ccb.jhu.edu/software/hisat2/index.shtml
Infernal [32] http://eddylab.org/infernal/
LEAD CONTACT AND MATERIALS AVAILABILITY
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Daniel
Chourrout ([email protected]). This study did not generate new unique reagents.
EXPERIMENTAL MODEL AND SUBJECT DETAILS
Specimens of Oikopleura dioica, Fritillaria borealis and Appendicularia sicula were collected in the fjords around Bergen. Fritillaria
pellucida specimens were collected near La Jolla (CA, USA). O. dioica was collected and kept in culture in the lab according to
described procedures [33]. F. borealis specimens were collected between February and April by scooping seawater 2 to 15 m below
the surface, then grown in the lab at 14°C. Animals become clearly visible when their gonads start to develop. At this stage, 150-300
individuals are transferred to an 18 L beaker containing UV-treated, filtered seawater supplemented with an algal diet [33]. Spawning
usually takes place after two days. Three days after spawning, the culture is diluted two-fold, and animals become visible again
after one to two days.
METHOD DETAILS
Genome assembly
Whole-genome sequencing and assembly for six larvacean species, including F. borealis, were recently reported [6]. For sequencing
the F. pellucida genome, DNA from a pool of six animals was prepared with a modified tagmentation procedure [34] based on the
Nextera kit (Illumina), and sequenced on MiSeq. This approach did not succeed with A. sicula DNA, possibly due to contaminants
from the blind gut [35]. To address this issue, we amplified the genome of a single animal with the REPLI-g Single Cell Kit (QIAGEN),
prior to tagmentation and sequencing. The 300-nts PE reads were trimmed with Trimmomatic [24]. All reads whose length was at
least 36 bp were subsequently assembled with the SPAdes genome assembler [25]. The assemblies were then checked for contamination
using Blobology [23]. The sizes of the resulting assemblies were 174 Mb for F. pellucida and 172 Mb for A. sicula, with scaffold
N50 values of 855 bp and 2438 bp, respectively.
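For readers less familiar with the scaffold N50 statistic quoted above, the following minimal Python sketch may help. It assumes the usual definition (the length of the scaffold at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total assembly size). The function name and the toy scaffold lengths are illustrative only and are not taken from the F. pellucida or A. sicula assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp).

    Assumed definition: sort scaffolds from longest to shortest and report
    the length of the scaffold at which the running total first reaches
    half of the summed assembly size.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with hypothetical scaffold lengths (12 kb in total).
toy_scaffolds = [5000, 3000, 2000, 1000, 500, 500]
print(n50(toy_scaffolds))
```

A low N50 relative to typical gene length, as reported for these two assemblies, indicates that most genes will be split across scaffolds, which is consistent with restricting intron annotation to short, highly conserved regions.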
Transcriptome assembly and intron annotation
We used the Trizol reagent (Thermo) to extract RNA from pools of F. borealis at different developmental stages. Transcriptomes from
juveniles and adults were prepared and sequenced on MiSeq (150 nts PE) at Eurofins Genomics (Ebersberg, Germany), resulting in
6267518 raw reads. Transcriptomes from embryos and larvae were prepared with SMART-seq v3 Ultra low input RNA kit (Takara) and
sequenced on MiSeq (300 nts PE), resulting in 25787032 raw reads. Reads were trimmed using Trimmomatic and checked with
FASTQC. Reads were then assembled into transcripts with Trinity software [30], using the default parameters.
The mass annotation of F. borealis introns is based on alignments with the two transcriptomes. Most transcripts are present in the
adult transcriptome, which was therefore used preferentially. Accurate determination of intron limits is complicated by the frequent
presence of one or a few identical nucleotides at both ends of the alignment gap (in most cases because of the frequent exonic TAC/TAT
triplet preceding the intron). We resolved this issue after observing that, for 99.2% of unambiguous gaps (no repeated nucleotides),
the second-last nucleotide of the intron is an adenosine. Using this information, introns could be precisely annotated. Prior to
determining intron limits, intron orientation was determined using GETORF (http://www.bioinformatics.nl/cgi-bin/emboss/getorf) [26]
applied to transcript sequences. The sizes of the longest ORF for each orientation were measured and compared. Transcript orientation was
considered reliable when the longest ORF in one orientation was at least twice as long as the longest ORF in the other. When this
rule was not satisfied, the intron was not considered further. The number of orientation errors was shown to be minimal by checking
the GETORF-based orientation for highly conserved genes using BLASTX against mouse RefSeq proteins (NCBI). The majority of introns
oriented in this way had an adenosine as the second-last nucleotide. In a few cases, there were two or more adenosines as candidates
for the second-last nucleotide, preventing a reliable annotation, and those introns were excluded. A collection of
5645 introns was considered annotated with a sufficient level of confidence, from an initial set of 19214 gaps in the
transcriptome-to-genome alignment. Further experiments detecting lariats of non-canonical introns confirmed that the annotation was correct.
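The boundary rule described above can be illustrated with a short Python sketch. It is not the authors' code: it assumes 0-based, half-open gap coordinates on a genomic sequence given in transcript orientation, and the function names and toy sequences are hypothetical. The sketch enumerates every gap placement that yields the same spliced sequence, then keeps the placement whose second-last intron nucleotide is an adenosine, returning nothing when zero or several placements qualify.

```python
def equivalent_placements(genome, start, end):
    """List all intron placements [start, end) giving the same spliced sequence."""
    placements = [(start, end)]
    s, e = start, end
    # Shifting left is neutral while the last exon base equals the last intron base.
    while s > 0 and genome[s - 1] == genome[e - 1]:
        s, e = s - 1, e - 1
        placements.append((s, e))
    s, e = start, end
    # Shifting right is neutral while the first intron base equals the
    # first base of the downstream exon.
    while e < len(genome) and genome[s] == genome[e]:
        s, e = s + 1, e + 1
        placements.append((s, e))
    return sorted(placements)


def resolve_intron(genome, start, end):
    """Apply the second-last-adenosine rule; return None if ambiguous."""
    candidates = [
        (s, e)
        for s, e in equivalent_placements(genome, start, end)
        if e - s >= 2 and genome[e - 2] == "A"
    ]
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical example: exon ending in TAC, AG...AC intron, downstream exon.
toy = "GGCTAC" + "AGTTTGGCCTTTAC" + "GGTCC"
print(resolve_intron(toy, 4, 18))        # aligner placed the gap 2 nt too far left

# Hypothetical ambiguous case: several adenosines are candidates.
ambiguous = "GGCTAC" + "AGTTTGGCCAAAAC" + "GGTCC"
print(resolve_intron(ambiguous, 6, 20))
```

In the first toy sequence the gap can sit at four equivalent positions, and only one of them leaves an adenosine at the second-last intron position. The second sequence mimics the case in which such introns were discarded.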
Due to the absence of a transcriptome, introns in F. pellucida, A. sicula and O. longicauda were restricted to a few hundred elements
from genes highly conserved with vertebrates (BLASTX). Gaps were determined based on the best BLASTX alignment (ambiguous
cases were not considered), while the sequence orientation was not problematic. The second last intron base rule was also applied.
For counting the incidence of the main intron types in all other genomes (human, amphioxus B. floridae, ascidian Ciona intestinalis,
sea urchin S. purpuratus, O. dioica, O. albicans and O. vanhoeffeni), the same annotation method was used after aligning reference
transcriptome on reference genome sequences. It used GETORF for sequence orientation and the second last base rule. GT-AG and
nearly-canonical G(nonT)-AG introns were identified first, the rest of introns being considered as non-canonical (Figure 5A).
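The three-way classification used for counting intron types can be sketched as follows. This is a minimal illustration that assumes intron sequences are already oriented and trimmed to their exact limits; the function name, the category labels and the toy sequences are hypothetical.

```python
from collections import Counter


def classify_intron(intron):
    """Classify an oriented intron by its terminal dinucleotides.

    GT...AG        -> "canonical"
    G(nonT)...AG   -> "nearly-canonical"
    anything else  -> "non-canonical"
    """
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to carry two terminal dinucleotides")
    five_prime, three_prime = seq[:2], seq[-2:]
    if three_prime == "AG" and five_prime == "GT":
        return "canonical"
    if three_prime == "AG" and five_prime[0] == "G":
        return "nearly-canonical"
    return "non-canonical"


# Hypothetical toy introns, one per category.
toy_introns = ["GTAAGTTTTCAG", "GCAAGTTTTCAG", "AGTTTGGCCTTTAC"]
counts = Counter(classify_intron(seq) for seq in toy_introns)
print(counts)
```

Testing the canonical case first mirrors the order stated in the text, where GT-AG and G(nonT)-AG introns are identified before the remainder is assigned to the non-canonical class.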
Intron densities were compared among distinct genomes using a sample of 109 coding sequences from highly conserved genes
present in the genome sequence and represented in transcriptomes. Altogether, approximately 126 kb (42 kb of AA sequences) of
these coding sequences could be aligned among species, and BLASTN alignments of genomes and transcripts allowed us to count the
number of introns for each of them. An incomplete dataset (40 of the 109 genes) from A. sicula was added to check whether the
trend for massive loss of introns may be a feature of fritillarids. For this species, the genome sequencing coverage is low (approximately 4X)
and we have no transcriptome data. Therefore, introns were counted based on interruptions of TBLASTN alignments between protein
sequences of human and F. borealis and the A. sicula genome data.
Lariat-seq
Trizol-extracted RNA from F. borealis or O. dioica was treated with DNase (TURBO DNA-free kit, Ambion), dissolved in ddH2O (final
concentration 0.45 mg.ml-1 in 5 ml), denatured for 5 min at 65°C and placed on ice. Linear RNA was degraded for 4 h at 37°C with 10 U RNase
R (Epicenter) in the manufacturer’s buffer, then 2.5 mM EDTA was added. We checked the efficiency of RNase R treatment by
comparing the electrophoretic profile of the sample against an untreated control. After buffer exchange, RNA was annealed to random
10mers (Ambion) and prepared for sequencing with the TruSeq Stranded Total RNA kit (Illumina), omitting the rRNA depletion step. During
library preparation, we used size selection on AMPure beads (Beckman Coulter) to remove free adapters. This step may have
depleted short lariat cDNA and could account for some over-representation of longer, non-GTAG lariats (> 100 nts) in the O. dioica dataset.
Lariat cDNA libraries were run together on MiSeq (250 nts PE), producing 11104078 and 9969578 filtered paired reads for the
F. borealis and the O. dioica sample, respectively. Reads were aligned with BLAST against a database of genes with annotated introns.
We selected reads that yielded two sub-alignments, and we mapped the branch points by examining the transition between the alignments.
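One way to read the branch point off a read that yields two sub-alignments is sketched below. This is an illustration under stated assumptions and not the pipeline used in the study: coordinates are 0-based and half-open, reads are taken in the sense orientation of the gene, and a lariat-junction read is assumed to run from the sequence just upstream of the branch point directly into the first nucleotides of the intron. The `SubAlignment` tuple, the function name and the coordinates are hypothetical.

```python
from typing import NamedTuple, Optional


class SubAlignment(NamedTuple):
    read_start: int  # first aligned read position (0-based)
    read_end: int    # one past the last aligned read position
    ref_start: int   # first aligned reference position (0-based)
    ref_end: int     # one past the last aligned reference position


def branch_point(a: SubAlignment, b: SubAlignment,
                 intron_start: int, intron_end: int,
                 tolerance: int = 0) -> Optional[int]:
    """Return the inferred branch point position, or None if the read is
    not compatible with a lariat junction of this intron.

    Assumption: the read segment that comes first ends on the branch point,
    and the segment that follows starts at the 5' end of the intron.
    """
    first, second = sorted((a, b), key=lambda aln: aln.read_start)
    starts_at_5ss = abs(second.ref_start - intron_start) <= tolerance
    ends_inside_intron = intron_start < first.ref_end <= intron_end
    if not (starts_at_5ss and ends_inside_intron):
        return None
    return first.ref_end - 1


# Hypothetical intron spanning reference positions [100, 160).
upstream_of_bp = SubAlignment(0, 35, 115, 150)
intron_head = SubAlignment(35, 60, 100, 125)
bp = branch_point(upstream_of_bp, intron_head, 100, 160)
print(bp, "distance to intron end:", 160 - 1 - bp)
```

Reverse transcriptase frequently misreads or skips the branch nucleotide, so in practice a small `tolerance` and a check of the nucleotide identity at the inferred position would probably be needed.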
Splicing inhibitor assay
Juvenile F. borealis were placed in 1 mL plastic dishes coated with 1% agarose, in filtered artificial sea water (Red Sea,
30.1-30.5 g.L-1 salinity) supplemented with 10 µM Pladienolide B in DMSO (PlaB, Santa Cruz Biotech), or 0.6% DMSO. After 3 h of
incubation at 10°C, animals were transferred to a collection tube and total RNA was extracted with Nucleospin RNA XS (Macherey-Nagel).
We prepared cDNA for RT-PCR assays as previously described [36], and we used 100 pg of total RNA to prepare Illumina
libraries (SMART-seq v4 Ultra Low Input RNA kit, Clontech). Libraries were run together on MiSeq (300 nts PE), producing
10463635 and 11843234 filtered paired reads for the PlaB sample and the DMSO control, respectively. Reads were aligned to
the F. borealis genome assembly with HISAT2 [31], using a collection of intron positions and disabling the penalty against
non-canonical splice sites, resulting in an alignment rate of 81.83% for the PlaB sample and 82.01% for the control. Intron retention
rates were scored by examining a subset of reads encompassing a collection of exon-intron junctions (see “Quantification and
statistical analysis”).
Cap-dependent RNA-seq
RNA (final concentration 5 mg.ml-1 in 100 ml) was either digested for 1 h at 30°C with 5 U of Terminator 5′-Phosphate-Dependent
Exonuclease (Epicenter) or incubated without enzyme for the input control. Reactions were stopped with 5 mM EDTA, and RNA
was purified with Phenol/Chloroform extraction. We treated one sample with Tobacco Acid Pyrophosphatase (TAP, Epicenter) in
order to increase the ligation efficiency of RNA-seq adapters to the 5′ end, and compared it to the non-treated control. RNA fragments
in the 100-300 nts range were gel-purified and prepared for stranded RNA-seq at Ocean Ridge Biosciences (Deerfield Beach, FL,
USA). Trinity software was used to assemble the reads into transcripts. Subsequently, Infernal cmscan was used to scan the
RFAM database and annotate the transcripts accordingly [22, 32]. To search for RNA involved in splicing, we first ranked transcripts
by their abundance in the input sample, in the TAP-treated sample and in the non-treated sample. Canonical U2-type snRNAs
remained present among the 15 most abundant RNA in all samples, and TAP treatment generally increased their read numbers
(Figure 4B). U1 snRNA identified with RFAM were checked for the conservation of the prominent motif ATACTTACCTG, located in the
first 11 nucleotides of the sequence. For other RNAs, neither snRNA-like secondary structure motifs nor complementarity to
non-canonical 5′ss, such as TAC(T)AG (Figure 1B), were found.
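The check of the U1 5′ motif can be expressed as a simple position-by-position comparison. The sketch below is illustrative: it assumes candidate sequences are given 5′ to 3′ as DNA or RNA strings, and the function name and the two candidate sequences are made up for the example, not taken from the F. borealis data.

```python
U1_5P_MOTIF = "ATACTTACCTG"  # motif quoted in the text, first 11 nucleotides of U1


def motif_mismatches(candidate, motif=U1_5P_MOTIF):
    """Compare the first len(motif) nucleotides of a candidate with the motif.

    Returns a list of (1-based position, expected base, observed base);
    an empty list means the motif is fully conserved.
    """
    head = candidate.upper().replace("U", "T")[:len(motif)]
    if len(head) < len(motif):
        raise ValueError("candidate is shorter than the motif")
    return [
        (position + 1, expected, observed)
        for position, (expected, observed) in enumerate(zip(motif, head))
        if expected != observed
    ]


# Hypothetical candidates: one conserved (RNA alphabet), one with a substitution.
print(motif_mismatches("AUACUUACCUGGCAGG"))
print(motif_mismatches("ATACTTACCAGGCAGG"))
```

Reporting the mismatching positions, rather than a simple yes or no, makes it easier to judge whether a variant could still base-pair with a given 5′ splice site.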
RNA-immunoprecipitation
Between 150 and 200 F. borealis individuals were collected at the onset of gonad maturation, washed in artificial seawater and pelleted
in Eppendorf tubes. Seawater was removed and animals were homogenized in RIP lysis buffer (DTT 2 mM, Ribonucleoside Vanadyl
Complex 5 mM, NaCl 0.1 M, MgCl2 5 mM, glycerol 10%, HEPES 50 mM pH 7.5, 0.1% Triton X-100, PMSF 1 mM, Complete-EDTA free
(Roche)). Sample droplets were snap-frozen in liquid N2 and ground with mortar and pestle, and the material was further homogenized
with ten passages through a 25G needle, then diluted with IPP buffer (NaCl 0.15 M, HEPES 20 mM pH 7.5, Triton X-100 0.05%, MgCl2
1.5 mM). An amount corresponding to 5 µg of RNA was incubated for 2 h at 4°C in the presence of 10 µg yeast tRNA and 7.5 µg of either
rabbit serum, anti-TMG IgG (K121, MABE302, Merck) or anti-Sm IgG (ab3138, AbCam), in a final volume of 0.75 mL. After binding, 75 µl
of Dynabeads-protein G (Thermo) pre-washed with IPP buffer and yeast tRNA were added and the samples were further incubated
for 1 h at 4°C. Beads were washed three times with 0.5 mL ice-cold IPP buffer, transferred to a new tube and washed once more. Proteins
were digested for 45 min at 37°C in 0.4 mL of PK buffer (Tris-Cl 10 mM pH 7.5, EDTA 10 mM, NaCl 0.1 M, SDS 0.5%, 4 mg proteinase
K), and RNA was recovered with Phenol/Chloroform extraction and treated with DNase. We prepared Illumina libraries using the
TruSeq Stranded Total RNA kit as previously mentioned, and sequenced the cDNA on MiSeq (300 nts PE). The reads were mapped
onto the genome assembly using STAR software [29] with default parameters, collecting all alignments with 20 or more matched
bases (--outFilterMatchNmin 20). The numbers of uniquely mapped reads were 1693034, 3279711 and 2452203 for the
control, anti-Sm and anti-TMG experiments, respectively. Library normalization, read count and background analysis were performed as
described by Lu et al. [37] (see “Quantification and statistical analysis”).
Splicing assays in mammalian cells
Gene fragments consisting of full-length introns flanked on both sides by 50 bp of exon sequence were PCR-amplified from
O. dioica and F. borealis DNA and cloned between the restriction sites SacI and XmaI in the mammalian expression vector
pEGFP-N1 (Clontech). To test sequence requirements for splicing, we changed the wild-type splice sites to mutant sequences using
PCR mutagenesis. We delivered the constructs with polyethylenimine to adherent HEK293T cells at 40% confluency. Cells were
harvested after 3 days of growth at 37°C in DMEM. RNA was extracted with Trizol, treated with DNase, and amplified with RT-PCR as
previously described [36]. After gel electrophoresis, PCR products corresponding to spliced RNA were cloned and sequenced.
Homology searches
We looked for U2- and U12-type snRNA in the fritillarid genome assemblies with BLAST using parameters for short queries. To build
our set of query sequences, we retrieved U2- and U12-type snRNA from the RFAM library (http://rfam.xfam.org/), and we
supplemented them with snRNA candidates recovered with BLAST searches on tunicate genomes. Those include newly sequenced
larvacean genomes [6] and ascidian genomes available online [20]. We confirmed snRNA identity using RNA structure prediction,
RNA multiple alignment [27], and RFAM-based annotation.
Based on proteomic studies of the snRNPs [12], we checked whether the major protein components of the spliceosome were present in
F. borealis. We performed TBLASTN searches against the F. borealis transcriptome, using protein queries from human,
D. melanogaster, C. elegans and S. cerevisiae. In most cases, the results revealed a single homolog with significantly higher scores
than other hits; in a few exceptions, a duplicate was also found. Some ambiguous cases, corresponding to either proteins with
higher divergence in F. borealis or gene families, were resolved with multiple sequence alignment and by examining protein domains
with PFAM [21]. For each positive hit found in F. borealis, we performed reciprocal BLAST against NCBI-refseq and UNIPROT to
confirm the annotation. Similarity scores were calculated over local alignments that exclude low-complexity regions, using
BLOSUM50 matrices. Protein phylogenies were established with the “one click” pipeline on the Phylogeny.fr server [28].
QUANTIFICATION AND STATISTICAL ANALYSIS
RNA-immunoprecipitation
For RNA-immunoprecipitation (RNA-IP) experiments, we measured the read coverage on targets corresponding to either snRNA
(U1, U2, U3, U4, U5, U6 and splice leader), rRNA (LSU, SSU and 5S), exons or introns. We excluded targets with fewer than ten reads
and we normalized the coverage by library size. For each target, the read enrichment (E) in IPs performed with anti-Sm and anti-TMG
IgGs was calculated as $E = \log_2\left(\frac{\mathrm{IgG} + 2}{\mathrm{control} + 2}\right)$, where IgG and control are the respective coverage values for the
IPs and for the control with serum only. Assuming that the majority of reads would correspond to non-specific binding events, we
used a two-component Gaussian mixture model to calculate the distribution of E in each sample, and to determine a threshold value
representing the background.
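As a worked illustration of the enrichment score, the sketch below computes E = log2((IgG + 2)/(control + 2)) for a single target. Several details are assumptions made for the example and are not specified in the text: coverage is normalized to reads per million mapped reads, the pseudocount of 2 is added after normalization, and all read counts and library sizes are invented. The read-count filter and the Gaussian mixture step are not reproduced here.

```python
import math


def normalized_coverage(read_count, library_size, scale=1_000_000):
    """Scale a raw read count by library size (assumed: reads per million)."""
    return read_count * scale / library_size


def enrichment(ip_coverage, control_coverage, pseudocount=2.0):
    """E = log2((IgG + 2) / (control + 2)) on normalized coverage values."""
    return math.log2((ip_coverage + pseudocount) / (control_coverage + pseudocount))


# Hypothetical target: 300 reads in a 3,000,000-read IP library,
# 50 reads in a 1,000,000-read serum-only control library.
ip = normalized_coverage(300, 3_000_000)
control = normalized_coverage(50, 1_000_000)
print(round(enrichment(ip, control), 3))
```

The pseudocount keeps E defined when a target has no reads in the control and damps the score of weakly covered targets, for which the ratio would otherwise be dominated by sampling noise.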
Splicing inhibitor assay
The intron retention rates were estimated genome-wide based on RNA-seq reads. For statistical assessment of splicing inhibitor
effects, 1067 distinct intron-exon regions (1002 for non-canonical introns and 65 for canonical introns) were selected because they
produced at least 50 exploitable sequencing reads in each of the DMSO control and the PlaB-treated sample. Of the 1067 pairwise
comparisons of retention rates between control and treated groups, 238 fulfilled the conditions for application of the z-test. Among these
comparisons, 227 showed significantly different retention rates (p < 0.05), with 223 corresponding to an increase and only 4 to a
decrease of the retention rate.
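A comparison of two retention rates of this kind can be sketched as a two-proportion z-test. The text does not state which variant of the z-test or which applicability conditions were used, so the pooled-variance, two-sided form and the "at least 5 retained and 5 spliced reads per sample" criterion below are assumptions, and the read counts are invented.

```python
import math


def z_test_applicable(retained, total, minimum=5):
    """Assumed rule of thumb: enough retained and enough spliced reads."""
    return retained >= minimum and (total - retained) >= minimum


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between proportions x1/n1 and x2/n2.

    Returns (z, p_value); a positive z means the first rate is higher.
    """
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if standard_error == 0:
        raise ValueError("test undefined when the pooled rate is 0 or 1")
    z = (x1 / n1 - x2 / n2) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical junction: 40 of 100 reads retain the intron after PlaB,
# 10 of 100 in the DMSO control.
if z_test_applicable(40, 100) and z_test_applicable(10, 100):
    z, p = two_proportion_z_test(40, 100, 10, 100)
    print(f"z = {z:.3f}, p = {p:.2e}")
```

With the treated sample given first, a positive z corresponds to increased retention under the inhibitor, which is the direction reported for 223 of the 227 significant comparisons.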
DATA AND CODE AVAILABILITY
The experimental datasets supporting the current study have not been deposited in a public repository but are available from the
Lead Contact on request.
SUPPLEMENTAL INFORMATION :
Figure S1: Logos of the main categories of non-canonical introns in F. borealis. Related to
Figure 1. In all cases, the intron tail is not T-rich like in canonical introns. In all cases,
conservation extends to the last triplet of the upstream exon.
Figure S2: Genome wide prevalence of non-canonical introns in F. borealis. Related to
Figure 1. A Annotated introns in two representative segments of the genome sequence, based
on alignment with transcriptome. RNA-seq coverage is represented over the predicted genes.
Terminal Inverted Repeats (TIRs) were scored with einverted [S1]. B Number of canonical and
non-canonical introns counted in 406 distinct genes having highly conserved sequences.
BLASTX alignments of these F. borealis genes with their vertebrate orthologues (refSEQ or
Swissprot) indicate that they are most likely full length. Intron annotation results from
alignments with transcripts or from visual inspection of the predicted protein product aligned with
vertebrate proteins. Overall, almost all genes contain one or more non-canonical introns. C
Conservation of intron positions in other deuterostomes. F. borealis GT/AG introns were
considered only if they could be precisely localized in alignments with their putative
orthologues of four other species, including an ascidian and the larvacean O. dioica. Not more
than six of the 76 introns have the same position in all five species, but 42 of them have
conserved position with at least one species. In contrast, none of the non-canonical introns of
F. borealis have a conserved position (bottom row). D Gene ontology analysis of F. borealis
conserved genes with or without canonical introns (same selection of putative full-length
genes as for B). For the GO analysis with the David package (BP1 GO terms), mouse
orthologous proteins were used to measure the level of enrichment vs the mouse proteome. We
observed overall enrichment for genes involved in biological regulation and development among
genes having canonical introns, but not among genes without canonical introns.
Figure S3: Groups of repeated introns in the F. borealis genome. Related to Figure 1.
Image of ClustalW matrix for 2330 sequences from 1165 introns (forward and reverse
orientation) repeated at least ten times in the genome (selection based on BLASTN in a
collection of 19214 introns). Each sequence is obtained by assembling the 50 bp long ends of
each intron with a 10N spacer in between. Darker dots correspond to higher sequence
similarity. Overall, three main groups of repeated introns are formed (dashed squares), and one
subgroup of highly conserved introns in each of them. The sequence logos for each group and
subgroup are provided, showing distinct consensus sequences but a conserved palindromic
arrangement between the 5’ and 3’ ends.
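The construction of the sequences compared in this matrix (two 50 bp intron ends joined by a 10N spacer, in forward and reverse orientation) can be sketched as follows. The function names are hypothetical, the toy intron is invented, and the handling of introns shorter than 100 bp, where the two ends overlap, is an assumption because the legend does not address it.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T and N."""
    return seq.translate(COMPLEMENT)[::-1]


def intron_end_signature(intron, end_length=50, spacer_length=10):
    """Join both intron ends with an N spacer; return (forward, reverse).

    Assumption: introns shorter than 2 * end_length are kept, so their two
    ends overlap in the signature.
    """
    seq = intron.upper()
    if len(seq) < end_length:
        raise ValueError("intron shorter than the requested end length")
    forward = seq[:end_length] + "N" * spacer_length + seq[-end_length:]
    return forward, reverse_complement(forward)


# Hypothetical 120-nt intron with AG...AC ends.
toy_intron = "AG" + "T" * 116 + "AC"
forward, reverse = intron_end_signature(toy_intron)
print(len(forward), forward[:5], reverse[:5])
```

Including both orientations, which gives 2330 sequences for 1165 introns, lets a palindromic arrangement between the 5’ and 3’ ends appear as similarity between the forward signature of one intron and the reverse signature of another.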
Figure S4: MITE transposition is the source of new introns in F. borealis. Related to
Figure 1. A Multiple sequence alignment of nine non-canonical introns (uppercase) and their
flanking sequences (lowercase), showing the conservation of Terminal Inverted Repeats (TIRs)
and Target Site Duplications (TSDs). The middle part of the alignment is not shown (NNN).
TIRs, TSDs, high similarity between copies and the absence of an internal orf are hallmarks of
MITEs (Miniature Inverted repeats Transposable Element) [S2]. Note that the beginning of
inverted repeats (IR) are shifted relative to the intron borders (-1 nt in 5’, -3 nts in 3’), in such
a way that the 5’ ss is included in the 5’ IR and the 3’ ss is included in the 3’ TSD. At the
bottom, arrows indicate the intervals that have been considered for identifying TSDs
genome-wide. B Mapping of TSDs in a collection of 3657 introns, by taking into account
successive intervals away from the intron borders. Results show an overrepresentation of the
dinucleotide TA, suggesting a Tc1/Mariner transposase could be involved in MITE integration
[S3]. C Breakdown of TSD identity for different intron classes. A majority of canonical introns
has no flanking TSDs, even when considering different combinations of intervals away from
intron borders. D The highly prevalent TAC or TAT triplet preceding non-canonical introns is
supposed to predate MITE insertion at a TA site, before it is eventually intronized. That the
TAC or TAT triplet indeed predated the insertion could be assessed by measuring the level of
conservation of this site across chordates. In phase 0 introns interrupting a protein coding
sequence, both TAC and TAT codons encode a tyrosine (Y). Alignments of genes highly
conserved between amphioxus (B. floridae) and F. borealis show that tyrosine residues are
equally well preserved, irrespective of whether their codon precedes or not an intron. This result
supports that the triplet did not experience specific evolution after intron acquisition. Similar
analysis for phase 1 introns supports that both AC or AT pairs of nucleotides adjacent to introns
did not evolve differently from those located far from introns (data not shown). E Model for
MITE transposition, based on integration site preferences: 1) a MITE and its IRs are recognized
by a transposase and excised, forming an active transposome; 2) the transposase cuts the target
DNA after a TA dinucleotide preceding a C or T; it is possible that exonic sites are preferred
due to compact gene arrangement or better chromatin accessibility [S4]; 3) after MITE
integration, the repair of flanking sequences generates the TSDs.
Figure S5: The F. borealis snRNA complement. Related to Figure 5. A Secondary structure
models of full-length snRNA identified in F. borealis, based on genome and transcriptome
mining. For U1, arrows show substitutions found in the variant U1b. For U5, blue arrows
indicate the position of the multiple alignment, gray arrows show substitutions found in A.
sicula or F. pellucida. Residues predicted to form a U2/U6 hybrid are highlighted in pink. B
Multiple alignment of the first stem-loop in U5, showing the fritillarid-specific sequence
change.
Figure S6: Species-specific changes in spliceosomal proteins. Related to Figure 5. A The
drawing represents Prp8 domain organization. The alignment shows sequence variation at two
motifs interacting with 5’ and 3’ss [S5], where fritillarid-specific substitutions are present. B
Conservation of the N-terminal domain of U1C. The drawing represents the secondary structure
based on published structure of U1 snRNP [S6] and the position of non-conservative
substitutions in fritillarids. C The F. borealis SR proteins. Left, domain organization of SR
proteins from human (H. sapiens), an ascidian (C. intestinalis) and three larvaceans (O. dioica,
O. albicans and F. borealis). Right, distribution of RX repeats in the protein. Darker shades
correspond to higher number of repeats. In F. borealis, the repertoire of SRSF2 proteins has
expanded and most of them have acquired an N-terminal, RS-rich extension.
Figure S7: Prevalence of non-canonical introns in two other fritillarid species. Related to
Figure 6. In the absence of transcriptome for these two species, introns were localized by
inspecting highly conserved gene sequences based on BLASTX alignments. In total, 252 and 175
introns of F. pellucida and A. sicula could be annotated. The trend is similar to that observed
in F. borealis, with a prevalence of TAC/TAT triplets before non-canonical but not canonical
introns. This prevalence is less obvious in nearly canonical introns.
Gene Intron Borders Splicing ? Intron Borders Splicing ?
Unknown Od1c GTAG yes
Od2nc GAAG no Od3c GTAG yes
Tektin domain Od6c GTAG yes Od4nc GAAG no
Od5nc GAAG no
DEAD helicase Od7c GTAG yes Od8nc GAAG no
Table S1: In vitro splicing assays with O. dioica introns. Related to Figure 4. Gene
fragments containing canonical and non-canonical introns were expressed in HEK293T cells
and splicing was monitored using RT-PCR with primers flanking the introns.
SUPPLEMENTAL REFERENCES :
S1. Rice, P., Longden, I., and Bleasby, A. (2000). EMBOSS: the European Molecular
Biology Open Software Suite. Trends Genet 16, 276-277.
S2. Yang, G., Nagel, D.H., Feschotte, C., Hancock, C.N., and Wessler, S.R. (2009). Tuned
for transposition: molecular determinants underlying the hyperactivity of a Stowaway
MITE. Science 325, 1391-1394.
S3. Tellier, M., Bouuaert, C.C., and Chalmers, R. (2015). Mariner and the ITm
Superfamily of Transposons. Microbiol Spectr 3, MDNA3-0033-2014.
S4. Huff, J.T., Zilberman, D., and Roy, S.W. (2016). Mechanism for DNA transposons to
generate introns on genomic scales. Nature 538, 533-536.
S5. Shi, Y. (2017). The Spliceosome: A Protein-Directed Metalloribozyme. J Mol Biol
429, 2640-2653.
S6. Kondo, Y., Oubridge, C., van Roon, A.M., and Nagai, K. (2015). Crystal structure of
human U1 snRNP, a small nuclear ribonucleoprotein particle, reveals the mechanism
of 5' splice site recognition. Elife 4.