Current Biology, Volume 24
Supplemental Information
The Genome of the Foraminiferan
Reticulomyxa filosa
Gernot Glöckner, Norbert Hülsmann, Michael Schleicher, Angelika A. Noegel,
Ludwig Eichinger, Christoph Gallinger, Jan Pawlowski, Roberto Sierra, Ursula Euteneuer,
Loic Pillet, Ahmed Moustafa, Matthias Platzer, Marco Groth, Karol Szafranski,
and Manfred Schliwa
Inventory of Supplemental Information
Supplemental Figures
Figure S1. Correlation of prediction score, blast score, and transcriptional activity, Related to Table 1
Figure S2. Dynein and Actin phylogenies, Related to Table 2
Figure S3. Phylogenetic relationships of 27 SAR (Stramenopiles, Alveolata, and Rhizaria) and 10 outgroup species (Archaeplastida and Haptophyta) based on 115 genes, Related to Figure 1
Figure S4. Churchill domain occurrence in the eukaryote tree of life, Related to Figure 3
Supplemental Tables
Table S1. Assemblies of raw reads derived from 454/Roche and Illumina sequencing, Related to Table 1
Table S3. The most prominent Pfam domains in the R. filosa genome, Related to Table 1
Table S4. Signaling components in two Rhizaria and N. gruberi, Related to Table 3
Table S5. Proteins with kinesin domains in the R. filosa genome, Related to Table 2
Table S7. Rhizaria specific gene families with potential functions, Related to Figure 3
Table S8. Potential bacterial horizontal gene transfer into the R. filosa genome, Related to Figure 4
Table S9. Genes in the R. filosa genome with potential associations with photosynthesis, Related to Figure 4
Supplemental Experimental Procedures
Supplemental References
Supplemental Figures
Figure S1, related to Table 1: Correlation of prediction score, blast score, and transcriptional activity.
Predicted genes were grouped according to their prediction score calculated in geneid. The groups were
then evaluated for BLAST matches against the complete protein RefSeq database at NCBI and for
matching reads in the RNAseq data set. The number of genes with a given BLAST score (red) or
prediction score (blue) is shown on the y-axis. The percentage of genes with BLAST scores and the
percentage of transcribed genes in each score category are plotted on the secondary axis.
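The grouping described in this legend can be sketched as follows. This is a minimal illustration, not the procedure actually used: the bin width, the record layout, and the example values are all assumptions, and the function name `summarize_by_score` is hypothetical.

```python
from collections import defaultdict


def summarize_by_score(genes, bin_width=10.0):
    """Group genes into prediction-score bins and report, per bin, the gene
    count and the percentage with a BLAST match or with RNAseq support."""
    bins = defaultdict(lambda: {"n": 0, "blast": 0, "rna": 0})
    for score, has_blast_hit, is_transcribed in genes:
        lower = int(score // bin_width) * bin_width  # lower edge of the bin
        entry = bins[lower]
        entry["n"] += 1
        entry["blast"] += bool(has_blast_hit)
        entry["rna"] += bool(is_transcribed)
    summary = {}
    for lower in sorted(bins):
        entry = bins[lower]
        summary[lower] = {
            "genes": entry["n"],
            "pct_blast": 100.0 * entry["blast"] / entry["n"],
            "pct_transcribed": 100.0 * entry["rna"] / entry["n"],
        }
    return summary


# Hypothetical records: (prediction score, has BLAST match, has RNAseq reads)
example_genes = [
    (3.2, False, False),
    (7.9, True, False),
    (12.5, True, True),
    (18.0, True, True),
]
summary = summarize_by_score(example_genes)
```

Each bin then carries the three quantities plotted in the figure: the gene count for the primary axis and the two percentages for the secondary axis.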
Figure S2, related to Table 2: Dynein and Actin phylogenies.
A: We found 10 different dynein proteins in the predicted gene set, all of which could be completely
reconstructed using available similarity information and transcript data. All dyneins could be assigned
to the nine defined categories, including the dynein members that are involved in flagellar motility.
The tree topology is based on an ML tree calculated with the LG model with estimated gamma
distribution parameters. Numbers at the branches indicate support from the ML calculation/MrBayes
posterior probabilities/neighbor-joining calculation using the Poisson model. Branches with asterisks
indicate differing topologies between the calculation methods.
B: ML tree of all available Foraminifera actins together with actins from selected species, calculated on an
alignment purged with Gblocks. The tree is rooted with an actin-like protein from B. natans. For better
readability, the branches are collapsed into R. filosa and Foram specific clusters. The clusters comprise the
following taxa: Foram_1 (Tretomphalus 5, Rosalina 1, Bolivina 2, Bulimina 2, Ammonia 5, Rotaliina 3,
Elphidium 4, Globigerinella 1, Hyalinea 3, Stainforthia 1, Reophax 1, Nonionella 5, Globobulimina 4);
Foram_2 (Amphisorus 6, Marginopora 4, Miliolidae 4, Miliammina 2); Foram_3 (Bathysiphon 5,
Toxisarcon 1); Foram_4 (Allogromia 6, Bathysiphon 2, Edaphoallogromia 3); Foram_5 (Sorites 4,
Edaphoallogromia 2, Allogromia 3, Trochammina 2, Haynesina 4, Reophax 2); Foram_6 (Ammonia 2,
Tretomphalus 1, Hyalinea 1, Elphidium 2); Foram_7 (Bolivina 1, Bulimina 1); Foram_8 (Tretomphalus 1,
Bolivina 1); Foram_9 (Rosalina 1, Bolivina 1). The MrBayes tree based on the Gblocks alignment yielded
the same clusters of related proteins after 2 million iterations, but the chains did not converge further
after 350,000 steps. This method was thus unable to resolve the higher-order relationships of the
Foraminifera. The neighbor-joining tree calculated with the JTT model also yielded results differing from
the tree shown. The general clustering of related sequences, however, was the same with all methods used.
In particular, the clustering of the H. sapiens, D. fasciculatum, A. thaliana, and N. gruberi proteins remained
stable. This analysis thus indicates high divergence of actin sequences in all Foraminifera genomes.
Figure S3, related to Figure 1: Phylogenetic relationships of 27 SAR (Stramenopiles, Alveolata
and Rhizaria) and 10 outgroup species (Archaeplastida and Haptophyta) based on 115 genes. The
position of R. filosa is highlighted. The tree was obtained as the highest-scoring maximum likelihood tree
using the LG+Γ model and empirical amino acid frequencies. The numbers at nodes indicate the topological
support estimated from 1000 bootstrap replicates and the Bayesian consensus posterior probabilities of
post-burn-in bipartitions. Solid circles represent maximum support.
Figure S4, related to Figure 3: Churchill domain occurrence in the eukaryote tree of life. The tree
represents the currently accepted phylogeny of eukaryotes. Dashed lines indicate uncertain relationships.
The occurrence of Churchill proteins is depicted as circles above the respective branches. Filled black
circles indicate presence in all or most currently known genomes; open circles indicate absence. The
Ecdysozoa in the Metazoa lineage have lost this domain. Grey in Amoebozoa indicates sporadic occurrence (only
1 of 9 completely sequenced genomes contains this domain). SAR: Stramenopiles, Alveolata, and
Rhizaria.
Supplemental Tables
Table S1, related to Table 1: Assemblies of raw reads derived from 454/Roche and Illumina
sequencing. The Illumina raw reads amounted to 22.9 GB and the 454 reads to 1.6 GB. The Newbler
assemblies used the 454 reads only, and the ABySS assemblies used the Illumina reads only. All other
assemblies made use of both data sets. Duplicon removal was done with in-house software, while the
diginorm assembly used the normalization procedure implemented in the diginorm package
(http://ged.msu.edu/papers/2012-diginorm/). The merged assembly is a manually curated merger of the
Newbler assembly with the ABySS 500 assembly. This merged assembly was used for further analysis.
Software                       Threshold  Number of contigs  Total size of contigs  Mean contig length
CLC                            200        149,873            114,102,752            761
ABySS                          200        126,521            119,674,360            946
ABySS                          500        75,748             102,121,417            1,348
Newbler                        --         124,291            94,262,983             758
Newbler                        500        47,701             73,360,904             1,538
CLC with removal of duplicons  200        64,080             94,361,471             1,473
CLC diginorm normalized reads  200        66,089             91,058,782             1,378
Merged assembly                500        45,292             100,460,861            2,215
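The summary columns of Table S1 are related by simple arithmetic, which can be sketched as below. This is an illustration only: it assumes that the "threshold" column is a minimum contig length applied before counting (the source does not define it), and the function names and toy contig lengths are hypothetical. The three table rows are copied from Table S1.

```python
def assembly_stats(contig_lengths, min_length=0):
    """Count, total size and mean length of contigs at or above min_length."""
    kept = [n for n in contig_lengths if n >= min_length]
    total = sum(kept)
    mean = round(total / len(kept)) if kept else 0
    return {"contigs": len(kept), "total": total, "mean": mean}


def mean_contig_length(n_contigs, total_size):
    """Mean contig length from the two summary columns of Table S1."""
    return round(total_size / n_contigs)


# Toy contig lengths (hypothetical), filtered at an assumed 500-base threshold.
toy = assembly_stats([150, 400, 600, 1000], min_length=500)

# (number of contigs, total size of contigs) for three rows of Table S1.
table_rows = {
    "CLC 200": (149_873, 114_102_752),
    "ABySS 500": (75_748, 102_121_417),
    "Newbler 500": (47_701, 73_360_904),
}
means = {name: mean_contig_length(n, size) for name, (n, size) in table_rows.items()}
```

For these three rows the computed means reproduce the published mean contig lengths (761, 1,348 and 1,538).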
Table S3, related to Table 1: The most prominent Pfam domains in the R. filosa genome.
# of detected Pfams  ID       IPR_number  name                                                           GO
11464                PF00400  IPR019781   WD40 repeat, subgroup
964                  PF00515  IPR001440   Tetratricopeptide TPR-1                                        Molecular Function: protein binding (GO:0005515)
598                  PF00069  IPR017442   Serine/threonine-protein kinase-like domain                    Molecular Function: protein kinase activity (GO:0004672), Molecular Function: ATP binding (GO:0005524), Biological Process: protein phosphorylation (GO:0006468)
265                  PF01302  IPR000938   Cytoskeleton-associated protein, Gly-rich domain
231                  PF02176  --          TRAF type zinc finger
194                  PF00503  IPR001019   Guanine nucleotide binding protein (G-protein), alpha subunit  Molecular Function: signal transducer activity (GO:0004871), Biological Process: G-protein coupled receptor protein signaling pathway (GO:0007186), Molecular Function: guanyl nucleotide binding (GO:0019001)
193                  PF00071  IPR013753   Ras
186                  PF05729  --          NACHT domain
184                  PF00023  IPR002110   Ankyrin repeat                                                 Molecular Function: protein binding (GO:0005515)
180                  PF01344  IPR006652   Kelch repeat type 1                                            Molecular Function: protein binding (GO:0005515)
172                  PF00076  IPR000504   RNA recognition motif domain                                   Molecular Function: nucleic acid binding (GO:0003676)
113                  PF00443  IPR001394   Peptidase C19, ubiquitin carboxyl-terminal hydrolase 2         Molecular Function: ubiquitin thiolesterase activity (GO:0004221), Biological Process: ubiquitin-dependent protein catabolic process (GO:0006511)
110                  PF01535  IPR002885   Pentatricopeptide repeat
106                  PF00225  IPR001752   Kinesin, motor domain                                          Molecular Function: microtubule motor activity (GO:0003777), Molecular Function: ATP binding (GO:0005524), Biological Process: microtubule-based movement (GO:0007018)
99                   PF00805  IPR001646   Pentapeptide repeat
96                   PF00091  IPR003008   Tubulin/FtsZ, GTPase domain                                    Cellular Component: protein complex (GO:0043234), Biological Process: protein polymerization (GO:0051258)
96                   PF00271  IPR001650   Helicase, C-terminal                                           Molecular Function: nucleic acid binding (GO:0003676), Molecular Function: helicase activity (GO:0004386), Molecular Function: ATP binding (GO:0005524)
95                   PF00169  IPR001849   Pleckstrin homology domain                                     Molecular Function: protein binding (GO:0005515)
94                   PF00036  IPR018248   EF-hand
90                   PF00004  IPR003959   ATPase, AAA-type, core                                         Molecular Function: ATP binding (GO:0005524)
Table S4, related to Table 3: Signaling components in two Rhizaria and N. gruberi.
R. filosa B. natans N. gruberi Pfam domain description
Cyclic nucleotide
34 11 121 PF00211 Adenylyl cyclase class-3/4/guanylyl cyclase
39 14 7 PF00233 3'5'-cyclic nucleotide phosphodiesterase,
75 56 10 PF00027 Cyclic nucleotide-binding domain
PIP signalling
40 32 15 PF00454 Phosphatidylinositol 3-/4-kinase, catalytic
31 24 8 PF01504 Phosphatidylinositol-4-phosphate 5-kinase,
14 5 12 PF00613 Phosphoinositide 3-kinase, accessory (PIK)
0 1 3 PF00387 Phospholipase C, phosphatidylinositol-specific,
0 2 6 PF00388 Phospholipase C, phosphatidylinositol-specific,
Calcium signalling
69 72 88 PF00168 C2 calcium-dependent membrane targeting
94 81 57 PF00036 EF-hand
22 26 37 PF00122 ATPase, P-type, ATPase-associated domain
3 22 23 PF00612 IQ motif, EF-hand binding site
10 20 12 PF01699 Sodium/calcium exchanger membrane region
Heterotrimeric G
154 40 60 PF00503 Guanine nucleotide binding protein (G-
82 34 231 PF00615 Regulator of G protein signalling
small G proteins
193 83 233 PF00071 Ras
9 2 10 PF00616 Ras GTPase-activating protein
49 8 29 PF00617 Guanine-nucleotide dissociation stimulator
8 1 15 PF00618 Ras-like guanine nucleotide exchange factor, N-
25 31 36 PF00025 ARF/SAR superfamily
20 17 8 PF01412 Arf GTPase activating protein
15 5 4 PF01369 SEC7-like
43 43 28 PF00621 Dbl homology (DH) domain
34 28 25 PF00620 Rho GTPase-activating protein domain
3 5 4 PF02263 Guanylate-binding protein, N-terminal
22 37 25 PF01926 GTP-binding domain, HSR1-related
Phosphate
598 353 463 PF00069 Serine/threonine-protein kinase-like domain
34 81 20 PF07714 Serine-threonine/tyrosine-protein kinase
50 60 46 PF00149 Metallo-dependent phosphatase
70 50 38 PF03372 Endonuclease/exonuclease/phosphatase
42 49 40 PF00782 Dual specificity phosphatase, catalytic domain
6 10 7 PF00328 Histidine phosphatase superfamily, clade-2
12 8 6 PF00244 14-3-3 domain
11 43 11 PF00498 Forkhead-associated (FHA) domain
Histidine kinase
6 8 32 PF00072 Signal transduction response regulator,
4 2 27 PF00512 Signal transduction histidine kinase, subgroup
Sensors
1 9 61 PF00989 PAS fold
0 0 5 PF08376 Nitrate/nitrite sensing protein
0 11 5 PF04940 BLUF
Table S5, related to Table 2: Proteins with kinesin domains in the R. filosa genome. The R. filosa
genome contains 88 domains that match a kinesin domain architecture at least in part. A large number
of the kinesin domain-containing proteins are likely pseudogenes. Pseudogenization was assumed if
i) the domain was non-functional because parts of it were missing although it resided in an
apparently coding region, ii) the domain resided in an otherwise low-complexity region without splice
signals for proper excision of introns, iii) unspliced transcripts covered small (20-40 bases) predicted
intron regions, or iv) domains associated with transposon activity were present.
Domain ID  gene name  protein length  localization of domain  length of kinesin domain  additional domain
52415_t solexa3720325_2.r1.exp_7 1032 3' complete 330 SMC; 5' of kinesin
52423_t contig21830_1.exp_5 746 3' complete 290 no
52476_t solexa3169777_1.exp_3 292 3' complete 230 no
52408_t contig09993_1.exp_11 650 3' complete 360 SMC; 5' of kinesin
52445_t solexa3798899_2.r1.exp_12 306 middle; complete 255 no
52446_t solexa3758974_7.r1.exp_13 997 5' complete 380 SMC
52459_t contig78665_1.exp_5 1025 5' complete 400 SMC
52402_t contig26551_1.exp_7 726 5' complete; 260 SMC
52417_t contig18412_1.exp_2 692 5' incomplete at 5' 300 no
52436_t contig30408_1.exp_1 1087 middle; complete start at aa 460 340 SMC; 5' of kinesin
52438_t solexa3784960_1.f1.exp_2 649 3' complete; start at aa 320 220 SMC; additional weak kinesin hit 5'
52434_t solexa3778539_1.f1.exp_1 636 5' complete 240 no
52449_t solexa3808122_1.exp_3 1177 3' complete; start at 630 360 no
52428_t contig24557_1.f1.exp_2 852 middle; complete start at aa 150 280 no
52404_t solexa3725653_1.f1.exp_14 223 5' complete 223 no
52409_t solexa3777916_1.exp_9 527 middle; complete start at aa 75 315 no
52433_t solexa3746604_1.exp_2 419 middle; complete start at aa 90 300 no
52427_t contig24185_1.f1.exp_6 1376 5' complete 360 UBQ (ubiquitin homologs)
52488_t contig40075_1.exp_1 556 5' complete 435 no
52463_t solexa1637608_1.r1.exp_3 905 5' complete 220 no
52412_t solexa3718088_1.f1.exp_9 842 5' complete 375 SMC
52457_t solexa3769552_1.exp_4 920 5' complete 275 SMC
52443_t contig32732_1.exp_3 798 5' complete 330 no
52447_t solexa3726968_1.f1.exp_4 701 5' complete 360 no
52462_t solexa3734740_1.f1.exp_2 240 3' incomplete at 3' 195 no
52431_t contig25635_1.f1.exp_3 485 middle complete 335 no
52448_t contig39107_1.exp_8 292 middle; complete 250 no
52456_t contig75133_1.f1.exp_7 1167 5' complete 265 SMC; BAH
52442_t solexa2602295_1.r1.exp_1 398 complete 280 no
52453_t solexa3763984_3.r1.exp_5 636 5' complete 310 no
52405_t solexa2052427_1.exp_1 1249 middle incomplete at 5' start at aa 200 200 reverse transcriptase domain
52425_t contig23116_1.f1.exp_4 660 5' complete 360 no
52450_t solexa1407312_1.exp_3 625 5' complete; start at aa 30 300 no
52478_t solexa3798613_1.exp_2 311 5' incomplete at 3' 200 no
52485_t contig09348_1.exp_4 626 5' complete 300 no
52487_t contig33946_1.exp_1 416 5' incomplete at 5' 285 no
52467_t contig98232_1.exp_1 329 start at 65; 3' incomplete 210 no
52468_t contig96868_1.exp_1 1829 middle incomplete at 3', starting at aa 250 250 reverse transcriptase domain
52471_t solexa3789411_1.f1.exp_1 578 complete, starting at aa 100 265 no
52440_t solexa1708055_1.exp_6 282 3' complete 222 no
52458_t solexa3753414_1.f1.exp_9 860 5' complete 375 SMC
52414_t solexa3734507_1.r1.exp_1 654 5' complete 340 no
52426_t solexa3588784_1.f1.exp_2 405 3' complete 360 no
52432_t solexa3731590_1.exp_1 342 3' complete 290 no
52460_t solexa3759597_1.f1.exp_2 654 5' complete 265 no
52437_t contig28647_1.exp_4 681 5' complete 470 SMC
52444_t contig33457_1.exp_2 297 complete 290 no
52464_t contig92438_1.f1.exp_1 518 3' incomplete at 3' 248 no
52403_t solexa3756928_1.r1.exp_16 714 5' complete; start at aa 40 240 no
52461_t solexa3777157_1.f1.exp_2 217 incomplete at 5' and 3' 217 no
52486_t contig27187_1.exp_1 326 3' incomplete at 3' 250 no
52406_t solexa3746845_1.r1.exp_1 198 5' incomplete at 5'3 140 no
52407_t solexa3767652_1.f1.exp_4 180 5' incomplete at 5' 95 no
52410_t solexa3729075_1.exp_2 555 3' incomplete at 3'; start at aa 375 125 no
52411_t contig15290_1.exp_8 244 5' incomplete at 5' 185 no
52413_t contig16113_1.exp_1 312 5' incomplete at 5' 120 no
52416_t solexa3777510_1.f1.exp_1 381 3' incomplete at 3' 105 no
52418_t solexa3717515_1.f1.exp_1 366 5' incomplete at 5' 80 no
52419_t solexa902554_1.f1.exp_2 177 3' incomplete at 3' start at aa 35 110 no
52420_t solexa3746433_8.r1.exp_7 309 xx
52421_t solexa3724209_1.r1.exp_1 1036 5' incomplete at 5' 75 SMC
52422_t solexa1497943_1.exp_1 735 5' incomplete at 5' 170 no
52424_t contig22600_1.exp_2 698 5' incomplete at 5' 140 SMC
52429_t solexa3808444_1.exp_4 146 5' incomplete at 5' and 3' 95 no
52430_t solexa2609972_1.exp_1 126 5' incomplete at 5' 75 no
52435_t solexa2984574_1.exp_8 163 5' incomplete at 5' 163 no
52439_t contig29978_1.exp_1 441 5' incomplete at 5' 105 no
52441_t contig31234_1.exp_1 430 5' incomplete at 5' 105 no
52451_t solexa3803581_1.f1.exp_5 340 5' incomplete at 5' 40 no
52452_t solexa3723856_1.exp_3 169 3' incomplete at 3' and 5' 65 no
52454_t solexa3775391_1.exp_2 257 5' incomplete at 5' and 3' 140 no
52455_t solexa3781414_1.r1.exp_2 140 no significant hit xx no
52465_t contig93266_1.exp_3 266 start at 70; 3' incomplete 190 no
52466_t solexa2042653_1.exp_1 195 Incomplete at 5' 170 no
52469_t solexa3731982_1.exp_1 531 5' incomplete at 5' 170 no
52470_t solexa3725514_1.exp_3 263 5' incomplete at 5' and 3' 70 no
52472_t contig107696_1.exp_2 214 5' incomplete at 3' 180 no
52473_t solexa3719256_1.exp_1 258 3' incomplete at 3' 100 no
52474_t solexa3786224_1.exp_1 170 5' incomplete at 5' 160 no
52475_t solexa3716342_1.exp_1 588 5' incomplete at 5' 180 no
52477_t solexa3723855_1.exp_1 149 5' incomplete at 5' 120 no
52479_t solexa3740520_1.f1.exp_2 155 5' incomplete at 5' and 3' 100 no
52480_t solexa3746785_1.f1.exp_1 158 5' incomplete at 3' 95 no
52481_t solexa3750744_1.r1.exp_1 189 5' incomplete at 3' 135 no
52482_t solexa3780638_1.f1.exp_1 248 5' incomplete at 3' 180 no
52483_t solexa3783956_1.exp_1 239 middle incomplete at 5' 3' 30 no
52484_t solexa3781013_1.exp_1 470 5' incomplete at 5' 105 no
52489_t solexa3788148_1.f1.exp_1 444 5' incomplete at 5' 105 SMC
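The four pseudogenization criteria listed in the legend of Table S5 can be expressed as a simple rule check. The sketch below is illustrative only: the class, its field names, and the example record are hypothetical, and it assumes that each criterion has already been evaluated to a yes/no flag by upstream analysis.

```python
from dataclasses import dataclass


@dataclass
class KinesinHit:
    """Hypothetical per-domain annotation flags (names are illustrative)."""
    domain_id: str
    truncated_in_coding_region: bool            # criterion i
    low_complexity_without_splice_signals: bool  # criterion ii
    unspliced_reads_over_small_introns: bool     # criterion iii
    transposon_associated_domain: bool           # criterion iv


def pseudogene_criteria(hit):
    """Return the labels (i-iv) of the legend criteria that a hit satisfies."""
    checks = [
        ("i", hit.truncated_in_coding_region),
        ("ii", hit.low_complexity_without_splice_signals),
        ("iii", hit.unspliced_reads_over_small_introns),
        ("iv", hit.transposon_associated_domain),
    ]
    return [label for label, met in checks if met]


def is_likely_pseudogene(hit):
    """A hit counts as a likely pseudogene if any criterion is met."""
    return bool(pseudogene_criteria(hit))


# Invented example: a truncated domain next to a transposon-associated domain.
example = KinesinHit("example_1", True, False, False, True)
reasons = pseudogene_criteria(example)
```

Treating any single criterion as sufficient follows the wording of the legend; whether the criteria were weighted or combined differently in practice is not stated in the source.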
Table S7, related to Figure 3: Rhizaria specific gene families with potential functions. The families were
clustered from the complete B. natans and R. filosa protein sets together with representatives from major
eukaryote branches as in Figure 3.
gene family  number of family members  number of R. filosa members  with pfam hits  without pfam hit  PFAM ID  PFAM short description
ORTHOMCL53 123 122 43 79 PF01535 PPR
ORTHOMCL1552 16 11 9 2 PF00454 PI3_PI4_kinase
ORTHOMCL1558 16 15 9 6 PF00566 TBC
ORTHOMCL2032 14 12 9 3 PF00233 PDEase_I
ORTHOMCL3779 10 9 9 0 PF00233 PDEase_I
ORTHOMCL895 21 20 9 11 PF07534 TLD
ORTHOMCL1555 16 12 7 5 PF00063 Myosin_head
ORTHOMCL3823 10 9 7 2 PF01433 Peptidase_M1
ORTHOMCL4532 9 8 7 1 PF00027 cNMP_binding
ORTHOMCL3837 10 9 6 3 PF00069 Pkinase
ORTHOMCL6349 7 6 6 0 PF00789 UBX
ORTHOMCL1547 16 15 5 10 PF00443 UCH
ORTHOMCL2014 14 13 5 8 PF00622 SPRY
ORTHOMCL6255 7 6 5 1 PF00076 RRM_1
ORTHOMCL7649 6 5 5 0 PF04488 Gly_transf_sug
ORTHOMCL7668 6 5 5 0 PF00520 Ion_trans
ORTHOMCL823 22 21 5 16 PF04969 CS
ORTHOMCL3803 10 7 4 3 PF02434 Fringe
ORTHOMCL4556 8 7 4 3 PF00632 HECT
ORTHOMCL5411 7 4 4 0 PF00397 WW
ORTHOMCL6295 7 5 4 1 PF00018 SH3_1
ORTHOMCL7570 6 5 4 1 PF00069 Pkinase
ORTHOMCL9437 5 4 4 0 PF00069 Pkinase
ORTHOMCL3766 10 9 3 6 PF03016 Exostosin
ORTHOMCL416 32 26 3 23 unspecific
ORTHOMCL6427 6 4 3 1 PF00023 Ank
ORTHOMCL9298 5 4 3 1 PF00233 PDEase_I
ORTHOMCL9342 5 4 3 1 PF00515 TPR_1
ORTHOMCL9388 5 4 3 1 PF09229 Aha1_N
Table S8, related to Figure 4: Potential bacterial horizontal gene transfer into the R. filosa genome. Predicted proteins initially found to be
related to bacterial proteins were analysed further, and proteins with phylogenetically uncertain relationships were removed. Transcriptional activity was
assessed with the RNAseq data.
R. filosa ID                                  Contig length   B. natans ID                                    Description                                         Score  e-value   Coverage/comment                                   Spliced; transcribed  Most closely related bacteria according to phylogeny
strong hits between RF, BN, and bacteria
contig26942_1.exp_8                           6897            jgi|Bigna1|89418|estExt_fgenesh1_pg.C_490023    coagulation factor                                  670    2.7e-97   59; SAR group                                      yes; yes              Sphingomonas
solexa2203579_1.exp_3                         4876            jgi|Bigna1|92553|estExt_fgenesh1_pm.C_310025    succinate semialdehyde dehydrogenase                880    5.1e-128  62                                                 yes; no               Leptospira
solexa3727020_1.f1.exp_1                      3995            jgi|Bigna1|133119|aug1.20_g7827                 DNA gyrase subunit B                                1095   7.1e-229  56                                                 yes; no               Rhodobacter
solexa3737624_1.f1.exp_1                      3265            jgi|Bigna1|53113|estExt_Genewise1Plus.C_150167  alanyl-tRNA synthetase                              2118   2.7e-229  30                                                 yes; no               Magnetospirillum
solexa3788450_1.exp_5                         7000            jgi|Bigna1|58293|fgenesh1_pm.72_#_8             S-(hydroxymethyl)glutathione dehydrogenase          478    7.5e-44   60                                                 no; yes               Bacillus
solexa603061_1.f1.exp_1                       3250            jgi|Bigna1|50229|estExt_Genewise1.C_710017      Pyridoxal biosynthesis lyase                        996    9.6e-99   148; Additionally in Galdieria                     no; no                Chloracidobacterium
strong hit between RF and BN but not so to the bacterial gene
contig00256_1.f1.exp_3                        4457            jgi|Bigna1|52767|estExt_Genewise1Plus.C_110075  ATP-dependent protease                              331    2.7e-44   54; Additionally in Perkinsus and Polysphondylium  yes; yes              Sphigbodium
contig38422_1.exp_1 and contig49540_1.exp_31  3002 and 18324  jgi|Bigna1|92709|estExt_fgenesh1_pm.C_550014    chaperonin GroEL                                    537    4.2e-50   73; split between two contigs                      yes; yes              Proteobacteria
contig75576_1.f1.exp_4                        5549            jgi|Bigna1|87362|estExt_fgenesh1_pg.C_190162    medium-chain-fatty-acid--CoA ligase                 347    4.0e-47   70                                                 yes; yes              Euryarchaeota
contig84371_1.exp_5                           7986            jgi|Bigna1|92763|estExt_fgenesh1_pm.C_660005    Alpha-glucosidase                                   747    2.3e-72   45; B. natans and R. filosa do not cluster         yes; yes              Proteobacteria
solexa3356262_1.exp_7                         9042            jgi|Bigna1|140618|aug1.57_g15326                aminopeptidase                                      569    2.1e-145  142; bacteria and green algae                      yes; no               Proteobacteria
solexa3669665_1.exp_9                         9277            jgi|Bigna1|136701|aug1.35_g11409                hypoxanthine phosphoribosyl-transferase             467    1.1e-42   76                                                 yes; no               Ruminococcus
solexa3725098_1.f1.exp_8                      5179            jgi|Bigna1|85465|estExt_fgenesh1_pg.C_40142     50S ribosomal protein L1                            352    1.7e-30   80                                                 yes; no               Proteobacteria
solexa3727961_1.f1.exp_3                      9094            jgi|Bigna1|38800|e_gw1.28.155.1                 uracil-DNA glycosylase                              503    1.7e-46   68                                                 yes; yes              Firmicutes
solexa3758563_1.f1.exp_2                      5026            jgi|Bigna1|57898|fgenesh1_pm.35_#_10            recombination factor protein                        1350   3.0e-136  31                                                 no; no                Bacteroidetes
solexa3788306_1.f1.exp_2                      4401            jgi|Bigna1|92730|estExt_fgenesh1_pm.C_600006    glyceraldehyde-3-phosphate dehydrogenase            901    1.1e-88   39                                                 yes; no               Proteobacteria
strong hits between RF and bacteria only
contig13569_1.f1.exp_4 and 6                  6639            None                                            citrate synthase                                    572    8.2e-54   65                                                 yes; yes              Proteobacteria
solexa3725528_1.exp_5                         12494           jgi|Bigna1|53997|estExt_Genewise1Plus.C_270061  ATP dependent protease                              384    6.9e-34   78                                                 yes; yes              Proteobacteria
solexa3775854_1.f1.exp_1                      3565            None                                            UDP-N-acetylglucosamine 1-carboxyvinyl-transferase  1163   1.9e-116  31                                                 no; no                Proteobacteria
solexa3776817_1.f1.exp_3                      17197           None                                            homospermidine synthase                             1456   1.7e-147  82                                                 yes; yes              Proteobacteria
Table S9, related to Figure 4: Genes in the R. filosa genome with potential associations with
photosynthesis. The R. filosa genome and the transcriptomes of other Foraminifera were searched for the
presence or absence of proteins needed for photosynthesis. Transcriptomic sequences were translated
into amino acid sequences using TranSeq [S1] and searched against our custom database using BlastP [S2].
For each gene, homologous sequences with an e-value lower than 1e-25 were aligned using MAFFT [S3, S4],
and ambiguous positions were discarded using Gblocks [S5]. For each alignment, maximum likelihood
phylogenetic analysis was performed under the PROTCATLGF model in the RAxML-HPC software [S6],
and the reliability of internal branches was assessed using the RAxML rapid bootstrap method with 100
replicates [S7]. Transcriptional activity was assessed with the RNAseq data set.
Gene identifier               description                                             e-value    % homology  transcribed
Cellular component plastid
solexa3744453_1.f1.exp_9      malate dehydrogenase                                    8.37E-81   70.10%      yes
solexa2379099_1.exp_3/4       (Mitochondrial) thylakoid carrier protein               2.39E-35   56.90%      yes
solexa1334423_1.exp_4         cell division cycle 5-like protein                      1.19E-120  69.15%      yes
contig78979_1.exp_5           serine threonine-protein kinase afc3                    1.96E-22   63.60%      yes
solexa2922822_1.exp_1-3       Serine protease                                         1.85E-62   60.25%      yes
contig43553_1.exp_1           lim-type zinc finger-containing protein                 1.24E-33   63.05%      yes
contig43313_1.exp_2           monoglyceride lipase                                    5.46E-37   61.55%      yes
contig20009_1.f1.exp_7        3-ketoacyl- thiolase                                    3.80E-148  71.95%      yes
solexa3788080_1.r1.exp_7      prohibitin                                              1.14E-103  75.90%      yes
solexa3713554_1.exp_3         40s ribosomal protein s9                                2.76E-62   81.05%      yes
solexa3763975_1.f1.exp        ribosomal protein s11                                   4.88E-56   77.35%      Not predicted
solexa2695319_1.r1.exp_6      60s ribosomal protein l10                               6.25E-95   75.60%      yes
solexa3734361_1.r1.exp_4      beach domain-containing protein lvsa-like               8.95E-44   73.45%      yes
solexa3736631_1.exp_8         glycosyltransferase                                     3.02E-121  69.75%      yes
contig98685_1.exp             uncharacterized protein                                 1.37E-30   65.30%      Not predicted
solexa3751523_8.r1.exp        40s ribosomal protein s13                               7.30E-64   78.90%      no
contig92542_1.f1.exp_3        40s ribosomal protein s23                               1.24E-72   84.80%      yes
solexa3769848_4.r1.exp_23     adenosine kinase 2                                      2.33E-119  67.40%      yes
solexa3758350_1.exp_1         aspartate aminotransferase                              1.37E-87   70.15%      no
solexa3761190_2.r1.exp_13     6-phosphogluconate dehydrogenase                        7.58E-173  75.40%      no
solexa3770195_1.exp_1         ATP synthase beta subunit                               1.55E-122  79.05%      yes
solexa1549069_1.exp_6         serine threonine-protein phosphatase 2a activator-like  8.68E-41   72.70%      yes
solexa3752877_1.f1.exp_5      ATP-dependent metalloprotease                           5.64E-58   66.00%      yes
solexa3786327_1.f1.exp_10     RNA-helicase                                            0          82.85%      yes
solexa3784907_1.f1.exp_5      ruvb-like 1-like                                        1.15E-86   78.55%      yes
solexa3798899_2.r1.exp_12/13  kinesin-related protein klpa-like protein               2.57E-63   69.30%      yes
solexa3763509_1.r1.exp_8      phosphoglycerate mutase det1-like                       7.07E-29   57.40%      no
solexa3776615_1.f1.exp_3      tyrosine-trna ligase                                    1.69E-99   78.30%      yes
contig00135_1.exp_3/4         cullin 4b                                               2.83E-39   64.30%      yes
solexa3793453_1.exp_2         tubulin alpha-1 chain                                   2.99E-34   77.25%      yes
solexa3763454_1.f1.exp_8/9    mitochondrial processing peptidase                      1.47E-101  61.60%      yes
solexa3138172_1.exp_4         rossmann-fold NAD -binding domain-containing protein    9.83E-12   59.60%      yes
solexa3713448_1.f1.exp_1      aldehyde dehydrogenase                                  7.16E-21   61.45%      yes
contig47692_1.f1.exp_8-12     tetratricopeptide repeat protein                        6.16E-51   64.15%      no
contig20371_1.r1.exp_14       20s proteasome beta 6 subunit                           2.31E-12   62.05%      yes
solexa3718015_1.f1.exp_5      h aca ribonucleoprotein complex subunit 3-like protein  3.00E-18   76.00%      yes
contig80470_1.f1.exp_8        DNA-directed RNA polymerase                             4.40E-77   62.30%      yes
solexa3713876_1.exp_1         alpha-tubulin                                           1.41E-25   83.65%      yes
solexa3727856_1.exp_1         alpha tubulin                                           7.41E-39   83.45%      yes
solexa3752366_1.exp_4         beta-actin                                              3.24E-33   94.30%      yes
solexa3734891_1.exp           alpha tubulin                                           1.03E-34   84.65%      Not predicted
solexa3726936_1.exp_1/2       ATP synthase cf1 beta subunit                           3.54E-120  79.70%      yes
contig24948_1.exp             citrate synthase i family protein                       8.28E-70   76.20%      Not predicted
Molecular function photosynthesis
contig77077_1.exp_9           cytosolic fructose-1 6-bisphosphatase                   1.84E-95   76.45%      yes
solexa3778985_1.exp_3         NADH dehydrogenase                                      7.80E-69   81.70%      yes
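The e-value cutoff named in the legend of Table S9 (homologous sequences with an e-value lower than 1e-25 were retained for alignment) can be sketched as a filter over BLAST hits. This is a minimal sketch under assumptions: it presumes the standard 12-column tabular BLAST output, in which the e-value is the 11th column; the source does not state which output format was used. The function name and all sequence identifiers below are invented.

```python
def hits_below_evalue(tabular_lines, max_evalue=1e-25):
    """Map each query ID to the subject IDs of hits with e-value < max_evalue.

    Assumes tab-separated BLAST output with the e-value in the 11th column.
    """
    kept = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue < max_evalue:
            kept.setdefault(query, []).append(subject)
    return kept


# Invented example hits; only the column layout matters here.
example = [
    "geneA\tsubj1\t70.1\t300\t85\t2\t1\t300\t5\t304\t8.4e-81\t250",
    "geneA\tsubj2\t63.6\t120\t40\t1\t1\t120\t9\t128\t2.0e-22\t95",
    "geneB\tsubj3\t82.9\t500\t80\t0\t1\t500\t1\t500\t0.0\t900",
]
selected = hits_below_evalue(example)
```

In a workflow like the one described, the sequences selected per gene would then be passed on to the alignment and trimming steps.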
Supplemental Experimental Procedures
Purification from contaminating species
Cell bodies were taken from the culture and washed twice with fresh commercial table water (Volvic).
The cell bodies were then transferred to 10 cm Petri dishes containing Volvic water and incubated
for 3 days at room temperature. Thereafter, the cells were harvested by centrifugation and reincubated
overnight in Volvic water containing PSN antibiotic mixture (Life Technologies).
Assessment of contamination with foreign DNA
All contig sequences were screened for potential contaminating sequences from common
freshwater bacteria, protozoa, or higher plants (the food source) using BLAST against the protein RefSeq
library from NCBI. No contamination from a freshwater species was detected in this way, although we
noticed the presence of a genome from a Rickettsia-like species at only half the coverage of the
nuclear genome. A further test of the successful purification of nuclei was a screen for sequences
likely derived from mitochondria. We found only one raw read that likely stemmed from the
mitochondrial genome.
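A screen of this kind can be sketched as a lookup of each contig's best BLAST hit against a list of suspect organisms. The sketch below is purely illustrative: the function name, the contig identifiers, the organism names, and the list of suspect terms are hypothetical, and the assumption that each contig has already been reduced to the organism name of its best hit is ours.

```python
def flag_suspect_contigs(best_hits, suspect_terms):
    """Return contigs whose best-hit organism name contains a suspect term.

    best_hits maps a contig ID to the organism name of its best BLAST hit.
    """
    flagged = {}
    for contig, organism in best_hits.items():
        name = organism.lower()
        matches = [term for term in suspect_terms if term.lower() in name]
        if matches:
            flagged[contig] = matches
    return flagged


# Invented best hits for two hypothetical contigs.
best_hits = {
    "contigA": "Rickettsia bellii",
    "contigB": "Bigelowiella natans",
}
flagged = flag_suspect_contigs(best_hits, ["Rickettsia", "Escherichia"])
```

Flagged contigs would still need manual inspection, for example by comparing their read coverage with that of the nuclear genome as described above.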
Search for genes involved in photosynthesis
We searched the genome for genes with photosynthetic affinity using a database of genes derived
from whole genomes of cyanobacteria, the genome of the secondarily phototrophic B. natans, and the
Arabidopsis nuclear genome. Additionally, transcriptomic data were annotated using Blast2GO [43], and
45 sequences were identified as putatively related to chloroplast maintenance or activity (Table S9).
These sequences were then compared to a comprehensive prokaryotic and eukaryotic database, and
rigorous phylogenetic analyses were conducted on each gene separately to identify its precise origin.
In some cases (10/45), the number of homologous sequences was too low to perform
phylogenetic analyses; all other sequences (35/45) had their highest affinities with other Foraminifera
and/or Rhizaria, which suggests that they did not originate from contaminant organisms.
Supplemental References
S1. Rice, P., Longden, I., and Bleasby, A. (2000). EMBOSS: The European Molecular Biology Open
Software Suite. Trends Genet. 16, 276-277.
S2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic Local Alignment
Search Tool. J. Mol. Biol. 215, 403-410.
S3. Katoh, K., Kuma, K., Toh, H., and Miyata, T. (2005). MAFFT version 5: improvement in accuracy of
multiple sequence alignment. Nucleic Acids Res. 33, 511-518.
S4. Katoh, K., Misawa, K., Kuma, K., and Miyata, T. (2002). MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059-3066.
S5. Castresana, J. (2000). Selection of conserved blocks from multiple alignment for their use in
phylogenetic analysis. Mol. Biol. Evol. 17, 540-552.
S6. Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with
thousands of taxa and mixed models. Bioinformatics 22, 2688-2690.
S7. Felsenstein, J. (1985). Confidence limits on phylogenies: an approach using the bootstrap.
Evolution 39, 783-791.