Current Biology, Volume 24

Supplemental Information

The Genome of the Foraminiferan

Reticulomyxa filosa

Gernot Glöckner, Norbert Hülsmann, Michael Schleicher, Angelika A. Noegel,

Ludwig Eichinger, Christoph Gallinger, Jan Pawlowski, Roberto Sierra, Ursula Euteneuer,

Loic Pillet, Ahmed Moustafa, Matthias Platzer, Marco Groth, Karol Szafranski,

and Manfred Schliwa

Inventory of Supplemental Information

Supplemental Figures

Figure S1. Correlation of prediction score, blast score, and transcriptional activity, Related to Table 1

Figure S2. Dynein and Actin phylogenies, Related to Table 2

Figure S3. Phylogenetic relationships of 27 SAR (Stramenopiles, Alveolata, and Rhizaria) and 10 outgroup species (Archaeplastida and Haptophyta) based on 115 genes, Related to Figure 1

Figure S4. Churchill domain occurrence in the eukaryote tree of life, Related to Figure 3

Supplemental Tables

Table S1. Assemblies of raw reads derived from 454/Roche and Illumina sequencing, Related to Table 1

Table S3. The most prominent Pfam domains in the R. filosa genome, Related to Table 1

Table S4. Signaling components in two Rhizaria and N. gruberi, Related to Table 3

Table S5. Proteins with kinesin domains in the R. filosa genome, Related to Table 2

Table S7. Rhizaria specific gene families with potential functions, Related to Figure 3

Table S8. Potential bacterial horizontal gene transfer into the R. filosa genome, Related to Figure 4

Table S9. Genes in the R. filosa genome with potential associations with photosynthesis, Related to Figure 4

Supplemental Experimental Procedures

Supplemental References

Supplemental Figures

Figure S1, related to Table 1: Correlation of prediction score, blast score, and transcriptional activity.

Predicted genes were grouped according to their prediction score calculated in geneid. The groups were

then evaluated for Blast matches against the complete protein RefSeq database at NCBI and

for matching reads in the RNAseq data set. The number of genes with a certain Blast

score (red) or prediction score (blue) is given on the Y-axis. The percentage of genes with Blast scores and

the percentage of transcribed genes for each score category are depicted on the secondary axis.
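The grouping described in this caption can be illustrated with a minimal sketch. It assumes each predicted gene is available as a tuple of (prediction score, has Blast match, has RNAseq reads); the bin width, the function name, and the toy data are hypothetical and are not taken from the original analysis.

```python
from collections import defaultdict


def summarize_by_score(genes, bin_width=10.0):
    """Group genes into prediction-score bins and report, per bin, the gene
    count and the percentage of genes with a Blast match / RNAseq support."""
    bins = defaultdict(lambda: {"n": 0, "blast": 0, "rna": 0})
    for score, has_blast_hit, has_rnaseq_reads in genes:
        lower = int(score // bin_width) * bin_width  # lower edge of the bin
        entry = bins[lower]
        entry["n"] += 1
        entry["blast"] += bool(has_blast_hit)
        entry["rna"] += bool(has_rnaseq_reads)

    summary = {}
    for lower in sorted(bins):
        entry = bins[lower]
        summary[lower] = {
            "genes": entry["n"],
            "pct_blast": 100.0 * entry["blast"] / entry["n"],
            "pct_transcribed": 100.0 * entry["rna"] / entry["n"],
        }
    return summary


# Hypothetical toy input: (geneid score, has Blast match, has RNAseq reads)
toy_genes = [
    (3.2, False, False),
    (7.9, True, False),
    (12.4, True, True),
    (18.0, True, True),
]
for lower_edge, stats in summarize_by_score(toy_genes).items():
    print(lower_edge, stats)
```

The per-bin gene counts correspond to the values on the primary Y-axis, and the two percentages to the values on the secondary axis.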

Figure S2, related to Table 2: Dynein and Actin phylogenies.

A: We found 10 different dynein proteins in the predicted gene set, all of which could be completely

reconstructed using available similarity information and transcript data. All dyneins could be grouped

into the nine defined categories, including the dynein members that are involved in flagellar motility.

The tree topology is based on an ML tree calculated with the LG model with estimated gamma

distribution parameters. Numbers at the branches indicate support from ML calculation/MrBayes

posterior probabilities/neighbor joining calculation using the Poisson model. Branches with asterisks

indicate differing topologies between the calculation methods.

B: ML tree of all available Foraminifera actins together with actins from selected species, calculated on an

alignment with Gblocks purging. The tree is rooted with an actin-like protein from B. natans. For better

readability, the branches are collapsed to R. filosa and Foram-specific clusters. Clusters consist of the

following taxa: Foram_1 (Tretomphalus 5, Rosalina 1, Bolivina 2, Bulimina 2, Ammonia 5, Rotaliina 3,

Elphidium 4, Globigerinella 1, Hyalinea 3, Stainforthia 1, Reophax 1, Nonionella 5, Globobulimina 4);

Foram_2 (Amphisorus 6, Marginopora 4, Miliolidae 4, Miliammina 2); Foram_3 (Bathysiphon 5,

Toxisarcon 1); Foram_4 (Allogromia 6, Bathysiphon 2, Edaphoallogromia 3); Foram_5 (Sorites 4,

Edaphoallogromia 2, Allogromia 3, Trochammina 2, Haynesina 4, Reophax 2); Foram_6 (Ammonia 2,

Tretomphalus 1, Hyalinea 1, Elphidium 2); Foram_7 (Bolivina 1, Bulimina 1); Foram_8 (Tretomphalus 1,

Bolivina 1); Foram_9 (Rosalina 1, Bolivina 1). The MrBayes tree based on the Gblocks alignment yielded

the same clusters of related proteins after 2 million iterations, but the chains did not converge further

after 350,000 steps. This method was thus unable to resolve the higher-order relationships of the

Foraminifera. The neighbor joining tree calculated with the JTT model also yielded results differing from

those shown. The general clustering of related sequences, however, was the same with all methods used.

In particular, the clustering of the H. sapiens, D. fasciculatum, A. thaliana, and N. gruberi proteins remained

stable. This analysis thus indicates high divergence of actin sequences in all Foraminifera genomes.

Figure S3, related to Figure 1: Phylogenetic relationships of 27 SAR (Stramenopiles, Alveolata

and Rhizaria) and 10 outgroup species (Archaeplastida and Haptophyta) based on 115 genes. The

position of R. filosa is highlighted. The tree was obtained as the highest scoring maximum likelihood tree

using the LG+Γ model and empirical amino acid frequencies. The numbers at nodes indicate the topological

support estimated by 1000 bootstrap replicates and Bayesian consensus posterior probabilities of

post-burn-in bipartitions. Solid circles represent maximum support.

Figure S4, related to Figure 3: Churchill domain occurrence in the eukaryote tree of life. The tree

represents the currently accepted phylogeny of eukaryotes. Dashed lines indicate uncertain relationships.

The occurrence of Churchill proteins is depicted as circles above the respective branches. Black filled

circles indicate presence in all or most currently known genomes; open circles indicate absence. The

Ecdysozoa in the Metazoa lineage lost this domain. Grey in Amoebozoa indicates sporadic occurrence (only

1 of 9 completely sequenced genomes contains this domain). SAR: Stramenopiles, Alveolata, and

Rhizaria.

Tables

Table S1, related to Table 1: Assemblies of raw reads derived from 454/Roche and Illumina

sequencing. The Illumina raw reads amounted to 22.9 GB and the 454 reads to 1.6 GB. The assemblies

with Newbler were done with the 454 reads only and the ABySS assemblies with those from the Illumina

sequencer. All other assemblies made use of both resources. Duplicon removal was done with in-house

software, while the diginorm assembly made use of a normalization procedure implemented in the

diginorm package (http://ged.msu.edu/papers/2012-diginorm/). The merged assembly is derived from a

manually curated merger of the Newbler assembly with the ABySS 500 assembly. This merged assembly

was used for further analysis.

Software | Threshold | Number of contigs | Total size of contigs | Mean length of contig
CLC | 200 | 149,873 | 114,102,752 | 761
ABySS | 200 | 126,521 | 119,674,360 | 946
ABySS | 500 | 75,748 | 102,121,417 | 1,348
Newbler | -- | 124,291 | 94,262,983 | 758
Newbler | 500 | 47,701 | 73,360,904 | 1,538
CLC with removal of duplicons | 200 | 64,080 | 94,361,471 | 1,473
CLC diginorm normalized reads | 200 | 66,089 | 91,058,782 | 1,378
Merged assembly | 500 | 45,292 | 100,460,861 | 2,215
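As a quick consistency check on Table S1, the mean contig length can be recomputed as total size divided by number of contigs. The sketch below does this for the first five rows of the table; the function name is hypothetical, and rounding to the nearest integer is an assumption about how the tabulated means were derived.

```python
# (software, threshold, number of contigs, total size of contigs) from Table S1
assemblies = [
    ("CLC", "200", 149_873, 114_102_752),
    ("ABySS", "200", 126_521, 119_674_360),
    ("ABySS", "500", 75_748, 102_121_417),
    ("Newbler", "--", 124_291, 94_262_983),
    ("Newbler", "500", 47_701, 73_360_904),
]


def mean_contig_length(n_contigs, total_size):
    """Mean contig length, rounded to the nearest integer (assumed rounding)."""
    return round(total_size / n_contigs)


for software, threshold, n_contigs, total_size in assemblies:
    print(software, threshold, mean_contig_length(n_contigs, total_size))
```

For these rows the recomputed means agree with the tabulated values (761, 946, 1,348, 758, and 1,538).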

Table S3, related to Table 1: The most prominent Pfam domains in the R. filosa genome.

# of detected Pfams | ID | IPR_number | name | GO
11464 | PF00400 | IPR019781 | WD40 repeat, subgroup |
964 | PF00515 | IPR001440 | Tetratricopeptide TPR-1 | Molecular Function: protein binding (GO:0005515)
598 | PF00069 | IPR017442 | Serine/threonine-protein kinase-like domain | Molecular Function: protein kinase activity (GO:0004672), Molecular Function: ATP binding (GO:0005524), Biological Process: protein phosphorylation (GO:0006468)
265 | PF01302 | IPR000938 | Cytoskeleton-associated protein, Gly-rich domain |
231 | PF02176 | -- | TRAF type zinc finger |
194 | PF00503 | IPR001019 | Guanine nucleotide binding protein (G-protein), alpha subunit | Molecular Function: signal transducer activity (GO:0004871), Biological Process: G-protein coupled receptor protein signaling pathway (GO:0007186), Molecular Function: guanyl nucleotide binding (GO:0019001)
193 | PF00071 | IPR013753 | Ras |
186 | PF05729 | -- | NACHT domain |
184 | PF00023 | IPR002110 | Ankyrin repeat | Molecular Function: protein binding (GO:0005515)
180 | PF01344 | IPR006652 | Kelch repeat type 1 | Molecular Function: protein binding (GO:0005515)
172 | PF00076 | IPR000504 | RNA recognition motif domain | Molecular Function: nucleic acid binding (GO:0003676)
113 | PF00443 | IPR001394 | Peptidase C19, ubiquitin carboxyl-terminal hydrolase 2 | Molecular Function: ubiquitin thiolesterase activity (GO:0004221), Biological Process: ubiquitin-dependent protein catabolic process (GO:0006511)
110 | PF01535 | IPR002885 | Pentatricopeptide repeat |
106 | PF00225 | IPR001752 | Kinesin, motor domain | Molecular Function: microtubule motor activity (GO:0003777), Molecular Function: ATP binding (GO:0005524), Biological Process: microtubule-based movement (GO:0007018)
99 | PF00805 | IPR001646 | Pentapeptide repeat |
96 | PF00091 | IPR003008 | Tubulin/FtsZ, GTPase domain | Cellular Component: protein complex (GO:0043234), Biological Process: protein polymerization (GO:0051258)
96 | PF00271 | IPR001650 | Helicase, C-terminal | Molecular Function: nucleic acid binding (GO:0003676), Molecular Function: helicase activity (GO:0004386), Molecular Function: ATP binding (GO:0005524)
95 | PF00169 | IPR001849 | Pleckstrin homology domain | Molecular Function: protein binding (GO:0005515)
94 | PF00036 | IPR018248 | EF-hand |
90 | PF00004 | IPR003959 | ATPase, AAA-type, core | Molecular Function: ATP binding (GO:0005524)

Table S4, related to Table 3: Signaling components in two Rhizaria and N. gruberi.

R. filosa | B. natans | N. gruberi | Pfam | domain description

Cyclic nucleotide
34 | 11 | 121 | PF00211 | Adenylyl cyclase class-3/4/guanylyl cyclase
39 | 14 | 7 | PF00233 | 3'5'-cyclic nucleotide phosphodiesterase,
75 | 56 | 10 | PF00027 | Cyclic nucleotide-binding domain

PIP signalling
40 | 32 | 15 | PF00454 | Phosphatidylinositol 3-/4-kinase, catalytic
31 | 24 | 8 | PF01504 | Phosphatidylinositol-4-phosphate 5-kinase,
14 | 5 | 12 | PF00613 | Phosphoinositide 3-kinase, accessory (PIK)
0 | 1 | 3 | PF00387 | Phospholipase C, phosphatidylinositol-specific,
0 | 2 | 6 | PF00388 | Phospholipase C, phosphatidylinositol-specific,

Calcium signalling
69 | 72 | 88 | PF00168 | C2 calcium-dependent membrane targeting
94 | 81 | 57 | PF00036 | EF-hand
22 | 26 | 37 | PF00122 | ATPase, P-type, ATPase-associated domain
3 | 22 | 23 | PF00612 | IQ motif, EF-hand binding site
10 | 20 | 12 | PF01699 | Sodium/calcium exchanger membrane region

Heterotrimeric G
154 | 40 | 60 | PF00503 | Guanine nucleotide binding protein (G-
82 | 34 | 231 | PF00615 | Regulator of G protein signalling

small G proteins
193 | 83 | 233 | PF00071 | Ras
9 | 2 | 10 | PF00616 | Ras GTPase-activating protein
49 | 8 | 29 | PF00617 | Guanine-nucleotide dissociation stimulator
8 | 1 | 15 | PF00618 | Ras-like guanine nucleotide exchange factor, N-
25 | 31 | 36 | PF00025 | ARF/SAR superfamily
20 | 17 | 8 | PF01412 | Arf GTPase activating protein
15 | 5 | 4 | PF01369 | SEC7-like
43 | 43 | 28 | PF00621 | Dbl homology (DH) domain
34 | 28 | 25 | PF00620 | Rho GTPase-activating protein domain
3 | 5 | 4 | PF02263 | Guanylate-binding protein, N-terminal
22 | 37 | 25 | PF01926 | GTP-binding domain, HSR1-related

Phosphate
598 | 353 | 463 | PF00069 | Serine/threonine-protein kinase-like domain
34 | 81 | 20 | PF07714 | Serine-threonine/tyrosine-protein kinase
50 | 60 | 46 | PF00149 | Metallo-dependent phosphatase
70 | 50 | 38 | PF03372 | Endonuclease/exonuclease/phosphatase
42 | 49 | 40 | PF00782 | Dual specificity phosphatase, catalytic domain
6 | 10 | 7 | PF00328 | Histidine phosphatase superfamily, clade-2
12 | 8 | 6 | PF00244 | 14-3-3 domain
11 | 43 | 11 | PF00498 | Forkhead-associated (FHA) domain

Histidine kinase
6 | 8 | 32 | PF00072 | Signal transduction response regulator,
4 | 2 | 27 | PF00512 | Signal transduction histidine kinase, subgroup

Sensors
1 | 9 | 61 | PF00989 | PAS fold
0 | 0 | 5 | PF08376 | Nitrate/nitrite sensing protein
0 | 11 | 5 | PF04940 | BLUF

Table S5, related to Table 2: Proteins with kinesin domains in the R. filosa genome. The R. filosa

genome contains 88 domains that match a kinesin domain architecture at least partly. A large number

of the kinesin domain-containing proteins are likely pseudogenes. Pseudogenization was assumed to

have happened if i) the domain was non-functional due to missing domain parts despite residing in an

apparently coding region, ii) the domain resided in an otherwise low-complexity region without splice

signals for proper excision of introns, iii) unspliced transcripts covered small (20-40 bases) predicted

intron regions, or iv) domains associated with transposon activity were present.

Domain ID | gene name | Protein length | localization of domain | length of kinesin domain | additional domain
52415_t | solexa3720325_2.r1.exp_7 | 1032 | 3' complete | 330 | SMC; 5' of kinesin
52423_t | contig21830_1.exp_5 | 746 | 3' complete | 290 | no
52476_t | solexa3169777_1.exp_3 | 292 | 3' complete | 230 | no
52408_t | contig09993_1.exp_11 | 650 | 3' complete | 360 | SMC; 5' of kinesin
52445_t | solexa3798899_2.r1.exp_12 | 306 | middle; complete | 255 | no
52446_t | solexa3758974_7.r1.exp_13 | 997 | 5' complete | 380 | SMC
52459_t | contig78665_1.exp_5 | 1025 | 5' complete | 400 | SMC
52402_t | contig26551_1.exp_7 | 726 | 5' complete | 260 | SMC
52417_t | contig18412_1.exp_2 | 692 | 5' incomplete at 5' | 300 | no
52436_t | contig30408_1.exp_1 | 1087 | middle; complete start at aa 460 | 340 | SMC; 5' of kinesin
52438_t | solexa3784960_1.f1.exp_2 | 649 | 3' complete; start at aa 320 | 220 | SMC; additional weak kinesin hit 5'
52434_t | solexa3778539_1.f1.exp_1 | 636 | 5' complete | 240 | no
52449_t | solexa3808122_1.exp_3 | 1177 | 3' complete; start at 630 | 360 | no
52428_t | contig24557_1.f1.exp_2 | 852 | middle; complete start at aa 150 | 280 | no
52404_t | solexa3725653_1.f1.exp_14 | 223 | 5' complete | 223 | no
52409_t | solexa3777916_1.exp_9 | 527 | middle; complete start at aa 75 | 315 | no
52433_t | solexa3746604_1.exp_2 | 419 | middle; complete start at aa 90 | 300 | no
52427_t | contig24185_1.f1.exp_6 | 1376 | 5' complete | 360 | UBQ (ubiquitin homologs)
52488_t | contig40075_1.exp_1 | 556 | 5' complete | 435 | no
52463_t | solexa1637608_1.r1.exp_3 | 905 | 5' complete | 220 | no
52412_t | solexa3718088_1.f1.exp_9 | 842 | 5' complete | 375 | SMC
52457_t | solexa3769552_1.exp_4 | 920 | 5' complete | 275 | SMC
52462_t | solexa3734740_1.f1.exp_2 | 240 | 3' incomplete at 3' | 195 | no
52431_t | contig25635_1.f1.exp_3 | 485 | middle complete | 335 | no
52448_t | contig39107_1.exp_8 | 292 | middle; complete | 250 | no
52456_t | contig75133_1.f1.exp_7 | 1167 | 5' complete | 265 | SMC; BAH
52442_t | solexa2602295_1.r1.exp_1 | 398 | complete | 280 | no
52405_t | solexa2052427_1.exp_1 | 1249 | middle incomplete at 5' start at aa 200 | 200 | reverse transcriptase domain
52425_t | contig23116_1.f1.exp_4 | 660 | 5' complete | 360 | no
52450_t | solexa1407312_1.exp_3 | 625 | 5' complete; start at aa 30 | 300 | no
52478_t | solexa3798613_1.exp_2 | 311 | 5' incomplete at 3' | 200 | no
52467_t | contig98232_1.exp_1 | 329 | start at 65; 3' incomplete | 210 | no
52468_t | contig96868_1.exp_1 | 1829 | middle incomplete at 3', starting at aa 250 | 250 | reverse transcriptase domain
52471_t | solexa3789411_1.f1.exp_1 | 578 | complete, starting at aa 100 | 265 | no
52458_t | solexa3753414_1.f1.exp_9 | 860 | 5' complete | 375 | SMC
52437_t | contig28647_1.exp_4 | 681 | 5' complete | 470 | SMC
52444_t | contig33457_1.exp_2 | 297 | complete | 290 | no
52464_t | contig92438_1.f1.exp_1 | 518 | 3' incomplete at 3' | 248 | no
52403_t | solexa3756928_1.r1.exp_16 | 714 | 5' complete; start at aa 40 | 240 | no
52461_t | solexa3777157_1.f1.exp_2 | 217 | incomplete at 5' and 3' | 217 | no
52406_t | solexa3746845_1.r1.exp_1 | 198 | 5' incomplete at 5'3 | 140 | no
52410_t | solexa3729075_1.exp_2 | 555 | 3' incomplete at 3'; start at aa 375 | 125 | no
52419_t | solexa902554_1.f1.exp_2 | 177 | 3' incomplete at 3' start at aa 35 | 110 | no
52420_t | solexa3746433_8.r1.exp_7 | 309 | xx
52421_t | solexa3724209_1.r1.exp_1 | 1036 | 5' incomplete at 5' | 75 | SMC
52424_t | contig22600_1.exp_2 | 698 | 5' incomplete at 5' | 140 | SMC
52429_t | solexa3808444_1.exp_4 | 146 | 5' incomplete at 5' and 3' | 95 | no
… | … | … | … 5' | 65 | no
… | … | … | … 3' | 140 | no
52455_t | solexa3781414_1.r1.exp_2 | 140 | no significant hit | xx | no
52465_t | contig93266_1.exp_3 | 266 | start at 70; 3' incomplete | 190 | no
52466_t | solexa2042653_1.exp_1 | 195 | Incomplete at 5' | 170 | no
… | … | … | … 3' | 70 | no
52479_t | solexa3740520_1.f1.exp_2 | 155 | 5' incomplete at 5' and 3' | 100 | no
52481_t | solexa3750744_1.r1.exp_1 | 189 | 5' incomplete at 3' | 135 | no
52483_t | solexa3783956_1.exp_1 | 239 | middle incomplete at 5' 3' | 30 | no
52489_t | solexa3788148_1.f1.exp_1 | 444 | 5' incomplete at 5' | 105 | SMC
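The four pseudogenization criteria given in the caption of Table S5 can be read as independent flags, any one of which marks a kinesin domain hit as a likely pseudogene. The following is a minimal sketch under that reading; the function and parameter names are hypothetical, and how each flag would be derived from the gene models and transcript data is not specified here.

```python
def is_short_intron(length_in_bases):
    """Criterion iii) concerns small predicted introns of 20-40 bases."""
    return 20 <= length_in_bases <= 40


def pseudogene_evidence(domain_truncated_in_coding_region,
                        low_complexity_without_splice_signals,
                        unspliced_transcript_over_short_intron,
                        transposon_domain_present):
    """Return the caption criteria (i to iv) that a domain hit satisfies.

    All four arguments are booleans supplied by the caller (hypothetical inputs).
    """
    checks = [
        ("i", domain_truncated_in_coding_region),
        ("ii", low_complexity_without_splice_signals),
        ("iii", unspliced_transcript_over_short_intron),
        ("iv", transposon_domain_present),
    ]
    return [label for label, satisfied in checks if satisfied]


def is_likely_pseudogene(*flags):
    """A hit is treated as a likely pseudogene if any criterion applies."""
    return bool(pseudogene_evidence(*flags))


# Hypothetical example: unspliced reads over a 32-base predicted intron,
# plus a transposon-associated domain on the same protein.
flags = (False, False, is_short_intron(32), True)
print(pseudogene_evidence(*flags), is_likely_pseudogene(*flags))
```

In the table, an entry such as "reverse transcriptase domain" in the additional-domain column would correspond to criterion iv) under this reading.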

Table S7, related to Figure 3: Rhizaria specific gene families with potential functions. The families were

clustered from the complete B. natans and R. filosa protein sets together with representatives from major

eukaryote branches as in Figure 3.

gene family | number of family members | number of R. filosa members | with pfam hits | without pfam hit | PFAM ID | PFAM short description
ORTHOMCL53 | 123 | 122 | 43 | 79 | PF01535 | PPR
ORTHOMCL1552 | 16 | 11 | 9 | 2 | PF00454 | PI3_PI4_kinase
ORTHOMCL1558 | 16 | 15 | 9 | 6 | PF00566 | TBC
ORTHOMCL2032 | 14 | 12 | 9 | 3 | PF00233 | PDEase_I
ORTHOMCL895 | 21 | 20 | 9 | 11 | PF07534 | TLD
ORTHOMCL1555 | 16 | 12 | 7 | 5 | PF00063 | Myosin_head
ORTHOMCL3823 | 10 | 9 | 7 | 2 | PF01433 | Peptidase_M1
ORTHOMCL4532 | 9 | 8 | 7 | 1 | PF00027 | cNMP_binding
ORTHOMCL3837 | 10 | 9 | 6 | 3 | PF00069 | Pkinase
ORTHOMCL6349 | 7 | 6 | 6 | 0 | PF00789 | UBX
ORTHOMCL1547 | 16 | 15 | 5 | 10 | PF00443 | UCH
ORTHOMCL2014 | 14 | 13 | 5 | 8 | PF00622 | SPRY
ORTHOMCL6255 | 7 | 6 | 5 | 1 | PF00076 | RRM_1
ORTHOMCL7649 | 6 | 5 | 5 | 0 | PF04488 | Gly_transf_sug
ORTHOMCL7668 | 6 | 5 | 5 | 0 | PF00520 | Ion_trans
ORTHOMCL823 | 22 | 21 | 5 | 16 | PF04969 | CS
ORTHOMCL3803 | 10 | 7 | 4 | 3 | PF02434 | Fringe
ORTHOMCL4556 | 8 | 7 | 4 | 3 | PF00632 | HECT
ORTHOMCL5411 | 7 | 4 | 4 | 0 | PF00397 | WW
ORTHOMCL6295 | 7 | 5 | 4 | 1 | PF00018 | SH3_1
ORTHOMCL3766 | 10 | 9 | 3 | 6 | PF03016 | Exostosin
ORTHOMCL416 | 32 | 26 | 3 | 23 | | unspecific
ORTHOMCL6427 | 6 | 4 | 3 | 1 | PF00023 | Ank
ORTHOMCL9342 | 5 | 4 | 3 | 1 | PF00515 | TPR_1
ORTHOMCL9388 | 5 | 4 | 3 | 1 | PF09229 | Aha1_N

Table S8, related to Figure 4: Potential bacterial horizontal gene transfer into the R. filosa genome. Predicted proteins initially found to be

related to bacteria were analysed further, and proteins with phylogenetically uncertain relationships were removed. Transcriptional activity was

assessed with the RNAseq data.

R. filosa ID | Contig length | B. natans ID | description | Score | e-value | Coverage/comment | Spliced; transcribed | Most closely related bacteria according to phylogeny

strong hits between RF, BN, and bacteria
contig26942_1.exp_8 | 6897 | jgi|Bigna1|89418|estExt_fgenesh1_pg.C_490023 | coagulation factor | 670 | 2.7e-97 | 59; SAR group | yes; yes | Sphingomonas
solexa2203579_1.exp_3 | 4876 | jgi|Bigna1|92553|estExt_fgenesh1_pm.C_310025 | succinate semialdehyde dehydrogenase | 880 | 5.1e-128 | 62 | yes; no | Leptospira
solexa3727020_1.f1.exp_1 | 3995 | jgi|Bigna1|133119|aug1.20_g7827 | DNA gyrase subunit B | 1095 | 7.1e-229 | 56 | yes; no | Rhodobacter
solexa3737624_1.f1.exp_1 | 3265 | jgi|Bigna1|53113|estExt_Genewise1Plus.C_150167 | alanyl-tRNA synthetase | 2118 | 2.7e-229 | 30 | yes; no | Magnetospirillum
solexa3788450_1.exp_5 | 7000 | jgi|Bigna1|58293|fgenesh1_pm.72_#_8 | S-(hydroxymethyl)glutathione dehydrogenase | 478 | 7.5e-44 | 60 | no; yes | Bacillus
solexa603061_1.f1.exp_1 | 3250 | jgi|Bigna1|50229|estExt_Genewise1.C_710017 | Pyridoxal biosynthesis lyase | 996 | 9.6e-99 | 148; Additionally in Galdieria | no; no | Chloracidobacterium

strong hit between RF and BN but not so to the bacterial gene
contig00256_1.f1.exp_3 | 4457 | jgi|Bigna1|52767|estExt_Genewise1Plus.C_110075 | ATP-dependent protease | 331 | 2.7e-44 | 54; Additionally in Perkinsus and Polysphondylium | yes; yes | Sphigbodium
contig38422_1.exp_1 and contig49540_1.exp_31 | 3002 and 18324 | jgi|Bigna1|92709|estExt_fgenesh1_pm.C_550014 | chaperonin GroEL | 537 | 4.2e-50 | 73; split between two contigs | yes; yes | Proteobacteria
contig75576_1.f1.exp_4 | 5549 | jgi|Bigna1|87362|estExt_fgenesh1_pg.C_190162 | medium-chain-fatty-acid--CoA ligase | 347 | 4.0e-47 | 70 | yes; yes | Euryarchaeota
contig84371_1.exp_5 | 7986 | jgi|Bigna1|92763|estExt_fgenesh1_pm.C_660005 | Alpha-glucosidase | 747 | 2.3e-72 | 45; B. natans and R. filosa do not cluster | yes; yes | Proteobacteria
solexa3356262_1.exp_7 | 9042 | jgi|Bigna1|140618|aug1.57_g15326 | aminopeptidase | 569 | 2.1e-145 | 142; bacteria and green algae | yes; no | Proteobacteria
solexa3669665_1.exp_9 | 9277 | jgi|Bigna1|136701|aug1.35_g11409 | hypoxanthine phosphoribosyl-transferase | 467 | 1.1e-42 | 76 | yes; no | Ruminococcus
solexa3725098_1.f1.exp_8 | 5179 | jgi|Bigna1|85465|estExt_fgenesh1_pg.C_40142 | 50S ribosomal protein L1 | 352 | 1.7e-30 | 80 | yes; no | Proteobacteria
solexa3727961_1.f1.exp_3 | 9094 | jgi|Bigna1|38800|e_gw1.28.155.1 | uracil-DNA glycosylase | 503 | 1.7e-46 | 68 | yes; yes | Firmicutes
solexa3758563_1.f1.exp_2 | 5026 | jgi|Bigna1|57898|fgenesh1_pm.35_#_10 | recombination factor protein | 1350 | 3.0e-136 | 31 | no; no | Bacteroidetes
solexa3788306_1.f1.exp_2 | 4401 | jgi|Bigna1|92730|estExt_fgenesh1_pm.C_600006 | glyceraldehyde-3-phosphate dehydrogenase | 901 | 1.1e-88 | 39 | yes; no | Proteobacteria

strong hits between RF and bacteria only
contig13569_1.f1.exp_4 and 6 | 6639 | None | citrate synthase | 572 | 8.2e-54 | 65 | yes; yes | Proteobacteria
solexa3725528_1.exp_5 | 12494 | jgi|Bigna1|53997|estExt_Genewise1Plus.C_270061 | ATP dependent protease | 384 | 6.9e-34 | 78 | yes; yes | Proteobacteria
solexa3775854_1.f1.exp_1 | 3565 | None | UDP-N-acetylglucosamine 1-carboxyvinyl-transferase | 1163 | 1.9e-116 | 31 | no; no | Proteobacteria
solexa3776817_1.f1.exp_3 | 17197 | None | homospermidine synthase | 1456 | 1.7e-147 | 82 | yes; yes | Proteobacteria

Table S9, related to Figure 4: Genes in the R. filosa genome with potential associations with

photosynthesis. Proteins needed for photosynthesis were searched for presence or absence in the R.

filosa genome and the transcriptomes of other Foraminifera. Transcriptomic sequences were translated

into amino acid sequences using TranSeq [S1] and searched against our custom database using BlastP [S2].

For each gene, homologous sequences with an e-value lower than 1e-25 were aligned using MAFFT [S3, S4]

and ambiguous positions were discarded using Gblocks [S5]. For each alignment, Maximum Likelihood

phylogenetic analysis was performed under the PROTCATLGF model in the RAxML-HPC software [S6]

and the reliability of internal branches was assessed using the RAxML rapid bootstrap method with 100

replicates [S7]. Transcriptional activity was assessed with the RNAseq data set.

Gene identifier | description | e-value | % homology | transcribed

Cellular component plastid
solexa3744453_1.f1.exp_9 | malate dehydrogenase | 8.37E-81 | 70.10% | yes
solexa2379099_1.exp_3/4 | (Mitochondrial) thylakoid carrier protein | 2.39E-35 | 56.90% | yes
solexa1334423_1.exp_4 | cell division cycle 5-like protein | 1.19E-120 | 69.15% | yes
contig78979_1.exp_5 | serine threonine-protein kinase afc3 | 1.96E-22 | 63.60% | yes
solexa2922822_1.exp_1-3 | Serine protease | 1.85E-62 | 60.25% | yes
contig43553_1.exp_1 | lim-type zinc finger-containing protein | 1.24E-33 | 63.05% | yes
contig43313_1.exp_2 | monoglyceride lipase | 5.46E-37 | 61.55% | yes
contig20009_1.f1.exp_7 | 3-ketoacyl- thiolase | 3.80E-148 | 71.95% | yes
solexa3788080_1.r1.exp_7 | prohibitin | 1.14E-103 | 75.90% | yes
solexa3713554_1.exp_3 | 40s ribosomal protein s9 | 2.76E-62 | 81.05% | yes
solexa3763975_1.f1.exp | ribosomal protein s11 | 4.88E-56 | 77.35% | Not predicted
solexa2695319_1.r1.exp_6 | 60s ribosomal protein l10 | 6.25E-95 | 75.60% | yes
solexa3734361_1.r1.exp_4 | beach domain-containing protein lvsa-like | 8.95E-44 | 73.45% | yes
solexa3736631_1.exp_8 | glycosyltransferase | 3.02E-121 | 69.75% | yes
contig98685_1.exp | uncharacterized protein | 1.37E-30 | 65.30% | Not predicted
solexa3751523_8.r1.exp | 40s ribosomal protein s13 | 7.30E-64 | 78.90% | no
contig92542_1.f1.exp_3 | 40s ribosomal protein s23 | 1.24E-72 | 84.80% | yes
solexa3769848_4.r1.exp_23 | adenosine kinase 2 | 2.33E-119 | 67.40% | yes
solexa3758350_1.exp_1 | aspartate aminotransferase | 1.37E-87 | 70.15% | no
solexa3761190_2.r1.exp_13 | 6-phosphogluconate dehydrogenase | 7.58E-173 | 75.40% | no
solexa3770195_1.exp_1 | ATP synthase beta subunit | 1.55E-122 | 79.05% | yes
solexa1549069_1.exp_6 | serine threonine-protein phosphatase 2a activator-like | 8.68E-41 | 72.70% | yes
solexa3752877_1.f1.exp_5 | ATP-dependent metalloprotease | 5.64E-58 | 66.00% | yes
solexa3786327_1.f1.exp_10 | RNA-helicase | 0 | 82.85% | yes
solexa3784907_1.f1.exp_5 | ruvb-like 1-like | 1.15E-86 | 78.55% | yes
solexa3798899_2.r1.exp_12/13 | kinesin-related protein klpa-like protein | 2.57E-63 | 69.30% | yes
solexa3763509_1.r1.exp_8 | phosphoglycerate mutase det1-like | 7.07E-29 | 57.40% | no
solexa3776615_1.f1.exp_3 | tyrosine-trna ligase | 1.69E-99 | 78.30% | yes
contig00135_1.exp_3/4 | cullin 4b | 2.83E-39 | 64.30% | yes
solexa3793453_1.exp_2 | tubulin alpha-1 chain | 2.99E-34 | 77.25% | yes
solexa3763454_1.f1.exp_8/9 | mitochondrial processing peptidase | 1.47E-101 | 61.60% | yes
solexa3138172_1.exp_4 | rossmann-fold NAD -binding domain-containing protein | 9.83E-12 | 59.60% | yes
solexa3713448_1.f1.exp_1 | aldehyde dehydrogenase | 7.16E-21 | 61.45% | yes
contig47692_1.f1.exp_8-12 | tetratricopeptide repeat protein | 6.16E-51 | 64.15% | no
contig20371_1.r1.exp_14 | 20s proteasome beta 6 subunit | 2.31E-12 | 62.05% | yes
solexa3718015_1.f1.exp_5 | h aca ribonucleoprotein complex subunit 3-like protein | 3.00E-18 | 76.00% | yes
contig80470_1.f1.exp_8 | DNA-directed RNA polymerase | 4.40E-77 | 62.30% | yes
solexa3713876_1.exp_1 | alpha-tubulin | 1.41E-25 | 83.65% | yes
solexa3727856_1.exp_1 | alpha tubulin | 7.41E-39 | 83.45% | yes
solexa3752366_1.exp_4 | beta-actin | 3.24E-33 | 94.30% | yes
solexa3734891_1.exp | alpha tubulin | 1.03E-34 | 84.65% | Not predicted
solexa3726936_1.exp_1/2 | ATP synthase cf1 beta subunit | 3.54E-120 | 79.70% | yes
contig24948_1.exp | citrate synthase i family protein | 8.28E-70 | 76.20% | Not predicted

Molecular function photosynthesis
contig77077_1.exp_9 | cytosolic fructose-1 6-bisphosphatase | 1.84E-95 | 76.45% | yes
solexa3778985_1.exp_3 | NADH dehydrogenase | 7.80E-69 | 81.70% | yes
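The e-value filter described in the caption of Table S9 (homologs with an e-value lower than 1e-25 are passed on to alignment) can be sketched as follows. The sketch assumes the BlastP results are available in the standard 12-column tab-separated tabular format, with the e-value in the eleventh column; whether the original analysis used this output format is not stated, and the function name and toy identifiers are hypothetical.

```python
import io
from collections import defaultdict

EVALUE_CUTOFF = 1e-25  # threshold stated in the Table S9 caption


def collect_homologs(blast_tabular, cutoff=EVALUE_CUTOFF):
    """Map each query ID to the subject IDs whose e-value is below the cutoff.

    blast_tabular is an iterable of lines in 12-column tabular BLAST format
    (assumed): query, subject, ..., e-value (column 11), bit score (column 12).
    """
    homologs = defaultdict(list)
    for line in blast_tabular:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue < cutoff and subject not in homologs[query]:
            homologs[query].append(subject)
    return dict(homologs)


# Hypothetical toy hits; identifiers and values are illustrative only.
toy_hits = io.StringIO(
    "geneA\tsubj1\t70.1\t300\t90\t0\t1\t300\t1\t300\t8.4e-81\t250\n"
    "geneA\tsubj2\t45.0\t120\t66\t0\t1\t120\t1\t120\t2.0e-12\t60\n"
    "geneB\tsubj3\t82.8\t500\t86\t0\t1\t500\t1\t500\t0.0\t900\n"
)
print(collect_homologs(toy_hits))
```

The retained subject sequences for each query would then be handed to the alignment, trimming, and tree-building steps named in the caption.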

Supplemental Experimental Procedures

Purification from contaminating species

Cell bodies were taken from the culture and washed twice with fresh commercial table water (Volvic).

These bodies were then transferred to 10 cm petri dishes containing Volvic water and incubated

for 3 days at room temperature. Thereafter the cells were harvested by centrifugation and reincubated

overnight in Volvic water containing PSN antibiotic mixture (Life Technologies).

Assessment of contamination with foreign DNA

All contig sequences were screened for potential contaminating sequences from common

freshwater bacteria, protozoa, or higher plants (the food source) using BLAST against the protein RefSeq

library from NCBI. No contamination by a freshwater species could be detected this way, although we

noticed the presence of a genome from a Rickettsia-like species with only half the coverage of the

nuclear genome. A further test for the successful purification of nuclei was the screen for likely

mitochondria-derived sequences. We found only one raw read that likely stemmed from the

mitochondrial genome.

Search for genes involved in photosynthesis

We searched the genome for genes with photosynthetic affinity using a database of genes derived

from whole genomes of cyanobacteria, the genome of the secondarily phototrophic B. natans, and the

Arabidopsis nuclear genome. Additionally, transcriptomic data were annotated using Blast2GO [43] and

45 sequences were identified as putatively related to chloroplast maintenance or activity (Table S9).

These sequences were then compared to a comprehensive prokaryotic and eukaryotic database and

rigorous phylogenetic analyses were conducted on each gene separately to identify its precise origin.

While in some cases (10/45), the number of homologous sequences was too low to perform

phylogenetic analyses, all other sequences (35/45) had their highest affinities with other Foraminifera

and/or Rhizaria, which suggested they did not originate from contaminant organisms.

Supplemental References

S1. Rice, P., Longden, I., and Bleasby, A. (2000). EMBOSS: The European Molecular Biology Open

Software Suite. Trends Genet. 16, 276-277.

S2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic Local Alignment

Search Tool. J. Mol. Biol. 215, 403-410.

S3. Katoh, K., Kuma, K., Toh, H., and Miyata, T. (2005). MAFFT version 5: improvement in accuracy of

multiple sequence alignment. Nucleic Acids Res. 33, 511-518.

S4. Katoh, K., Misawa, K., Kuma, K., and Miyata, T. (2002). MAFFT: a novel method for rapid multiple

sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059-3066.

S5. Castresana, J. (2000). Selection of conserved blocks from multiple alignment for their use in

phylogenetic analysis. Mol. Biol. Evol. 17, 540-552.

S6. Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with

thousands of taxa and mixed models. Bioinformatics 22, 2688-2690.

S7. Felsenstein, J. (1985). Confidence limits on phylogenies: an approach using the bootstrap.

Evolution 39, 783-791.
