Blandine Trouche 5 th year Biochemical Engineering Computational Biology INSA Toulouse Application of bioinformatics tools to microbial ecology 22/02/2016 to 05/08/2016 Department of Microbiology and Immunology, University of Otago, Dunedin, New Zealand Supervisors: Sergio Morales Claire Moulis


Page 1: Application of bioinformatics tools to microbial ecologypagesperso.univ-brest.fr/~maignien/doc/Blandine_MSc_report.pdf · Application of bioinformatics tools to microbial ecology

Blandine Trouche 5th year Biochemical Engineering –

Computational Biology INSA Toulouse

Application of bioinformatics tools to microbial ecology

22/02/2016 to 05/08/2016

Department of Microbiology and Immunology, University of Otago, Dunedin, New Zealand

Supervisors:

Sergio Morales

Claire Moulis


I would like to thank Mr Sergio Morales for welcoming me into his team at the Department of Microbiology and Immunology of the University of Otago, New Zealand. This was a great opportunity for me, and he managed to be helpful while still giving me autonomy and freedom. Thank you also to Mr Federico Baltar and his team for trusting me with their project. Thanks to Ms Xochitl Morgan and Mr Ambarish Biswas for their advice and help with technical issues. And finally thank you to Mr Matthew Highton, Ms Rachel Kaminsky and Mr Sainur Samad for making me feel welcome in the lab.


Table of abbreviations:

- NGS: next generation sequencing

- psu: practical salinity unit. 1 psu = 1 g of salt per kg of sea water. Most ocean water has a salinity between 34 and 35 psu.

- STW: subtropical waters

- SAW: sub-Antarctic waters

- STF: subtropical front

- HNLC: High Nutrient Low Chlorophyll

- OTU: Operational Taxonomic Unit


ABSTRACT

I- GENERAL INTRODUCTION TO THE INTERNSHIP

1) HOSTING ESTABLISHMENT: MORALES LAB, UNIVERSITY OF OTAGO
2) AREA OF WORK: MICROBIAL ECOLOGY
3) PROJECTS WORKED ON

II- BINNING OF OCEAN SAMPLES

1) BIBLIOGRAPHY
a. The Munida transect
b. Binning: a new approach

2) MATERIAL & METHODS
a. Samples
b. Coassembly of the reads and mapping
c. Binning
d. Quality check and taxonomic assignment of the bins

3) RESULTS AND DISCUSSION

III- SOIL PROJECTS

1) STUDYING SOIL COMMUNITIES OF THE SOUTH ISLAND OF NEW ZEALAND
a. Introduction to the project
b. Using GraPhlAn to present the results

2) COMPARING MICROBIAL COMMUNITIES FROM DIFFERENT ENVIRONMENTS IN PAPUA NEW GUINEA
a. Introduction to the project
b. Using GraPhlAn and step plots to present the results

CONCLUSION

BIBLIOGRAPHICAL REFERENCES

ANNEX: BINNING ON BB1 SERVER


Abstract

Microbial ecology deals with the relationship between microorganisms and their environment. In many types of ecosystems such as ocean or soil samples, most of the microorganisms present are unknown or poorly characterized. The recent development of new biotechnologies such as Next Generation Sequencing enables scientists to generate bigger and bigger amounts of data, creating a need for tools capable of handling and representing these data. During this internship, I have focused on troubleshooting new bioinformatics tools for different microbial ecology projects.

I first attempted to apply binning to sequencing data from ocean samples. Binning is a relatively new technique that aims at grouping reads based on certain characteristics of the data such as coverage and GC content, in order to recover complete or partial genomes. This method was applied to two sets of data with different technical characteristics. Though the results of this particular experiment were not very conclusive, due to limitations of both the data and the tools used, this technique could be useful on different projects, or in the future once it has been further developed.

I also collaborated on two projects studying soil microbial communities, one in the South Island of New Zealand, the other in Papua New Guinea. The main tool used was GraPhlAn, a graphical tool that draws phylogenetic trees and associates taxa with chosen features. Having mastered the use of the tool, I was able to design the clearest possible figures to illustrate the points being made by the team.

Environmental microbiology is concerned with the relationships between microorganisms and their environment. In certain types of ecosystems such as the ocean or soil, most of the microorganisms present are little known or unknown. The recent development of new biotechnologies such as next generation sequencing allows scientists to generate ever more data, creating a need for tools capable of processing these data. During this internship, I looked at different bioinformatics tools and their application to three microbial ecology projects.

I first applied the binning technique to sequencing data from marine samples. Binning is a relatively recent technique that seeks to group sequences (reads) based on certain properties such as GC percentage or depth, in order to obtain complete or partial genomes. This method was applied to two sets of samples with different characteristics. The results of this experiment were not as conclusive as hoped, because of limitations in the data but also limitations intrinsic to the tools used. However, this technique could prove useful for other types of project or after further development.

I also helped with two projects studying the characteristics of soil microbial communities, one from the South Island of New Zealand, the other from Papua New Guinea. The main tool used was GraPhlAn, a graphical tool for drawing a phylogenetic tree with given properties associated with it. After mastering this tool, I was able to use it to illustrate the results of these two teams as clearly as possible.


I- General introduction to the internship

1) Hosting establishment: Morales Lab, University of Otago

During this internship I was welcomed into Sergio Morales’s team, at the Department of Microbiology and Immunology of the University of Otago.

The University is located in the city of Dunedin, in the South of New Zealand. Founded in 1869, it is the oldest university in the country. Originally authorized to grant degrees in Arts, Medicine, Law and Music, today it welcomes around 20,000 students enrolled in over 190 undergraduate and postgraduate programmes in a number of fields. Although its main campus is based in Dunedin, the university also has campuses in Wellington, Christchurch, Invercargill and a teaching centre in Auckland.

The University of Otago is structured in divisions, which are then subdivided into Departments. The Department of Microbiology and Immunology belongs to the division of Health Sciences, and is a teaching place as well as a research facility. It provides undergraduate courses and opportunities for Master’s degrees and PhDs, which means that research activities are usually an important part of a student’s curriculum. Numerous students are hosted in the different labs of the department and conduct research projects over the course of a semester or year.

The Department itself is divided into labs bearing the name of the principal researcher. There are 20 of these labs at the moment, focusing on various subjects in Microbiology, Immunology and Virology. I have been working in the Morales Lab, which is under the direction of Sergio Morales, microbiologist and teaching fellow. Its other members are Ambarish Biswas (postdoctoral fellow), Matthew Highton, Rachel Kaminsky and Sainur Samad (PhD students).

The lab’s work is dedicated to microbial ecology: analysing environmental communities to understand the behaviour of microorganisms and their impact on an ecosystem, with the long-term aim of exploiting the properties discovered or predicting how an ecosystem will react to environmental changes. For example, the research areas of the lab include the community structure of environmental samples, nitrogen cycling and greenhouse gas emissions in soils, and primary productivity in the ocean.

To achieve these research goals, the Morales lab works closely with the Morgan lab from the same department and with Federico Baltar’s team from the Marine Science Department. Mr Baltar collaborates on the marine microbiology projects: sampling and lab work are often carried out by his team, while the analysis of the metagenomics data is carried out by Mr Morales’ lab. Ms Morgan also works on metagenomics projects, in either microbial ecology or the human microbiome, and is very skilled in computational biology, which makes exchanges between the labs crucial.

2) Area of work: microbial ecology

The evolution of microbiology, and particularly microbial ecology, has been strongly correlated with the development of technologies such as sequencing.

The study of genomics started with the sequencing of cultivated bacteria. With the emergence of more global approaches such as 16S rRNA sequencing, scientists realized that the cultivation approach missed the majority of microorganisms present in an environmental sample, because they could not be easily cultivated.


Thus, with the development of NGS (next generation sequencing), metagenomics proved particularly well suited to the study of microbial ecology. By taking a sample and sequencing the whole community, scientists were able to take a snapshot of a microbial community in a culture-independent way and gain valuable insights into ecosystems: community diversity, functions present, microbial richness…

Still, a lot of biomes are very far from being elucidated, and this is particularly true in the oceans where the conditions are quite unlike anywhere else but microorganisms can still be found almost everywhere. It is thus very interesting to study them in order to understand how they adapt to a particular environment, and possibly discover useful new metabolic functions.

New Zealand is well suited to this type of study: it is an island country, which gives it many different water masses to survey, and a large part of its land is devoted to agriculture and livestock farming, practices that have important impacts on soil properties. Both these ecosystems are investigated in the Morales Lab. While my focus during this internship has mostly been on a marine microbiology project, I have also had the opportunity to collaborate on two other projects relating to soil ecosystems.

3) Projects worked on

With NGS enabling relatively easy and quick generation of impressive datasets, appropriate analysis tools have had to be developed. Binning is a rather recent method that endeavours to recover partial or complete genomes from metagenomic datasets by grouping reads based on different characteristics such as GC content or coverage.

Most of my work during this internship consisted in learning about this method, and choosing from the various tools available to apply binning to the Munida transect project (ocean samples).

I also had the opportunity to learn about other techniques by helping out on two soil sample projects relying on 16S analysis.


II- Binning of ocean samples

1) Bibliography

a. The Munida transect

Oceanic fronts are transition zones that appear where two water masses of different properties come into contact. They present important shifts in physico-chemical properties as well as microbial community structure.

The Subtropical Front is the front separating the warm and nutrient-poor Subtropical Waters (STW) from the cold, nutrient-rich sub-Antarctic Waters (SAW). This front follows the South-East coast of New Zealand, and due to the presence of the continental shelf, it is constricted there to a width of 2-10 km about 40-50 km offshore. This makes Dunedin a perfect place to study it, and sampling campaigns have been routinely carried out, along the line of the Munida Time Series transect represented on the following map (Fig. 1).

Though this site has been continuously sampled for almost two decades for physicochemical parameters, few studies have focused on the changes in microbial communities across the front.

In the context of this study, sampling took place on the 28th of January 2014 at 8 different stations along the transect. Two samples were collected from each station at a depth of 2 m and an additional two were taken from station 8 at a depth of 500 m (Baltar, Currie, Stuck, Roosa, & Morales, 2016). These samples were analysed for physicochemical parameters, microbial abundance and also microbial community composition using 16S rRNA gene sequencing.

Fig. 1 Map of the oceanographic settings around the South Island of New Zealand and position of the Munida Time Series transect (Baltar et al., 2016)


We can see in Fig. 2 that the actual front seems to be located between stations 3 and 5, as shown by the sharp gradient both in salinity and temperature, with stations 1 through 3 being the Subtropical waters and stations 5 to 8 the sub-Antarctic waters, the latter ones being, as expected, colder and less salty than the first ones. The sharp rise in salinity between stations 1 and 2 is due to the prevalence of neritic waters at station 1, and highlights the influence of riverine waters.

It has been observed that the sub-Antarctic Waters are HNLC (High Nutrient Low Chlorophyll) waters, while the opposite is true for Subtropical Waters. Chlorophyll-a level is an indicator of the abundance of phytoplankton, and it has been shown that these low levels in SAW are due to a lack of micronutrients, particularly iron, which hinders phytoplankton growth.

Like the total levels of phytoplankton, the abundance of specific microorganisms varies between the different water masses (Fig. 3). The community composition changes as one moves across the transect, suggesting that the front might act as an ecotone, an ecological interface between two different ecosystems. We can also see that there is an important difference in community composition between the surface and deep samples.

Fig. 3 Changes in bacterioplankton community composition at phylum level along the transect for the 8 surface samples and also the deep sample at station 8 (Baltar et al., 2016)

Fig. 2 Evolution along the transect of temperature and salinity (psu) (Baltar, Stuck, Morales, & Currie, 2015)


My input to this existing project was to apply a new bioinformatics technique called binning to the samples to try to gain further information.

Binning endeavours to group reads by relying on different characteristics such as GC percentage, coverage… in order to recover partial or complete genomes. This can then lead to assembly of the groups obtained, or study of the functions present in each group to test different hypotheses. Since a lot of the microorganisms present in environmental samples such as these ocean water samples are unknown, this method could help gain interesting insights. Here the idea was to try to identify specific functions and link them with the presence or absence of the microorganisms in certain water masses, at certain times of the year…

b. Binning: a new approach

As stated above, binning aims at recovering partial genomes from a metagenomic dataset by grouping reads based on certain characteristics (GC content, coverage…). In recent years, a number of tools have been developed to carry out this technique, taking into account different characteristics of the data, relying on different algorithms, automating the process or not…

One of the first ways to classify these tools is to determine whether they carry out supervised or unsupervised binning. Binning is said to be supervised when it relies on a reference database or a prior training set to cluster reads into groups. This will apply for example to tools that use a homology-based search against a database to classify the data, which is one of two traditional approaches used in binning, the other one focusing on sequence composition (Table 1).

Table 1 The two traditional binning approaches and their limitations.

Sequence composition (GC content, tetra-nucleotide frequency…):

- Relatively long sequences needed

- High taxonomic level assignment

- Prior training sequences must exist to give taxonomic assignment

Homology-based search against a database:

- Computationally heavy and time consuming

- Depends on a database

Both these common approaches have specific limitations.

In the case of a homology-based search, it appears clear that when dealing with very large datasets, this technique will be very time-consuming and require a lot of resources in order to align all the different reads to a database. It also assumes that all microorganisms present in the sample have close relatives that have been sequenced. On the other hand, this method does not depend on the size of the reads for its efficiency, whereas it will be an issue with approaches relying on sequence composition to sort the reads.

Indeed, when computing GC content or tetra-nucleotide frequency in reads, the size of the reads will impact the representativeness of the results. Usually, sequences are required to be at least 800 bp. This approach is also less precise than an alignment to a database and will usually not give as precise a taxonomic assignment. Finally, even if the binning method itself does not depend on a database, prior training sequences such as marker genes will be needed to assign a taxonomy to the groups.
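To make the composition features concrete, here is a minimal Python sketch of how GC content and tetra-nucleotide frequencies can be computed for one sequence. It is an illustration only, not the code of any of the tools discussed in this report; the function names are hypothetical, and real tools typically add refinements (for example merging reverse-complement 4-mers) that are left out here.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_frequencies(seq: str) -> dict:
    """Relative frequency of each of the 256 possible 4-mers,
    counted over overlapping windows; windows containing
    ambiguous bases (e.g. N) are skipped."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + 4]
        for i in range(len(seq) - 3)
        if set(seq[i:i + 4]) <= set("ACGT")
    )
    total = sum(counts.values())
    kmers = ("".join(p) for p in product("ACGT", repeat=4))
    return {k: (counts[k] / total if total else 0.0) for k in kmers}
```

A 150 bp read contains only 147 overlapping 4-mer windows for 256 possible 4-mers, so most frequencies are zero or very noisy; this is one way to see why composition-based methods need long sequences.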

Binning tools that have been developed in recent years tend to go more and more towards unsupervised and automated methods, often relying on composition and/or coverage (abundance profile).

For the latter, the assumption is that since each microorganism is present in the sample in a fixed quantity, all the reads coming from a given microorganism should be equally abundant, even though sequencing splits its genome into small fragments (reads). Computing an abundance profile is then a way to sort reads into bins, each representative of one microorganism. Table 2 below presents different binning tools and their characteristics.

Table 2 Different binning tools and their characteristics.

| Software | Publication year | Approach | Algorithm |
|---|---|---|---|
| MetaBin¹ | 2012 | Homology-based search | Blat relying on ORFs |
| GroopM | 2014 | Coverage | - |
| Concoct | 2014 | Composition and coverage | Gaussian mixture model |
| MetaBAT | 2015 | Composition and coverage | K-medoid clustering on probabilistic distances |
| MaxBin 2.0 | 2015 | Composition and coverage | Expectation-Maximization |
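The coverage assumption described before Table 2 can be illustrated with a toy Python sketch: contigs whose mean coverage is similar in every sample are grouped together. This is a deliberately naive greedy grouping with a hypothetical tolerance parameter, not the algorithm of GroopM or of any other tool in the table, which rely on proper statistical models.

```python
def mean_coverage(mapped_bases: int, contig_length: int) -> float:
    """Mean depth of a contig: bases of mapped reads divided by contig length."""
    return mapped_bases / contig_length


def similar_profiles(a, b, tolerance=0.25):
    """True if the two coverage profiles differ by at most `tolerance`
    (relative to the larger value) in every sample."""
    return all(abs(x - y) <= tolerance * max(x, y) for x, y in zip(a, b))


def group_by_coverage(profiles: dict, tolerance=0.25):
    """Greedy grouping: each contig joins the first group whose seed
    profile it resembles, otherwise it starts a new group."""
    groups = []  # list of (seed_profile, [contig names])
    for name, profile in profiles.items():
        for seed, members in groups:
            if similar_profiles(seed, profile, tolerance):
                members.append(name)
                break
        else:
            groups.append((profile, [name]))
    return [members for _, members in groups]
```

With several samples, each contig has one coverage value per sample, which is why tools based on this principle benefit from many samples sequenced to a good depth.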

The team that developed the MetaBAT binning tool (Kang, Froula, Egan, & Wang, 2015) has carried out benchmarking on several automated binning tools². To do this, they used a metagenomics dataset from the MetaHIT consortium (Metagenomics of the Human Intestinal Tract).

They first selected 290 reference genomes from NCBI that were present in the MetaHIT samples at over 5x mean coverage. These genomes were then shredded into contigs of random sizes over 2.5 kb. The distribution of these contigs and their abundance were obtained from the real data. They then performed binning on this “error-free” dataset using MetaBAT, Canopy (Nielsen et al., 2014), CONCOCT v0.4.0 (Alneberg et al., 2014), GroopM v0.3.0 (Imelfort et al., 2014) and MaxBin v.1.4.1 (Y.-W. Wu, Tang, Tringe, Simmons, & Singer, 2014). A bin was considered acceptable if it had over 90% precision (equivalently, under 10% contamination) and over 30% recall (completeness). These values were ascertained using the reference genomes, and the bins meeting both criteria are called genomes. A summary of the results can be found in Table 3.
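As an illustration of these two criteria, the following Python sketch scores one bin against reference genomes. It assumes that precision and recall are measured in bases relative to the genome contributing most to the bin; the exact definition used in the benchmark is not given in this report, and the function and variable names are hypothetical.

```python
def bin_precision_recall(bin_bases_by_genome: dict, genome_sizes: dict):
    """Score a bin against its dominant reference genome.

    bin_bases_by_genome: number of bases in the bin originating from each genome.
    genome_sizes: total size in bases of each reference genome.
    """
    total = sum(bin_bases_by_genome.values())
    dominant = max(bin_bases_by_genome, key=bin_bases_by_genome.get)
    matched = bin_bases_by_genome[dominant]
    precision = matched / total               # share of the bin that is "pure"
    recall = matched / genome_sizes[dominant]  # share of the genome recovered
    return dominant, precision, recall


def is_acceptable(precision, recall, min_precision=0.9, min_recall=0.3):
    """Thresholds quoted in the text: over 90% precision, over 30% recall."""
    return precision > min_precision and recall > min_recall
```

For example, a bin holding 1.9 Mb from a 4 Mb genome plus 0.1 Mb from another genome has a precision of 0.95 and a recall of 0.475, and would count as a "genome" under these thresholds.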

¹ (Sharma, Kumar, Prakash, & Taylor, 2012)

² https://bitbucket.org/berkeleylab/metabat/wiki/Benchmark_MetaHIT


In this experiment, MetaBAT was the tool that identified the most “genomes”. It was also the most computationally efficient tool, with a processing time under 14 min and a peak memory usage under 4 GB. Results from GroopM highlight an important difference between the number of bins identified and the number of bins with high enough completeness and precision to be considered representative of genomes. It is also interesting that, for all of the tools, the number of genomes identified with this error-free dataset is less than half the number of actual genomes that were used to create the dataset.

Following this first experiment, the tools were then tested on a real metagenomics dataset. All the sequences from the samples from the MetaHIT dataset previously used were pooled and assembled. This time the cut-off chosen for the contig size was 1.5 kb to allow for the presence of shorter contigs in real metagenomics assemblies. Since there were no reference genomes to ascertain the quality of the bins in this case, the tool CheckM (Parks, Imelfort, Skennerton, Hugenholtz, & Tyson, 2015) was used. CheckM is a companion tool of GroopM, but can be applied to bins from any tool or software. Using pre-defined marker gene sets, it will assess the completeness and contamination of the bins. A marker gene is supposed to be found in a single copy in a complete genome, and the whole set of marker genes should be present in a complete genome. This means that bins possessing more than one copy of a given marker gene will be considered contaminated, while depending on the number of genes from the set identified in the bin, it will be judged as more or less complete.
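The marker gene reasoning above can be sketched in a few lines of Python. This is a simplified illustration under the assumption that every marker carries the same weight; CheckM itself uses lineage-specific marker sets and a more elaborate calculation, and the marker names used in the usage note are hypothetical.

```python
def marker_based_quality(marker_counts: dict, marker_set: set):
    """Simplified completeness and contamination estimates, in percent.

    marker_counts: number of copies of each marker gene found in the bin.
    marker_set: single-copy marker genes expected in a complete genome.
    """
    n = len(marker_set)
    # Completeness: share of expected markers seen at least once.
    found = sum(1 for m in marker_set if marker_counts.get(m, 0) >= 1)
    # Contamination: copies beyond the first, relative to the set size.
    extra = sum(max(marker_counts.get(m, 0) - 1, 0) for m in marker_set)
    return 100.0 * found / n, 100.0 * extra / n
```

With a set of four markers m1 to m4, a bin containing one copy of m1 and m3 and two copies of m2 is estimated at 75% completeness (three of four markers found) and 25% contamination (one extra copy).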

Fig. 4 Performance of five different binning tools on a real metagenomic dataset. The number of bins identified (X-axis) by each tool (Y-axis) is presented for 0.9 precision (lack of contamination), while the gradient in colour highlights the recall level (completeness) (Kang, Froula, Egan, & Wang, 2015).

Table 3 Binning performance of five different tools on synthetic metagenomic assemblies (Kang, Froula, Egan, & Wang, 2015).


Once again, MetaBAT was the tool that identified the most genomes (Fig. 4), and was also quite efficient computationally (Table 4). It also found many more bins with 0.9 precision and 0.9 recall than all the other tools. GroopM is again the least efficient tool though it appears to perform comparatively better on this real dataset than on the synthetic one.

With the help of this benchmark and considering the specifications of my project (type of data, technical possibilities, reproducibility…), I selected several binning tools to use on the Munida transect data: GroopM, MaxBin 2.0 (Y. W. Wu, Simmons, & Singer, 2015) and MetaBAT.

2) Material & Methods

Throughout this project, I worked on the Unix command line, either on my own computer or on the lab server.

a. Samples

Two sets of samples were analysed for this project: the Munida Time Series transect samples, and three samples from the Tara Oceans project.

The Munida transect samples were obtained on the 28th of January 2014 at 8 different stations along the transect. Two replicates were taken from each station, and an additional two from station 8 at a depth of 500 m. For replicates from each water mass (subtropical, sub-Antarctic and sub-Antarctic at 500 m depth), libraries were generated using Nextera XT and sequenced (2x150 bp) on an Illumina MiSeq.

The second set of samples was taken from the Tara Oceans project database. Tara Oceans is a large-scale initiative that aimed at better characterizing the global oceanic planktonic ecosystem, from viruses to zooplankton. During the sampling campaign between 2009 and 2013, over 200 oceanic stations were sampled at several depths. The samples were then sent to partner labs where they were analysed, and notably shotgun sequenced with either Illumina HiSeq 2000 or 454 GS FLX Titanium. The results of these analyses are freely available on the internet.

To apply binning one station was selected, station 23 sampled between 5 and 7m below the surface, and sequence files for the appropriate microorganism size were retrieved. They can be found by following this link: https://www.ebi.ac.uk/metagenomics/projects/ERP001736/samples/ERS477979

The contig file obtained by the Tara team was also retrieved here: http://www.ebi.ac.uk/ena/data/view/CENN01000001-CENN01195903

Table 4 Binning performance of five different tools on a real metagenomics dataset (Kang, Froula, Egan, & Wang, 2015).


b. Coassembly of the reads and mapping

Coassembly of the Munida samples was performed using MetaVelvet with default parameters and a hash length of 51. For the Munida as well as the Tara samples, the reads were then mapped to the corresponding contigs using bwa and samtools with default parameters.
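The mapping step can be sketched as follows. The report does not state which bwa algorithm or which samtools subcommands were run, so the use of `bwa mem`, the `samtools sort -o` syntax (which assumes a recent samtools release) and all file names are assumptions made for illustration. The Python function below only builds the command lines and does not execute them.

```python
import shlex


def mapping_commands(contigs, reads_1, reads_2, prefix):
    """Return shell command lines (not executed) that index the contigs,
    map paired-end reads to them and produce a sorted, indexed BAM file."""
    q = shlex.quote  # protect file names containing spaces or special characters
    bam = f"{prefix}.sorted.bam"
    return [
        f"bwa index {q(contigs)}",
        f"bwa mem {q(contigs)} {q(reads_1)} {q(reads_2)} | samtools sort -o {q(bam)} -",
        f"samtools index {q(bam)}",
    ]
```

Calling `mapping_commands("contigs.fa", "S1_R1.fastq", "S1_R2.fastq", "S1")` returns three command lines that can be inspected before being run in a shell. One sorted BAM file per sample is the kind of input the binning tools use to compute coverage.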

c. Binning

Three binning tools were used for this study: GroopM, MetaBAT and MaxBin 2.0.

They were first tested on a synthetic metagenomics dataset used in the GroopM publication that can be found here: https://github.com/minillinim/GroopM_test_data/tree/master/synthetic_metagenome The contig file and BAM files provided were used.

d. Quality check and taxonomic assignment of the bins

The quality of the bins was determined by applying CheckM to the bins, using the lineage command with default parameters. Taxonomic assignment of the bins was performed using Kraken with the reduced database, and the SSU finding tool in CheckM.

3) Results and Discussion

After the different tools described above had been installed and tested, the pipeline was applied to the Munida samples in four steps: assembly and mapping, binning, quality check and finally taxonomic assignment. Out of the three binning tools used, only two actually identified bins: MetaBAT did not produce any. MaxBin found seven bins, while GroopM found sixteen. Table 5 and Table 6 present the summary of the quality of the bins obtained after running them through CheckM.

It is clear from Table 5 that the results from GroopM are not trustworthy: contamination is estimated at over 100 for nearly half of the bins. This is in itself impossible since contamination is supposed to be expressed as a percentage. It was hypothesized that GroopM was not sensitive enough to differentiate between the reads and that bins were formed at random. This is supported by the fact that this tool relies only on coverage to identify bins and the Munida samples were not sequenced very deeply. It can also be noted that the manual step that is offered in GroopM’s workflow was not used here.

As for the results from MaxBin, no bin satisfied both the criteria used in the presented publications of over 70% completeness and under 30% contamination. Still, an attempt at taxonomic assignment of the bins was made using Kraken, and the SSU finder option present in CheckM. The latter one failed to identify any 16S or 18S gene in the bins. In the same way, Kraken was unable to assign a taxonomy to the bins with reliability.
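Applying such quality criteria to a CheckM summary amounts to a simple filter, sketched below in Python. The thresholds are the ones quoted above (over 70% completeness, under 30% contamination); the function name is hypothetical and the values in the usage note are invented for illustration, they are not results from this project.

```python
def passing_bins(quality: dict, min_completeness=70.0, max_contamination=30.0):
    """Names of the bins meeting both criteria.

    quality: maps a bin name to (completeness %, contamination %).
    """
    return sorted(
        name
        for name, (completeness, contamination) in quality.items()
        if completeness > min_completeness and contamination < max_contamination
    )
```

With made-up values such as `{"bin1": (82.0, 12.5), "bin2": (45.0, 3.0), "bin3": (91.0, 140.0)}`, only `bin1` passes: `bin2` is too incomplete and `bin3` too contaminated.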


Table 5 Result of quality check of the bins identified by GroopM from the Munida samples. This summary presents the bin number, and several characteristics estimated using marker gene sets: the lineage that the sets were taken from, completeness of the bins, contamination, and strain heterogeneity (an estimation of the number of strains from which the reads forming the bin come).

Table 6 Result of quality check by CheckM of the bins identified by MaxBin from the Munida samples.

These relatively poor results from the binning tools might be explained by the characteristics of the data. Indeed, only three samples were sequenced, and with MiSeq technology, which is not the most appropriate for genomic studies and yielded short reads with poor coverage. This in turn led to short contigs (only 3 over 1 kb), with a higher possibility of error. These short contigs might explain why MetaBAT was inefficient on this data, since it uses a cut-off that selects for long contigs only.
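The effect of a length cut-off on such an assembly can be checked with a short Python sketch that counts contigs of at least a given length in a FASTA file. It is only an illustration: the 1 kb threshold echoes the figure quoted above, the toy records are hypothetical, and the actual cut-off applied by MetaBAT is not reproduced here.

```python
from typing import Dict, Iterable


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Map each contig name to the total length of its sequence lines."""
    lengths: Dict[str, int] = {}
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig name is the first word of the header line.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def count_long_contigs(lengths: Dict[str, int], min_length: int = 1000) -> int:
    """Count the contigs with at least min_length bases."""
    return sum(1 for n in lengths.values() if n >= min_length)


# Toy records (hypothetical); a real file would be read with open(path).
toy_fasta = [
    ">contig_1 example",
    "ACGT" * 300,   # 1200 bases over one line
    ">contig_2",
    "ACGT" * 100,   # 400 bases
    "ACGT" * 50,    # plus 200 bases on a second line
]
toy_lengths = contig_lengths(toy_fasta)
n_long = count_long_contigs(toy_lengths)
```

On a real assembly the same two functions could be applied to the lines of the contig file, for instance `contig_lengths(open("meta-velvetg.contigs.fa"))`.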

All in all, these results were not very surprising due to the shortcomings inherent to the Munida transect data. In order to try to get better results, more samples would have been needed, and a better sequencing technology used. Since it was not possible for the team to get more samples, it was decided to use data from the Tara Oceans project.

It was hypothesized that since this data was sequenced using HiSeq technologies, the depth and quality of the reads would be better, as well as binning results. Once bins were identified, the reads from the Munida samples would be mapped to these bins, and thus grouped with better reliability.

Considering the results given previously by GroopM, only MetaBAT and MaxBin were used with the Tara samples. MetaBAT identified 22 bins while MaxBin found 46. Table 7 and Table 8 present the summaries of the quality checks performed with CheckM on the bins obtained. It is clear that binning of these samples was more efficient than with the Munida samples.


However, looking at the bins yielded by MaxBin, only one meets the criteria of over 70% completeness and under 30% contamination. Contamination is much lower in the bins identified by MetaBAT, except for the first two. These are probably catch-all bins that received the reads that could not be grouped otherwise. It is also interesting that CheckM and its marker gene sets assigned some taxonomy to certain bins. However, CheckM’s SSU finder was unable to identify 16S or 18S genes in any of the bins, and Kraken’s assignment was not very reliable: it relied on only 5% or less of the reads, and the marker genes that were identified were very varied and did not point to one taxonomy.

Table 7 Result of quality check of the bins identified by MaxBin from the Tara samples.


Table 8 Result of quality check of the bins identified by MetaBAT from the Tara samples.

Overall, the results of the binning of the Tara samples were not as good as expected, given that three samples sequenced with a HiSeq technology and at considerable depth were used. The number of reads was more than ten times that of the Munida samples. The lack of reliable taxonomic assignment could be overlooked since many of the organisms present in these samples are not known, but the quality of the bins is still rather poor and very few were recovered compared to similar projects, where a few hundred bins are usually identified. This is rather puzzling if one compares these results to other publications or benchmarking.

The explanation could lie in the origin of the samples: this project dealt with samples from the ocean, which contains large ecosystems with mostly uncultured and uncharacterized microorganisms. In contrast, the tools that were used were tested on synthetic datasets that contain few errors, and on well-bounded ecosystems such as the gut microbiome, where the classes of microorganisms are better characterized. The Banfield lab at the University of California, Berkeley, US, has published a few papers using binning in environmental settings with good results, but all of this binning is done by hand, which makes it hard to reproduce and poorly suited to this project and its goals.

Binning seems to have the potential to be a powerful and exciting tool that could greatly help scientists characterize unknown ecosystems but it is only beginning to be developed and in the future it will hopefully get even better at discriminating between species.

To conclude about this project, the method did not prove very conclusive but it could be useful for different projects in Mr Morales’ lab, or more largely in the Department. In order to retain the experience gained, I wrote several tutorials giving instructions to use the different tools that I tried out during this project (see Annex for an example).


III- Soil projects

I had the opportunity to collaborate on two other projects during this internship, both of them dealing with soil ecosystems, one on the South Island of New Zealand, the other in Papua New Guinea.

1) Studying soil communities of the South Island of New Zealand

a. Introduction to the project

This research project is conducted by Rachel Kaminsky, and will be the basis for her PhD thesis. It aims to characterize microbial communities across a range of soils in New Zealand and to investigate the link between the changes observed and soil properties such as pH, land use or soil classification.

Between the 5th and the 16th of May 2014, a total of 288 samples were gathered from 24 field sites (12 samples per site) across four different regions on the South Island of New Zealand (Fig. 5). These sites comprised different types of land use: high country, dairy, and beef and sheep, which are the three main land uses in New Zealand agriculture. Soil properties were also determined for each sampling site through the New Zealand Land Resource Information Systems Portal (https://lris.scinfo.org.nz/), attributing a Soil Order to each sample.

Fig. 5 Map of sampling sites throughout the South Island, with colours highlighting the different regions, triangles representing high country, circles dairy, and squares beef and sheep farming.


Chemical analyses were performed on the samples to determine pH. DNA was extracted and the 16S rRNA gene amplified using a universal primer pair. The resulting reads were clustered into OTUs at 97% similarity using QIIME and assigned taxonomy using BLAST. Statistical analysis was then performed, mostly in R, to try to determine correlations between the OTUs, their presence and abundance, and soil properties.

In order to have a clear presentation of the results to be included in a publication, I was tasked with implementing a new graphical tool: GraPhlAn.

b. Using GraPhlAn to present the results

This command-line software (https://bitbucket.org/nsegata/graphlan) takes a phylogenetic tree and an annotation file and builds a figure presenting the phylogenetic tree surrounded by a number of rings highlighting the OTUs linked to a chosen characteristic.

Here two figures were needed: one representing the OTUs significantly correlated with pH and land use, the other representing the OTUs significantly correlated with soil order.

Working from the software’s manual and a colleague’s previous experience, I acquainted myself with the data and the software in order to put together the desired figures. This included understanding existing Perl scripts and adapting them to the purpose, as well as writing Python scripts of my own. GraPhlAn is highly customisable, and every graphical aspect of the figure can be modified at will. Below are the final results of my work (Fig. 6 and Fig. 7). Following this work, I put together a tutorial designed to help the lab’s members use GraPhlAn in the future.

Fig. 6 GraPhlAn figure representing OTUs significantly associated with high and low country farming, and pH. Light blue means a negative correlation to pH while dark blue means a positive correlation.
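To give an idea of what the annotation file for such a figure contains, the Python sketch below writes one ring entry per OTU, coloured by the sign of its correlation with pH. It is not one of the scripts written during the project. The OTU names and hex colours are hypothetical, and the tab-separated layout with the `ring_color` property is recalled from the GraPhlAn documentation, so it should be checked against the manual before use.

```python
from typing import Dict, List

# Colour scheme following the Fig. 6 caption: light blue for a negative
# correlation with pH, dark blue for a positive one. Hex codes are illustrative.
PH_COLOURS = {"negative": "#ADD8E6", "positive": "#00008B"}


def ring_annotations(otu_to_sign: Dict[str, str], ring_level: int = 1) -> List[str]:
    """Build one tab-separated annotation line per OTU.

    Assumed layout (to be checked against the GraPhlAn manual):
    clade name, property name, ring level, value.
    """
    lines = []
    for otu, sign in sorted(otu_to_sign.items()):
        colour = PH_COLOURS[sign]
        lines.append("\t".join([otu, "ring_color", str(ring_level), colour]))
    return lines


# Hypothetical OTUs and correlation signs.
example = {"OTU_12": "positive", "OTU_7": "negative"}
annotation_lines = ring_annotations(example)
annotation_text = "\n".join(annotation_lines) + "\n"
```

Generating the file from a table of results, rather than editing it by hand, keeps the figure reproducible when the list of significant OTUs changes.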


2) Comparing microbial communities from different environments in Papua New Guinea

a. Introduction to the project

In the last few decades, oil palm production has taken an important place in the economy of Papua New Guinea. More and more land is allotted to the cultivation of oil palms, and this impacts the properties of the soil. A team of researchers has been investigating these changes following conversion of grassland or forest to palm plantation (Goodrick, Nelson, & Banabas, 2014; Nelson et al., 2014). They then turned to the Morales Lab to help them look at these changes from a microbiological point of view.

Having taken samples from grassland, forest, and palm plantation environments, they had the top layer of each sample sequenced for 16S rRNA. After the first set of statistical analyses was applied to the data, it appeared that the difference in diversity between the samples was much smaller than expected and was most significant between grassland and forest, as seen in Fig. 8. This might be because only the top layers were analysed, and not the deeper ones that could have reflected more profound changes, particularly between grassland and oil palm.

Fig. 7 GraPhlAn figure representing OTUs significantly associated with one of the four soil orders: Pallic, Brown, Gley and Recent.

In light of these results, the focus was shifted to a comparison between grassland and forest microbiomes.

We first tried to associate the changes in microbial community with different factors such as pH (see above). A Principal Component Analysis was used to try to elucidate the most important factors driving the changes. Then the same statistical analyses as before (diversity, dissimilarity…) were performed again in R, this time using the first three factors identified by the PCA.

Fig. 8 Comparison of alpha diversity between the samples from the three environments: grassland, forest and oil palm.

Fig. 9 Non-metric multidimensional scaling plot of the Papua New Guinea samples, coloured according to the first factor identified by the PCA, and shaped according to land use.


Once again, no clear trend could be identified (see Fig. 9 above for example). It was then decided to focus on the phylogenetic changes in the community. My subsequent work thus aimed at illustrating the changes in microorganism diversity and abundance in the clearest way possible.

b. Using GraPhlAn and step plots to present the results

In order to do this, I used simper, an R function in the vegan package. Simper, or the similarity percentage analysis, compares two groups and computes the average contribution of each species to the overall dissimilarity between the groups (based on the Bray-Curtis dissimilarity index). Simper yields a table containing each species ordered by contribution to the dissimilarity (expressed as a percentage), as well as the abundance of the species in each group, and the cumulative sum of the contributions.

I applied this function to the data, choosing Grassland and Forest as my groups, and the OTUs identified previously as my species. I also used the abundance data provided by simper to compute the fold change for each OTU between the two conditions. I then selected the OTUs contributing most to the dissimilarity of the groups, up to a cumulative sum of 30%, and constructed the GraPhlAn figure presented below (Fig. 10).
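The selection and fold-change steps can be illustrated with the Python sketch below. Several choices in it are assumptions rather than the method actually used: an OTU is kept only while the cumulative share stays at or below the cut-off, the fold change is expressed as a log2 ratio, and a pseudocount is added to avoid dividing by zero. The contribution values are hypothetical.

```python
import math
from typing import Dict, List


def select_by_cumulative_share(
    contributions: Dict[str, float], cutoff: float = 0.30
) -> List[str]:
    """Keep the top contributors while their cumulative share stays within cutoff."""
    total = sum(contributions.values())
    if total <= 0:
        return []
    ordered = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    selected = []
    running = 0.0
    for otu, value in ordered:
        running += value
        if running / total > cutoff + 1e-12:  # small tolerance for rounding
            break
        selected.append(otu)
    return selected


def log2_fold_change(
    mean_grassland: float, mean_forest: float, pseudocount: float = 1.0
) -> float:
    """Positive values mean the OTU is more abundant in grassland (assumed convention)."""
    return math.log2((mean_grassland + pseudocount) / (mean_forest + pseudocount))


# Hypothetical contributions to the overall dissimilarity.
example = {"otu_a": 20, "otu_b": 10, "otu_c": 30, "otu_d": 40}
top_half = select_by_cumulative_share(example, cutoff=0.5)
top_three_quarters = select_by_cumulative_share(example, cutoff=0.75)
```

With the 30% cut-off used for Fig. 10, the same function would be called with `cutoff=0.30`; the 50% selection used for the step plots only changes that argument.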

Fig. 10 GraPhlAn figure representing species contributing the most to the overall dissimilarity up to a total sum of 30%. The gradient indicates the fold change in abundance between grassland and forest: green means that the OTU is found more abundantly in Grassland samples, whereas red is associated with Forest samples. Nodes are coloured according to this fold change as well, with the colour of nodes of higher taxonomic rank being determined as the average. A black node means that there were as many OTUs associated with Forest as there were with Grassland in this branch.


Thanks to this figure, I was able to isolate patterns more easily. These patterns were represented as step plots, this time taking into account 50% of the cumulative dissimilarity (Fig. 11). In Fig. 11, Verrucomicrobia is an example of a phylum not showing a pattern.

These figures will then be used by the team of researchers in their coming communications.

Fig. 11 Step plots representing relative abundance in the two conditions for various clades, from the top left corner: Bacteroidetes, Archaea, Massilia (part of Betaproteobacteria), Firmicutes, Gemmatimonadetes and Verrucomicrobia. OTUs were selected up to 50% cumulative dissimilarity.


Conclusion

During this internship, I had the opportunity to work on several projects, but most of my time was dedicated to the application of a new bioinformatics method called binning to ocean samples from the Munida transect.

This was a very interesting project for me because, being the only one working on it, I was able to conduct it from start to finish: first studying the bibliography to understand the context of the study and the different tools available, then selecting the appropriate tools by taking into account their specificities and my constraints, and finally applying these tools to the data and analysing the results. These results were not as good as expected, and did not enable the lab to draw further conclusions on their samples. However, binning is an up-and-coming technique, and the tools and method might be used in future projects, thanks to the tutorials that I put together.

The other projects I collaborated on were developed by other members of the lab, and focused on soil samples, either from the South Island of New Zealand or from Papua New Guinea. In the first project, I was tasked with troubleshooting a graphical presentation software called GraPhlAn. I was then able to use the expertise gained on this project in the second one, which aimed at characterizing the changes in microbial community between forest and grassland environments in Papua New Guinea. After performing more statistical analysis on the data, I endeavoured to present the results in the clearest way possible by highlighting the interesting patterns in microbial abundance changes. These figures should then be used by the researchers in their coming publication.

Taking part in these three projects enabled me to gain a varied set of skills: basic metagenomics tools (assembly tools, bwa, samtools), new tools related to binning, a graphical presentation software (GraPhlAn), more experience with R… All these projects also involved getting to know Unix rather well, and knowing where to find information to solve any issue I encountered.

All in all, this internship was very enriching for me. Mr Morales was very intent on getting me to work on projects with a precise goal and left me a lot of freedom and autonomy, and the time necessary for me to figure out by myself how to solve my problems, which is the best way to learn. He was also very attentive to what was interesting to me, and encouraged me to attend a study group organised by the Biochemistry Department presenting many useful bioinformatics tools.

To sum up, this internship was a great way to build on the skills that I started to acquire during the first semester, and confirmed my interest in the field.


Bibliographical references

Alneberg, J., Bjarnason, B. S., de Bruijn, I., Schirmer, M., Quick, J., Ijaz, U. Z., … Quince, C. (2014). Binning metagenomic contigs by coverage and composition. Nature Methods, 11(11), 1144–1146. http://doi.org/10.1038/nmeth.3103

Baltar, F., Currie, K., Stuck, E., Roosa, S., & Morales, S. E. (2016). Oceanic fronts: Transition zones for bacterioplankton community composition. Environmental Microbiology Reports, 8(1), 132–138. http://doi.org/10.1111/1758-2229.12362

Baltar, F., Stuck, E., Morales, S., & Currie, K. (2015). Bacterioplankton carbon cycling along the Subtropical Frontal Zone off New Zealand. Progress in Oceanography, 135, 168–175. http://doi.org/10.1016/j.pocean.2015.05.019

Goodrick, I., Nelson, P. N., & Banabas, M. (2014). Soil carbon balance following conversion of grassland to oil palm, 1–10. http://doi.org/10.1111/gcbb.12138

Imelfort, M., Parks, D., Woodcroft, B. J., Dennis, P., Hugenholtz, P., & Tyson, G. W. (2014). GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ, 2, e603. http://doi.org/10.7717/peerj.603

Kang, D. D., Froula, J., Egan, R., & Wang, Z. (2015). MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ, 3, e1165. http://doi.org/10.7717/peerj.1165

Nelson, P. N. A., Banabas, M. B., Nake, S. B., Goodrick, I. A., Webb, M. J. C., & Gabriel, E. A. (2014). Soil fertility changes following conversion of grassland to oil palm, 698–705.

Nielsen, H. B., Mathieu, A., Juncker, A. S., Rasmussen, S., Li, J., Sunagawa, S., … Ehrlich, S. D. (2014). Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nature Biotechnology, 32(8), 822–828. http://doi.org/10.1038/nbt.2939

Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–55. http://doi.org/10.1101/gr.186072.114

Sharma, V. K., Kumar, N., Prakash, T., & Taylor, T. D. (2012). Fast and accurate taxonomic assignments of metagenomic sequences using metabin. PLoS ONE, 7(4). http://doi.org/10.1371/journal.pone.0034030

Wu, Y. W., Simmons, B. A., & Singer, S. W. (2015). MaxBin 2.0: An automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics, 32(4), 605–607. http://doi.org/10.1093/bioinformatics/btv638

Wu, Y.-W., Tang, Y.-H., Tringe, S. G., Simmons, B. A., & Singer, S. W. (2014). MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome, 2(1), 26. http://doi.org/10.1186/2049-2618-2-26


Annex:

Binning on bb1 server

Start with: your read files in fasta or fastq format

1) Assembly with Metavelvet
http://metavelvet.dna.bio.keio.ac.jp/MSL.html

-Velveth
$ velveth output_directory hash_length input [options]
Max hash_length (kmer) = 63
$ velveth coassembly 51 -fastq coassembly_input.fastq
Output: Roadmaps, Sequences

-Velvetg
$ velvetg directory [options]
To be able to apply metavelvet next, you need the options -read_trkg yes and -exp_cov auto
$ velvetg coassembly -exp_cov auto -read_trkg yes
Output: contigs.fa, stats.txt, LastGraph

-Meta-velvetg
$ meta-velvetg output_directory [options]
$ meta-velvetg coassembly
Output: meta-velvetg.contigs.fa, meta-velvetg.LastGraph, meta-velvetg.Graph2-stats.txt, meta-velvetg.split-stats.txt

For next step get: read files + meta-velvetg.contigs.fa

2) Binning with MaxBin 2.0
http://downloads.jbei.org/data/microbial_communities/MaxBin/MaxBin.html
http://microbiomejournal.biomedcentral.com/articles/10.1186/2049-2618-2-26
http://bioinformatics.oxfordjournals.org/content/32/4/605.full.pdf

$ perl path/to/Maxbin/run_MaxBin.pl -contig -out -abund/-reads
Input information can be either the read files or abundance files (if you do not have abundance files, as here, MaxBin will compute them itself with bowtie2). A list can also be submitted:
$ perl /APPS/MaxBin-2.2/run_MaxBin.pl -contig coassembly/meta-velvetg.contigs.fa -out maxbin_out -reads_list list
(where list is a simple text file with the names of your read files, one per line. You can create it with your usual text editor, or use vi from the command line)

Output: bins as fasta files containing the contigs assigned to each bin, also several abundance files, marker files and stats.

MaxBin will give you its output as files with names starting with what you put as the -out argument. You can then put all of this in a directory:
$ mkdir Maxbin
$ mv maxbin_out.* Maxbin/


For next step: you will need mostly your bins but might also need your input reads.
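As an alternative to a text editor, the read list file mentioned in step 2 can be written with a few lines of Python. This is only a convenience sketch: the read file names and the temporary location below are hypothetical, and the only thing taken from the tutorial is the layout of the file, one read file name per line.

```python
import os
import tempfile
from typing import Iterable


def write_reads_list(read_files: Iterable[str], list_path: str) -> int:
    """Write one read file name per line and return how many were written."""
    names = [name.strip() for name in read_files if name.strip()]
    with open(list_path, "w") as handle:
        for name in names:
            handle.write(name + "\n")
    return len(names)


# Hypothetical read files, written to a temporary directory for the example.
example_reads = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
work_dir = tempfile.mkdtemp()
list_file = os.path.join(work_dir, "list")
n_written = write_reads_list(example_reads, list_file)
```

The resulting file can then be passed to the -reads_list option shown in step 2.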

3) Checking the quality of the bins with CheckM
https://github.com/Ecogenomics/CheckM/wiki

$ checkm lineage_wf bin_folder out_folder [options]
Useful options: --pplacer_threads (number of threads to use for the pplacer part), -x (extension of bins)
$ checkm lineage_wf -x fasta Maxbin/ Maxbin_checkm/ --pplacer_threads 8 > summary_maxbin

CheckM usually prints the summary to standard output, so if you want to save it, redirect it to an output file at the end of your command as done here. It will also give you various stats about your bins.
If you have less than 40 GB of memory (RAM) available, add the option --reduced_tree (not good for laptop).
lineage_wf actually groups multiple commands that you can also use one by one (see wiki).

CheckM can also check the bins to see if it can locate any SSU:
$ checkm ssu_finder seq_file bin_folder out_folder
$ checkm ssu_finder -x fastq coassembly_input.fastq Maxbin/ SSU/
Output: tables by type of SSU (bacteria, eukarya…) and a summary table with the read, its bin (if it is binned), its assignment etc…

CheckM has several other interesting features, all of which you can find on the wiki. Finally you can try assigning a taxonomy to your bins.

4) Assigning taxonomy to the bins with Kraken
https://ccb.jhu.edu/software/kraken/MANUAL.html

Xochitl knows more about this than I do; I used her program and relied on her experience. Here is just the command line for what I wanted to do; Kraken can do many more things.

-Assign taxonomy:
$ kraken --db /APPS/kraken/minikraken_20141208 --threads 12 --fasta-input --preload --only-classified-output bin_number.fasta > out_number.krk

-Make output readable (this gives you one file for all of the bins you analysed)
$ kraken-mpa-report --db /APPS/kraken/minikraken_20141208 *.krk > my.report

Output: a tree of the taxonomy assigned with the number of reads assigned to each
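Since the kraken command has to be repeated for every bin, the command lines can be generated with a short Python sketch. The database path and the options are the ones quoted in step 4; the bin file names, the output naming scheme and the helper functions are hypothetical. The sketch only builds the commands; the last function shows one possible way to run them and is not called here.

```python
import os
import subprocess
from typing import List

KRAKEN_DB = "/APPS/kraken/minikraken_20141208"  # database path quoted in step 4


def kraken_command(bin_fasta: str, threads: int = 12) -> List[str]:
    """Build the kraken command for one bin, with the options used in step 4."""
    return [
        "kraken", "--db", KRAKEN_DB,
        "--threads", str(threads),
        "--fasta-input", "--preload", "--only-classified-output",
        bin_fasta,
    ]


def output_name(bin_fasta: str) -> str:
    """Name the result after the bin: x.fasta becomes x.krk (assumed scheme)."""
    stem, _extension = os.path.splitext(bin_fasta)
    return stem + ".krk"


def run_kraken(bin_fasta: str) -> None:
    """Run kraken on one bin and save its standard output (needs kraken installed)."""
    with open(output_name(bin_fasta), "w") as handle:
        subprocess.run(kraken_command(bin_fasta), stdout=handle, check=True)


# Hypothetical bin files.
bins = ["Maxbin/maxbin_out.001.fasta", "Maxbin/maxbin_out.002.fasta"]
commands = [kraken_command(path) for path in bins]
outputs = [output_name(path) for path in bins]
```

Once every bin has its .krk file, the kraken-mpa-report command from step 4 can be run once on all of them.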