Advancing Science with DNA Sequence: Metagenome analysis. Natalia Ivanova, MGM Workshop, February 2, 2012


Page 1

Advancing Science with DNA Sequence

Metagenome analysis

Natalia Ivanova

MGM Workshop

February 2, 2012

Page 2

1. Metagenome definitions: a refresher course

Page 3

Metagenome definitions

A metagenome is the collective genome of a microbial community, AKA microbiome (native, enriched, sorted, etc.).

A metagenomic library (or libraries) is constructed from the isolated DNA (native, enriched, etc.).

A metagenomic library can be single-end (AKA standard) or paired-end.

Page 4

Metagenome definitions

A single-end (standard) metagenomic library will produce contigs upon assembly (i.e. longer sequences built from overlaps between reads).

Any Ns found in contigs correspond to low-quality bases.

A paired-end metagenomic library will produce scaffolds upon assembly (non-contiguous joining of reads based on read-pair information).

Ns found in scaffolds correspond either to low-quality bases or to gaps of unknown size.

[Figure: a contig drawn as one continuous double-stranded sequence (ATGCAAAGGCCGCATCCAGCAGGTT / TACGTTTCCGGCGTAGGTCGTCCAA) and a scaffold drawn as two double-stranded blocks (ATGCAAAGGCCGCATCC / TACGTTTCCGGCGTAGG and AGCAGGTT / TCGTCCAA) with a run of Ns (NNNNNN)]
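To make the contig/scaffold distinction concrete, here is a minimal Python sketch that splits a scaffold into contigs at runs of Ns. The function name and the `min_gap` threshold are assumptions made for illustration: shorter N runs are treated as low-quality bases inside a contig, longer runs as gaps of unknown size. Assemblers record gaps in their own ways, so treat this as a sketch rather than a description of any particular tool.

```python
import re

def split_scaffold(scaffold, min_gap=5):
    """Split a scaffold into contigs at runs of at least `min_gap` Ns.

    Shorter N runs stay inside the contig, on the assumption that they
    mark low-quality bases rather than gaps of unknown size.
    """
    sequence = scaffold.upper()
    gap = re.compile(r"N{%d,}" % min_gap)
    contigs = [piece for piece in gap.split(sequence) if piece]
    gaps = [match.span() for match in gap.finditer(sequence)]
    return contigs, gaps

# The two blocks from the slide, joined by the run of six Ns.
contigs, gaps = split_scaffold("ATGCAAAGGCCGCATCCNNNNNNAGCAGGTT")
print(contigs)  # ['ATGCAAAGGCCGCATCC', 'AGCAGGTT']
print(gaps)     # [(17, 23)]
```

The gap spans are 0-based with an exclusive end, so `(17, 23)` is the six-base run of Ns.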

Page 5

Amplified and Unamplified Libraries

[Figure: two side-by-side workflows, one for the amplified library and one for the unamplified library. Steps named in the figure: Fragmentation (1 µg), DNA Chip, Double SPRI, End repair / Phosphorylation, SPRI Clean, A-tailing with Klenow exo-, Heat Inactivation, Adaptor Ligation, PCR 10-cycle Amplification, qPCR Quantification]

Page 6

Metagenome definitions (contd):

overlap = alignment of reads at x% identity

Unless the community has very low complexity (i.e. is dominated by one or a few clonal populations), an assembly at 100% nucleotide identity will be very fragmented.

What to do with k-mer based assemblies? Use multiple k-mer settings, then combine the assemblies with an overlap-layout-consensus assembler like minimus2, using a minimal % identity of 95%. There is a tradeoff between overlap length and % identity.
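The "overlap at x% identity" definition can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how minimus2 works: it scores only ungapped suffix/prefix overlaps, and the function names and the `min_length` default are hypothetical. The two thresholds expose the tradeoff between overlap length and % identity.

```python
def overlap_identity(left, right, length):
    """Percent identity of an ungapped overlap of `length` bases between
    the suffix of `left` and the prefix of `right`."""
    if length <= 0 or length > min(len(left), len(right)):
        raise ValueError("overlap length out of range")
    suffix, prefix = left[-length:], right[:length]
    matches = sum(a == b for a, b in zip(suffix, prefix))
    return 100.0 * matches / length

def best_overlap(left, right, min_length=10, min_identity=95.0):
    """Longest suffix/prefix overlap that meets both thresholds, or None."""
    longest = min(len(left), len(right))
    for length in range(longest, min_length - 1, -1):
        identity = overlap_identity(left, right, length)
        if identity >= min_identity:
            return length, identity
    return None

# One mismatch in ten bases gives 90% identity.
print(overlap_identity("ACGTACGTAC", "ACGTACGTAT", 10))  # 90.0
```

Lowering `min_identity` lets reads from closely related strains overlap, while raising `min_length` guards against short, spurious joins.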

Page 7

Reasoning behind combining multiple assemblies

Page 8

Assembly Pipeline v.0.9

Trimming does not appear to be ideal for this process

Picking the best k-mer: manual process

CPU time intensive; no known metagenomic k-mer prediction algorithm

A snapshot of an older (454-Illumina) metagenome assembly pipeline

Page 9

Metagenome definitions (contd):

overlap = alignment of reads at x% identity

Assembly of sequences at less than 100% identity => population contigs and scaffolds representing a consensus sequence of a species population

[Figure labels: isolate contig; species population contigs]

Page 10

2 more important definitions

1. Sequence coverage (AKA read depth)

How many times each base has been sequenced => needs to be considered when calculating protein family abundance

• Per-contig average coverage
• Per-base coverage => per-gene coverage

2. Bins

Scaffolds, contigs and unassembled reads can be binned into sets of sequences (bins) that likely originated from the same species population or from a population of a broader taxonomic lineage
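For the sequence coverage definition above, a minimal Python sketch of how per-base, per-contig and per-gene coverage relate. It assumes read alignments are already available as (start, end) positions on the contig, 0-based with an exclusive end; that convention and the function names are assumptions for illustration.

```python
def per_base_coverage(contig_length, alignments):
    """Count how many reads cover each base of a contig.

    `alignments` holds (start, end) read positions on the contig,
    0-based with `end` exclusive (an assumed convention).
    """
    # Difference array: +1 where a read starts, -1 just past its end.
    delta = [0] * (contig_length + 1)
    for start, end in alignments:
        delta[max(start, 0)] += 1
        delta[min(end, contig_length)] -= 1
    coverage, depth = [], 0
    for change in delta[:contig_length]:
        depth += change
        coverage.append(depth)
    return coverage

def mean_coverage(coverage, start=0, end=None):
    """Average depth over [start, end); the whole contig by default."""
    window = coverage[start:end]
    return sum(window) / len(window)

coverage = per_base_coverage(10, [(0, 6), (2, 8), (4, 10)])
print(coverage)                       # [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]
print(mean_coverage(coverage))        # 1.8  (per-contig average)
print(mean_coverage(coverage, 4, 8))  # 2.5  (a gene at positions 4..7)
```

Per-gene coverage is simply the per-base coverage averaged over the gene's coordinates, which is why the per-base values are the more useful thing to keep.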

Page 11

What IMG does and doesn’t do

• Scaffolds and contigs are generated by assembly – not provided in IMG/M

• Sequence coverage can be computed by the assembler based on alignments it generates (preferable) or can be added later by aligning reads to contigs – the latter can be provided in IMG/M

• Bins are generated by binning software – not provided in IMG/M

• Scaffolds, contigs and unassembled reads are annotated with non-coding RNAs, repeats (CRISPRs), and protein-coding genes (CDSs); the latter are assigned to protein families (COGs, Pfams, TIGRfams, KEGG Orthology, EC numbers, internal clusters) – provided in IMG/M

Page 12

What’s the difference between IMG and MG-RAST, IMG and CAMERA?

• We prefer to assemble the data
  longer sequences -> better quality of gene prediction and functional annotation
  longer sequences -> chromosomal context and binning -> population-level analysis

• But we don’t provide assembly services except for metagenomes sequenced at the JGI
  we may be able to help with assembly of 454
  we’re not equipped to assemble massive amounts of Illumina data
  http://galaxy.jgi-psf.org
  Contact person: Ed Kirton, [email protected]

• IMG does not provide tools for analysis of 16S data from the metagenome itself
  we do assembly -> none of assembled 16S sequences is reliable
  BLASTn of reads matching conserved regions is misleading
  we do pyrotags for every metagenome sequenced at the JGI
  http://pyrotagger.jgi-psf.org

Page 13

2. IMG/M features: divide and conquer

(see also IMG/M -> Using IMG/M -> Using IMG/M -> IMG User Guide and IMG/M Addendum)

http://img.jgi.doe.gov/m

http://img.jgi.doe.gov/mer
username: public
password: public

Page 14

IMG/M User Interface Map

About IMG/M -> Using IMG/M -> User Interface Map

Page 15

Dividing the contigs by GC content or length

• Statistics
  Microbiome Details -> Genome Statistics -> DNA Scaffolds

• Search
  Microbiome Details -> Scaffold Search
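Outside the IMG/M interface, the same division of contigs by GC content or length can be sketched in Python. The function names and the cutoff values below are assumptions for illustration; in practice the cutoffs would come from the GC and length histograms of the dataset.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0

def divide_contigs(contigs, gc_cutoff=0.5, min_length=0):
    """Split {name: sequence} into low-GC and high-GC name lists,
    skipping contigs shorter than `min_length`."""
    low, high = [], []
    for name, seq in contigs.items():
        if len(seq) < min_length:
            continue
        (high if gc_content(seq) >= gc_cutoff else low).append(name)
    return low, high

# Hypothetical contigs; "c" is dropped by the length filter.
contigs = {"a": "ATATAT", "b": "GCGCGC", "c": "GC"}
print(divide_contigs(contigs, gc_cutoff=0.5, min_length=4))  # (['a'], ['b'])
```

Ns are left out of the denominator so that gaps in scaffolds do not pull the GC fraction down.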

Page 16

Dividing the genes phylogenetically: Phylogenetic Distribution of Genes

Microbiome Details -> Phylogenetic Distribution of Genes

Components:
• histograms
• Protein Recruitment Plots
• summary statistics tables
• lists of genes

[Figure callouts: histogram (phylum/class); gene counts; gene lists; summary statistics; histogram (family); histogram (species); counts, lists, statistics; counts, lists; recruitment plots]

Page 17

Dividing the contigs: Scaffold Cart

• Lists of contigs or genes in Gene Cart

E. g. Microbiome Details -> Genome Statistics -> DNA Scaffolds -> scaffold counts

Scaffold Cart

Features:
• Scaffold Export
• Adding all genes to Gene Cart
• Function Profile (against functions in Function Cart)
• Histograms by GC content, length and gene count
• Phylogenetic Distribution

Page 18

All Carts in IMG are interconnected

Gene Cart

Scaffold Cart

Function Cart

Page 19

Dividing the genes by abundance / by function

• Abundance Profiles
  Compare Genomes -> Abundance Profiles Tools

Components:

Common parameters:
• Normalization (none / scale for size)
• Type of count (raw counts / estimated gene copies)
• Type of protein family (COG, Pfam, Enzyme, TIGRfam)
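A rough Python sketch of what the count and normalization parameters above could mean in practice. It assumes that "estimated gene copies" weights each gene by its read depth and that "scale for size" divides by the sample-wide total; both readings, the function names, and the gene list are assumptions, not a description of IMG/M internals.

```python
from collections import defaultdict

def family_abundance(genes, use_coverage=True):
    """Sum abundance per protein family.

    `genes` is a list of (family, coverage) pairs. With use_coverage
    False every gene counts once (raw counts); with True each gene is
    weighted by its read depth (assumed meaning of estimated copies).
    """
    totals = defaultdict(float)
    for family, coverage in genes:
        totals[family] += coverage if use_coverage else 1.0
    return dict(totals)

def scale_for_size(totals):
    """Divide each family total by the sample-wide total."""
    grand_total = sum(totals.values())
    return {family: value / grand_total for family, value in totals.items()}

# Hypothetical genes: (protein family, per-gene coverage).
genes = [("COG0001", 10.0), ("COG0001", 2.0), ("COG0002", 4.0)]
print(family_abundance(genes, use_coverage=False))  # {'COG0001': 2.0, 'COG0002': 1.0}
print(family_abundance(genes))                      # {'COG0001': 12.0, 'COG0002': 4.0}
print(scale_for_size(family_abundance(genes)))      # {'COG0001': 0.75, 'COG0002': 0.25}
```

The raw counts suggest a 2:1 ratio between the two families, while the coverage-weighted counts give 3:1, which is why read depth matters when comparing protein family abundance in assembled data.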

Page 20

Other tools

• Phylogenetic Marker COGs
  Find Functions -> Phylogenetic Marker COGs

• SNP BLAST and SNP VISTA
  Gene Page -> SNP BLAST -> SNP VISTA

IMG/M exercises:
http://genomebiology.jgi-psf.org/Content/MGM-11.Feb2012/agenda.html

The first 3 pages are questions without answers; the rest is a cheat sheet

Page 21

Life outside IMG: binning tools

Alignment-based tools
• MEGAN – BLAST+LCA
  http://www-ab.informatik.uni-tuebingen.de/software/megan
• MTR – BLAST+MTR
  http://cs.ru.nl/gori/software/MTR.tar.gz
• SOrt-ITEMS – processed BLAST best hit
  http://metagenomics.atc.tcs.com/binning/SOrt-ITEMS
• CARMA and Web-CARMA – MSA + neighbor-joining tree
  http://webcarma.cebitec.uni-bielefeld.de

Compositional tools
• PhyloPythia – 6-mers, SVM
  http://cbcsrv.watson.ibm.com/phylopythia.html
• TACOA – 2-6 mers, k-nearest neighbor classifier
  http://www.cebitec.uni-bielefeld.de/brf/tacoa/tacoa.html
• Phymm and PhymmBL – Interpolated Markov models (IMMs)
  http://www.cbcb.umd.edu/software/phymm/
• ClaMS – DOR, DBC
  http://clams.jgi-psf.org
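The LCA (lowest common ancestor) idea behind the BLAST+LCA approach listed above can be sketched as follows. This is a bare-bones illustration, not MEGAN's implementation, which applies additional hit filters before placing a read; the function name and the shortened lineages are assumptions.

```python
def lowest_common_ancestor(lineages):
    """Deepest taxon shared by all lineages.

    Each lineage is a root-to-leaf list of taxon names, one per
    BLAST hit retained for the read.
    """
    if not lineages:
        return None
    shared = None
    # zip stops at the shortest lineage, so ranks stay aligned.
    for ranks in zip(*lineages):
        if len(set(ranks)) > 1:
            break
        shared = ranks[0]
    return shared

# Simplified lineages for three hits of one read.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
print(lowest_common_ancestor(hits))  # Proteobacteria
```

A read whose hits disagree at a high rank is assigned conservatively to that rank rather than to a species, which is the main design choice that separates LCA from a plain best-hit assignment.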

Page 22

Life outside IMG: statistical analysis tools

Comparison of 2 samples
• MEGAN - http://www-ab.informatik.uni-tuebingen.de/software/megan
• STAMP - http://kiwi.cs.dal.ca/Software/STAMP

Comparison of sets of samples
• ShotgunFunctionalizeR – R package for statistical analysis - http://shotgun.zool.gu.se
• METAREP – package from JCVI, includes multidimensional scaling, hierarchical clustering, etc. - http://www.jcvi.org/metarep
• METASTATS – package for analysis of paired samples with replicates - http://metastats.cbcb.umd.edu/
• LEfSe – package for comparison of multiple classes of samples with replicates - http://huttenhower.sph.harvard.edu/lefse/
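As an illustration of what a two-sample comparison involves, here is a minimal Python sketch of a two-sided Fisher's exact test on the counts of one protein family in two metagenomes. It is one common choice for such comparisons and is not taken from any of the packages listed; the function name and table layout are assumptions, and a real analysis over many families would also need multiple-testing correction.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, c: genes in the family of interest in samples 1 and 2;
    b, d: all other genes in samples 1 and 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability of x family genes in sample 1.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one;
    # the small tolerance absorbs floating-point ties.
    p_value = sum(p for p in map(probability, range(low, high + 1))
                  if p <= observed * (1 + 1e-9))
    return min(p_value, 1.0)

# Hypothetical counts: 8 of 10 genes in sample 1 vs 1 of 6 in sample 2.
print(round(fisher_exact_two_sided(8, 2, 1, 5), 4))  # 0.035
```

Raw counts are used here; with assembled data the coverage-weighted counts would be the more meaningful input.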