Pamela Ferretti Laboratory of Computational Metagenomics Centre for Integrative Biology University of Trento Italy Microbial Genome Assembly 1


Post on 23-Feb-2016


Pamela Ferretti

Laboratory of Computational Metagenomics

Centre for Integrative Biology

University of Trento

Italy

Microbial Genome Assembly

Outline-summary

1. QUICK INTRODUCTION

2. GENOME ASSEMBLY

3. ASSEMBLY STRATEGIES

4. CASE STUDY

DNA packaging

Next Generation Sequencing

TCTTATTGTGACC TAGGCTAGCTTAG

GCAATGCAGTAAC TCCAGCTAGGTTC

ACGTAGGCTAGCGTTAGCGA ........ CTGCAT C

Genome Assembly

1. GENOME SEQUENCING

2. PRELIMINARY ANALYSIS

3. ASSEMBLY

4. ADVANCED BIOINFORMATIC ANALYSIS

OVERLAPPING SEQUENCE ALIGNMENT
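
The overlap idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: overlaps are exact (no mismatches or indels), and the function names and the `min_overlap` parameter are hypothetical, not taken from any assembler.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads on their suffix/prefix overlap, or return None."""
    n = overlap_length(left, right, min_overlap)
    if n == 0:
        return None
    return left + right[n:]


if __name__ == "__main__":
    print(merge_reads("ACGTAGGC", "AGGCTAGC"))
```

Real reads contain sequencing errors, so production tools rely on approximate alignment rather than the exact string matching used here.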

On the feasibility of sequence assembly

Sequencing the human genome with shotgun sequencing + assembly is the only feasible strategy

Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.

Computational assembly of shotgun sequencing data is simply unfeasible, and a bad idea anyway

Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.

They were both right! (…well, Weber and Myers were a bit more right from the practical viewpoint…)

Genome assembly strategies

Greedy approach → SSAKE

De Bruijn graph (DBG) → Velvet, SOAPdenovo

Overlap-Layout-Consensus (OLC) → MIRA

Mixed approaches → MaSuRCA

Genome assembly strategies

DE BRUIJN GRAPH APPROACH (DBG)

Velvet, SOAPdenovo2

Nodes = overlapping sequences of reads of uniform length

Edges = k-mers (unique subsequences within reads)

EULERIAN PATH
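
A minimal sketch of the De Bruijn idea follows. It assumes the common convention that nodes are (k-1)-mers and that each distinct k-mer contributes one edge, and it uses error-free toy reads. The function names are hypothetical, and this is not how Velvet or SOAPdenovo2 are implemented.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the list of (k-1)-mer nodes it points to.

    Every distinct k-mer is stored once and yields one edge: prefix -> suffix.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the result deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph does have an Eulerian path."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start where out-degree exceeds in-degree by one, else at any node.
    start = next(
        (n for n in graph if out_deg[n] - in_deg[n] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a node path back into a sequence: first node plus last base of each next node."""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]
    print(spell(eulerian_path(build_de_bruijn(reads, 4))))
```

Because each k-mer is kept only once, repeated regions collapse into shared nodes, and a sequencing error introduces new k-mers that create spurious branches.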

Genome assembly strategies

OVERLAP-LAYOUT-CONSENSUS (OLC)

MIRA

Nodes = reads

Edges = overlaps between reads

1. OVERLAP

2. LAYOUT

3. CONSENSUS

HAMILTONIAN PATH
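
The three OLC stages can be sketched on toy data. The assumptions are strong: overlaps are exact, the layout is found by brute force over all read orderings (only viable for a handful of reads, since the Hamiltonian path problem is NP-hard), and the "consensus" is a plain merge rather than a per-column vote. All names are hypothetical and unrelated to MIRA's internals.

```python
from itertools import permutations


def overlap_length(left, right, min_overlap):
    """Longest exact suffix(left)/prefix(right) match of at least min_overlap bases."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def build_overlap_graph(reads, min_overlap):
    """1. OVERLAP: nodes are read indices, edges (i, j) carry the overlap length."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_overlap)
                if n:
                    edges[(i, j)] = n
    return edges


def hamiltonian_layout(reads, edges):
    """2. LAYOUT: first ordering that visits every read once along existing edges."""
    for order in permutations(range(len(reads))):
        if all((a, b) in edges for a, b in zip(order, order[1:])):
            return order
    return None


def consensus(reads, edges, order):
    """3. CONSENSUS (simplified): append the non-overlapping tail of each next read."""
    sequence = reads[order[0]]
    for a, b in zip(order, order[1:]):
        sequence += reads[b][edges[(a, b)]:]
    return sequence


if __name__ == "__main__":
    reads = ["GCGTGCA", "ATGGCG", "GGCGTG"]
    edges = build_overlap_graph(reads, min_overlap=3)
    order = hamiltonian_layout(reads, edges)
    print(order, consensus(reads, edges, order))
```

The all-against-all comparison in the overlap stage is quadratic in the number of reads, which is why that stage dominates the running time.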

Genome assembly strategies

DBG

ADVANTAGES
• Very sensitive to repeats
• Each k-mer stored just once
• Eulerian cycle
• Never explicitly computes pairwise overlaps

DISADVANTAGES
• Sensitive to sequencing errors (new k-mers)
• Large memory requirements

OLC

ADVANTAGES
• Modular algorithmic design
• Flexibility and robustness

DISADVANTAGES
• Hamiltonian cycle
• Overlap stage is time-consuming
• Genome-size limitations

Genome Assemblers

• Average coverage
• Number of contigs
• Number of contigs > 1 Kb
• N50 contig size
• Fraction of reads assembled
• Total consensus (in nt)
• Number of scaffolds
• N50 scaffold size
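
Several of these metrics can be computed directly from the list of contig lengths. The sketch below uses the usual N50 definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size). The average-coverage estimate (total read bases divided by total consensus length) and all names are assumptions made for illustration.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum (longest first)
    first reaches at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # only reached for an empty list


def assembly_summary(contig_lengths, total_read_bases=None):
    """Collect a few evaluation metrics for one assembly."""
    total = sum(contig_lengths)
    summary = {
        "num_contigs": len(contig_lengths),
        "num_contigs_gt_1kb": sum(1 for n in contig_lengths if n > 1000),
        "total_consensus_nt": total,
        "n50_contig_size": n50(contig_lengths),
    }
    if total_read_bases is not None and total > 0:
        # Rough estimate: sequenced bases per assembled base.
        summary["average_coverage"] = total_read_bases / total
    return summary


if __name__ == "__main__":
    lengths = [8000, 5000, 3000, 2000, 1000, 500]
    print(assembly_summary(lengths, total_read_bases=1_950_000))
```

The same `n50` function applies to scaffold lengths to obtain the N50 scaffold size.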

Ion Torrent PGM → MIRA 3.9

Illumina → MaSuRCA

MIRA 3.9 also produced good-quality results, but it has a longer execution time and becomes unstable with large numbers of short reads

Mycobacteria Assembly: Case Study

Responsible for many animal and human diseases

• M. tuberculosis and M. leprae (TM)
• M. fortuitum (NTM) outbreak (nail salon, 2002)
• M. chelonae (NTM) outbreak (face lifts, 2004)

Illumina HiSeq sequencing (NGS Facility – CIBIO/UNITN)

• Twenty mycobacterial strains
• From 20 different Mycobacteria species

→ MaSuRCA

Novel mycobacteria detection clinical tests

Raw data quality assessment and pre-processing

Fastq-mcf tool

• poor quality ends of reads
• Ns, duplicates and sequencing adapters
• reads that are too short

Reduction up to 73%
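
The kind of filtering listed on this slide can be illustrated with a small sketch. It is not fastq-mcf's algorithm: it assumes Phred+33 quality strings, trims low-quality bases from both read ends, then discards reads that contain Ns, are exact duplicates, or end up too short. Adapter clipping is left out, and the thresholds and names are hypothetical.

```python
def trim_low_quality_ends(seq, qual, min_q=20, offset=33):
    """Trim bases from both ends while their Phred score is below min_q."""
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(seq)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]


def clean_reads(records, min_q=20, min_len=5):
    """records: iterable of (name, sequence, quality) tuples.

    Yields trimmed records, skipping reads that are too short after
    trimming, contain N, or repeat an already seen sequence.
    """
    seen = set()
    for name, seq, qual in records:
        seq, qual = trim_low_quality_ends(seq, qual, min_q)
        if len(seq) < min_len or "N" in seq or seq in seen:
            continue
        seen.add(seq)
        yield name, seq, qual


if __name__ == "__main__":
    records = [
        ("r1", "ACGTACGTAC", "IIIIIIII##"),  # low-quality tail gets trimmed
        ("r2", "ACGTNCGTAC", "IIIIIIIIII"),  # contains an N
        ("r3", "ACGTACGTAC", "##II######"),  # too short after trimming
        ("r4", "ACGTACGT", "IIIIIIII"),      # duplicate of trimmed r1
    ]
    for record in clean_reads(records):
        print(record)
```

In this encoding the character `I` corresponds to a Phred score of 40 and `#` to a score of 2.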

Assembly parameters setting

K-mers: strings of a particular length k, which are shorter than entire reads

Best empirical k-mer length: 91 bases

High coverage
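
To make the k-mer definition concrete, here is a minimal sketch that splits reads into k-mers and counts them. A read of length L yields L - k + 1 k-mers, so with k = 91 any read shorter than 91 bases contributes none. The function names are hypothetical.

```python
from collections import Counter


def kmers(read, k):
    """All substrings of length k, in order; empty if the read is shorter than k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_spectrum(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


if __name__ == "__main__":
    print(kmers("ACGTAC", 3))
    print(kmer_spectrum(["ACGTAC", "CGTACG"], 3))
```

With high coverage, genuine k-mers are seen many times while k-mers created by sequencing errors tend to appear only once or twice, which is one reason the choice of k interacts with coverage.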

MaSuRCA results of Mycobacteria

Abnormal GC content

Genome size too high

Examples of environmental contamination

GC-content-based quality analysis

Staphylococcus epidermidis
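
A GC-content check of this kind can be sketched as follows. Mycobacteria are GC-rich while Staphylococcus epidermidis is AT-rich, so contigs whose GC fraction lies far from the expected value are candidates for contamination. The `expected_gc` and `tolerance` values are assumptions chosen for the toy example, and the names are hypothetical.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def flag_outlier_contigs(contigs, expected_gc, tolerance=0.10):
    """Names of contigs whose GC fraction differs from expected_gc by more than tolerance."""
    return [
        name
        for name, seq in contigs.items()
        if abs(gc_content(seq) - expected_gc) > tolerance
    ]


if __name__ == "__main__":
    contigs = {
        "contig_1": "GCGCGCATGC",  # GC-rich, as expected for the target genome
        "contig_2": "ATATATGCAT",  # AT-rich, possible contaminant
    }
    print(flag_outlier_contigs(contigs, expected_gc=0.65, tolerance=0.20))
```

GC content alone cannot identify the contaminating organism; flagged contigs would still need to be compared against reference sequences.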

Thanks


http://gcat.davidson.edu/phast/#methods