Next generation sequencing in cloud computing era

Vertebrate Resequencing Informatics, 17th November, 2009. Files, Tools, and Bioinformatics in the Cloud. Thomas Keane, Vertebrate Resequencing Informatics, WTSI [email protected]

Upload: thomas-keane

Post on 10-May-2015


DESCRIPTION

Slides from a recent discussion session in Cambridge

TRANSCRIPT

Page 1: Next generation sequencing in cloud computing era

Vertebrate Resequencing Informatics 17th November, 2009

Files, Tools, and Bioinformatics in the Cloud

Thomas Keane

Vertebrate Resequencing Informatics WTSI [email protected]

Page 2: Next generation sequencing in cloud computing era

DATA is the problem!

NGS means large volumes of raw data
  Previously SRF (~8-10 bytes per bp), now BAM (~1.6 bytes per bp)

How much data can a sequencing machine produce?
  20 Gbp per lane, 16 lanes per run (1 run = 1.5 weeks) => 11 Tbp/year
  Small sequencing center: 4 machines?
  44 Tbp per year!

Raw data in BAM: 70 Tbytes
Processed calls much smaller
  1000G pilot VCF < 1 Gbyte
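The figures on this slide can be checked with a few lines of arithmetic. The sketch below is not from the slides. It assumes a 52-week year and decimal units (1 Tbp = 1000 Gbp, 1 Tbyte = 10^12 bytes), and all variable names are illustrative.

```python
# Back-of-envelope check of the slide's data-volume figures.
# Assumptions (not stated on the slide): 52 weeks per year, decimal units.

WEEKS_PER_YEAR = 52

gbp_per_lane = 20        # from the slide
lanes_per_run = 16       # from the slide
weeks_per_run = 1.5      # from the slide
machines = 4             # "small sequencing center"
bam_bytes_per_bp = 1.6   # approximate BAM footprint from the slide

runs_per_year = WEEKS_PER_YEAR / weeks_per_run
tbp_per_machine = gbp_per_lane * lanes_per_run * runs_per_year / 1000
tbp_per_center = tbp_per_machine * machines

# 1 Tbp stored at N bytes per bp occupies N Tbytes.
bam_tbytes = tbp_per_center * bam_bytes_per_bp

print(f"per machine: {tbp_per_machine:.1f} Tbp/year")
print(f"per center:  {tbp_per_center:.1f} Tbp/year")
print(f"BAM storage: {bam_tbytes:.1f} Tbytes/year")
```

Without intermediate rounding this gives roughly 11 Tbp per machine, 44 Tbp per center and about 71 Tbytes of BAM per year. The slide's 70 Tbytes follows from rounding to 44 Tbp before multiplying by 1.6.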

Alignment + BAM improvement

SV Calling: SVMerge

Page 3: Next generation sequencing in cloud computing era

Simplistic Model: Cloud as compute resource

Processes 1. Align

BAM VCF

SRF/Fastq/BAM (2 Mbps)

BAM + VCF (2 Mbps)

Variant calling (n x SNP callers, n indel callers, SV callers)

Sequencing Center/Institute

3,240 days to upload!
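The upload estimate appears to follow from the 70 Tbytes of BAM per year on the previous slide and the 2 Mbps link shown here. A minimal sketch, assuming the link speed means 2 megabits per second, decimal terabytes, and a fully utilised link with no protocol overhead (the function name is illustrative):

```python
# Rough transfer-time estimate for moving a year's BAM output to the cloud.
# Assumptions: 2 Mbps = 2e6 bits/second, 1 Tbyte = 1e12 bytes,
# sustained full link utilisation, no protocol overhead or retries.

SECONDS_PER_DAY = 86_400

def upload_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to send size_bytes over a link of the given speed."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY

days = upload_days(70e12, 2e6)
print(f"{days:.0f} days (~{days / 365:.1f} years)")
```

Under these assumptions the result is about 3,240 days, close to nine years to upload a single year of output, which is the point the slide makes.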

Page 4: Next generation sequencing in cloud computing era

Move the raw data generation to the compute

VCF

Variant calling (n x SNP callers, n indel callers, SV callers)

BAM VCF

Sequencing Center/Institute

Page 5: Next generation sequencing in cloud computing era

Large Collaborative Projects: Cloud centric model

VCF

Analysis groups