Next-generation sequencing in the cloud computing era

Post on 10-May-2015

DESCRIPTION

Slides from a recent discussion session in Cambridge

TRANSCRIPT

Vertebrate Resequencing Informatics 17th November, 2009

Files, Tools, and Bioinformatics in the Cloud

Thomas Keane

Vertebrate Resequencing Informatics WTSI thomas.keane@sanger.ac.uk

DATA is the problem!

NGS means large volumes of raw data
  Previously SRF (~8-10 bytes per bp), now BAM (~1.6 bytes per bp)

How much data can a sequencing machine produce?
  20 Gbp per lane, 16 lanes per run (1 run = 1.5 weeks) => 11 Tbp/year
  Small sequencing center: 4 machines?
  44 Tbp per year!

Raw data in BAM: 70 Tbytes
Processed calls much smaller
  1000G pilot VCF < 1 Gbyte
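The throughput and storage figures on this slide can be reproduced with a short back-of-the-envelope calculation. The sketch below uses only the numbers quoted above; the 52-week year, the decimal units (1 Tbp = 1000 Gbp) and all function and variable names are assumptions made for illustration.

```python
# Back-of-the-envelope sequencing output and storage, using the slide's figures.
GBP_PER_LANE = 20        # from the slide
LANES_PER_RUN = 16       # from the slide
WEEKS_PER_RUN = 1.5      # from the slide
MACHINES = 4             # "small sequencing center" on the slide
BAM_BYTES_PER_BP = 1.6   # from the slide
WEEKS_PER_YEAR = 52      # assumption: machines run back to back all year


def yearly_output_tbp(machines: int = 1) -> float:
    """Yearly output in Tbp for the given number of machines."""
    runs_per_year = WEEKS_PER_YEAR / WEEKS_PER_RUN
    gbp_per_year = GBP_PER_LANE * LANES_PER_RUN * runs_per_year * machines
    return gbp_per_year / 1000  # Gbp -> Tbp (decimal units assumed)


def storage_tbytes(tbp: float, bytes_per_bp: float) -> float:
    """Storage in Tbytes: Tbp multiplied by bytes per bp."""
    return tbp * bytes_per_bp


per_machine_tbp = yearly_output_tbp(1)
centre_tbp = yearly_output_tbp(MACHINES)
bam_tbytes = storage_tbytes(centre_tbp, BAM_BYTES_PER_BP)

print(f"Per machine: {per_machine_tbp:.1f} Tbp/year")
print(f"Centre ({MACHINES} machines): {centre_tbp:.1f} Tbp/year")
print(f"Raw data in BAM: {bam_tbytes:.1f} Tbytes/year")
```

Without intermediate rounding the result is slightly above the slide's 70 Tbytes; the slide's figure follows from rounding to 44 Tbp first (44 x 1.6 = 70.4).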

Alignment + BAM improvement

SV Calling: SVMerge

Simplistic Model: Cloud as compute resource

Processes 1. Align

BAM VCF

SRF/Fastq/BAM (2 Mbit/s)

BAM + VCF (2 Mbit/s)

Variant calling (n x SNP callers, n indel callers, SV callers)

Sequencing Center/Institute

3,240 days to upload!
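The upload estimate follows from the data volume and the link speed. The sketch below assumes the 70 Tbytes of raw BAM quoted earlier, a sustained 2 Mbit/s link with no protocol overhead, and decimal units; the function and variable names are hypothetical.

```python
# Rough upload-time estimate for moving raw data to the cloud.
SECONDS_PER_DAY = 86_400


def upload_days(n_bytes: float, link_bits_per_sec: float) -> float:
    """Days needed to push n_bytes over a link of the given speed."""
    seconds = n_bytes * 8 / link_bits_per_sec  # bytes -> bits, then bits / (bits/s)
    return seconds / SECONDS_PER_DAY


RAW_BAM_BYTES = 70e12   # 70 Tbytes of BAM per year (decimal units assumed)
VCF_BYTES = 1e9         # upper bound from the slides: 1000G pilot VCF < 1 Gbyte
LINK_BPS = 2e6          # 2 Mbit/s, assumed sustained with no overhead

bam_days = upload_days(RAW_BAM_BYTES, LINK_BPS)
vcf_hours = upload_days(VCF_BYTES, LINK_BPS) * 24

print(f"Raw BAM: about {bam_days:.0f} days ({bam_days / 365:.1f} years)")
print(f"VCF:     about {vcf_hours:.1f} hours")
```

Under these assumptions the raw BAM takes roughly 3,240 days, while a 1 Gbyte VCF transfers in about an hour, which is why only the processed calls are practical to move over such a link.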

Move the raw data generation to the compute

VCF

Variant calling (n x SNP callers, n indel callers, SV callers)

BAM VCF

Sequencing Center/Institute

Large Collaborative Projects: Cloud centric model

VCF

Analysis groups
