Next generation sequencing in the cloud computing era
Slides from a recent discussion session in Cambridge
Vertebrate Resequencing Informatics 17th November, 2009
Files, Tools, and Bioinformatics in the Cloud
Thomas Keane
Vertebrate Resequencing Informatics WTSI [email protected]
DATA is the problem!
NGS means large volumes of raw data
Previously SRF (~8-10 bytes per bp), now BAM (~1.6 bytes per bp)
How much data can a sequencing machine produce?
20 Gbp per lane, 16 lanes per run (1 run = 1.5 weeks) => 11 Tbp/year
Small sequencing center: 4 machines? 44 Tbp per year!
Raw data in BAM: 70 Tbytes
Processed calls much smaller: 1000G pilot VCF < 1 Gbyte
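The figures above can be checked with a few lines of arithmetic. This is only a sketch: the function and constant names are illustrative, a 52-week year and decimal units (1 T = 1e12) are assumptions not stated on the slide, and the bytes-per-bp value is the approximate BAM figure quoted above.

```python
# Back-of-the-envelope check of the sequencing output figures.
# Values taken from the slide:
GBP_PER_LANE = 20        # Gbp produced per lane
LANES_PER_RUN = 16       # lanes per run
WEEKS_PER_RUN = 1.5      # duration of one run
MACHINES = 4             # "small sequencing center"
BAM_BYTES_PER_BP = 1.6   # approximate BAM storage cost

# Assumption (not on the slide): machines run back to back for 52 weeks.
WEEKS_PER_YEAR = 52


def tbp_per_machine_per_year():
    """Yearly output of one machine, in Tbp."""
    runs_per_year = WEEKS_PER_YEAR / WEEKS_PER_RUN
    gbp_per_run = GBP_PER_LANE * LANES_PER_RUN
    return runs_per_year * gbp_per_run / 1000  # Gbp -> Tbp


def bam_tbytes_per_year(machines=MACHINES):
    """Yearly BAM storage for a center with `machines` machines, in Tbytes."""
    return tbp_per_machine_per_year() * machines * BAM_BYTES_PER_BP


if __name__ == "__main__":
    print(f"Per machine: {tbp_per_machine_per_year():.1f} Tbp/year")
    print(f"Center:      {tbp_per_machine_per_year() * MACHINES:.1f} Tbp/year")
    print(f"BAM storage: {bam_tbytes_per_year():.1f} Tbytes/year")
```

Without intermediate rounding the storage estimate comes out at about 71 Tbytes; the 70 Tbytes on the slide follows from rounding to 44 Tbp first (44 x 1.6 = 70.4).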
Alignment + BAM improvement
SV Calling: SVMerge
Simplistic Model: Cloud as compute resource
Processes 1. Align
BAM VCF
SRF/Fastq/BAM (2 Mbps)
BAM + VCF (2 Mbps)
Variant calling (n x SNP callers, n indel callers, SV callers)
Sequencing Center/Institute
3,240 days to upload!
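The upload estimate can be reproduced from the 70 Tbytes of BAM data and the 2 Mbps link. The sketch below assumes that the link speed means 2 megabits per second and that units are decimal (1 Tbyte = 1e12 bytes), which is the reading that matches the slide's figure; the function name and the faster link used for comparison are hypothetical.

```python
SECONDS_PER_DAY = 86400


def upload_days(tbytes, mbps):
    """Days needed to transfer `tbytes` (decimal Tbytes) over a link of
    `mbps` megabits per second, ignoring protocol overhead and downtime."""
    bits = tbytes * 1e12 * 8          # Tbytes -> bits
    seconds = bits / (mbps * 1e6)     # bits / (bits per second)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures from the slides: 70 Tbytes of BAM over a 2 Mbps link.
    print(f"2 Mbps: {upload_days(70, 2):,.0f} days")
    # Hypothetical 1 Gbps link, for comparison only.
    print(f"1 Gbps: {upload_days(70, 1000):.1f} days")
```

At 2 Mbps the transfer takes roughly 3,240 days, close to nine years, so a single year of raw data could never be uploaded before the next year's data arrives.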
Move the raw data generation to the compute
VCF
Variant calling (n x SNP callers, n indel callers, SV callers)
BAM VCF
Sequencing Center/Institute
Large Collaborative Projects: Cloud centric model
VCF
Analysis groups