Large Scale Resequencing: Approaches and Challenges

AGBT Tutorial Workshop 15th February, 2012


Thomas Keane
Vertebrate Resequencing Informatics group
Wellcome Trust Sanger Institute
Hinxton, Cambridge, UK



[Figure: Sanger total sequence (2007-2009), in Gbp]


[Figure: Sanger total sequence to date, in Gbp]


Vertebrate Resequencing Informatics Group

• Established in 2008 with Jim Stalker
• PIs: Richard Durbin and David Adams

• Initial projects
  • 1000 Genomes Project (http://www.1000genomes.org)
    • Data processing, releases, aligner evaluation, sequencing
    • Pilot 2008-2009: ~5Tbp (Nature 2010;467)
    • Phase 1 2009-2011: ~30Tbp
    • Phase 2 2011-: ~36.9Tbp (low coverage, Illumina only)
  • Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
    • Sequencing 17 laboratory mouse strains
    • SNPs, indels, SVs, de novo assembly
    • ~1.2Tbp (Nature 2011;477)


UK10K

Investigating the role of rare genetic variants in health and disease

Whole genome cohorts: 4,000 individuals across two well-established and deeply phenotyped UK cohorts with ongoing longitudinal phenotype collection
• TWINSUK – 2,000
• ALSPAC – 2,000
• 6x (18Gbp) per sample

Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
• Neurodevelopmental diseases – 3,000 (e.g. schizophrenia, autism spectrum disorders)
• Obesity – 2,000 (e.g. severe childhood onset obesity)
• Rare diseases – 1,000 (e.g. severe insulin resistance, congenital heart disease, ciliopathies)
• 5Gbp per sample

Expect to generate ~100Tbp by end 2012
• ~40Tbp from BGI
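As a quick sanity check on the ~100Tbp target, the per-sample figures on this slide can be multiplied out. The sketch below uses only the sample counts and per-sample yields quoted here; the variable names are illustrative.

```python
# Planned UK10K yield, using only the sample counts and per-sample yields
# quoted on the slide. Variable names are illustrative.
GBP_PER_TBP = 1_000

whole_genome_samples = 2_000 + 2_000       # TWINSUK + ALSPAC
whole_genome_gbp_per_sample = 18           # 6x per sample
exome_samples = 3_000 + 2_000 + 1_000      # neurodevelopmental + obesity + rare
exome_gbp_per_sample = 5

whole_genome_tbp = whole_genome_samples * whole_genome_gbp_per_sample / GBP_PER_TBP
exome_tbp = exome_samples * exome_gbp_per_sample / GBP_PER_TBP
total_tbp = whole_genome_tbp + exome_tbp

print(f"Whole genomes: {whole_genome_tbp:.0f} Tbp")
print(f"Exomes:        {exome_tbp:.0f} Tbp")
print(f"Total:         {total_tbp:.0f} Tbp")
```

The whole genome cohorts dominate the total, which is why the storage and compute planning later in the talk is driven mainly by the low-coverage whole genome data.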


Current Status

Recently passed the 1000 Genomes Project in terms of total Gbp


What are the challenges?

NGS

Storage

Compute Power

Software/Workflows


Data Production Workflow

[Figure: lane-level production workflow. Stage labels: Import (per-lane Fastq files), Alignment (bwa, smalt etc.) producing per-lane BAMs, Improvement, Merge Up (library merge, then sample/platform merge, e.g. NA34842, NA87465), Freeze.]


Data Production Workflow

[Figure: merge across. Per-sample BAMs (NA19294, NA18943, NA19305, ..., NA19309) are merged into cross-sample BAMs per chromosome (Chr1, Chr2, Chr3, ...), with each sample identified by read group (RG:NA19294, RG:NA18943, RG:NA19305, ...).]

Variant Calling
• SNPs/indels: samtools, GATK, VQSR, BEAGLE/Impute2
• SVs: Genome STRiP, SVMerge
• Final VCF
• VEP Annotation
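The "merge across" step can be pictured as one merge job per chromosome, each taking every per-sample BAM as input. The sketch below only builds `samtools merge -R <region>` command lines and does not run them. The file paths and the output naming scheme are hypothetical, and it assumes each input BAM is indexed and already carries its own @RG header so that samples stay distinguishable after merging. It is not the group's actual pipeline code.

```python
from typing import Dict, List


def merge_across_commands(sample_bams: Dict[str, str],
                          chromosomes: List[str]) -> List[List[str]]:
    """Build one `samtools merge` command per chromosome across all samples."""
    # Sort by sample name so the command lines are reproducible.
    inputs = [sample_bams[sample] for sample in sorted(sample_bams)]
    commands = []
    for chrom in chromosomes:
        output = f"cross_sample.chr{chrom}.bam"  # hypothetical naming scheme
        # -R restricts the merge to one region, giving a per-chromosome BAM.
        commands.append(["samtools", "merge", "-R", chrom, output, *inputs])
    return commands


# Sample names are from the slide; the paths are made up for illustration.
bams = {
    "NA19294": "/data/NA19294.bam",
    "NA18943": "/data/NA18943.bam",
    "NA19305": "/data/NA19305.bam",
}
for command in merge_across_commands(bams, ["1", "2", "3"]):
    print(" ".join(command))
```

Splitting by chromosome keeps each cross-sample BAM to a manageable size and lets variant calling run in parallel across regions.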


Storage Challenges

Expect ~200Tbp of sequence in 2011-2012
• Working estimate including processing, release, and variant calling: 10 bytes per bp

Storage considerations
• Scalability – can we easily add more storage units?
• Backup and disaster recovery – what do we really need to keep?
• Performance – sufficient I/O throughput to serve compute nodes
• Cost

Data Formats
• Standardised formats – BAM & VCF

Minimise the number of copies
• Aim for two copies at most – original lanes + release (stripped) BAM
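The working estimate translates directly into a disk footprint. A minimal sketch of the arithmetic, using only the two figures on this slide and assuming decimal units (1 PB = 10^15 bytes):

```python
# Storage footprint implied by the working estimate on the slide.
BYTES_PER_BP = 10        # includes processing, release, and variant calling
expected_tbp = 200       # expected sequence in 2011-2012

total_bytes = expected_tbp * 10**12 * BYTES_PER_BP
total_pb = total_bytes / 10**15   # decimal petabytes (assumption)

print(f"{expected_tbp} Tbp at {BYTES_PER_BP} bytes/bp is about {total_pb:.0f} PB")
```

At this scale every extra copy of the data costs petabytes, which is the motivation for keeping at most two copies.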


A Tiered Storage Solution

[Figure: tiered storage diagram. Storage levels are connected to the CPU farm over links labelled 3Gb/sec and 800Mb/sec, with two off-site copies; each level is annotated with its relative cost and size.]

Level 1
• Data: Current release vertical BAMs
• Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)

Level 2
• Data: Lane level BAMs
• Processes: Alignment, recalibration, local realignment

Level 3
• Data: Previous release BAMs + variant calls backup
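One way to make the tiering explicit is a small lookup table that a pipeline could consult when deciding where a job should read and write. The sketch below encodes only the level descriptions from this slide; the dictionary layout and the function name are hypothetical.

```python
# Storage levels as described on the slide; the structure is illustrative.
STORAGE_LEVELS = {
    1: {"data": ["current release vertical BAMs"],
        "processes": ["BAM merging", "BAM splitting", "variant calling"]},
    2: {"data": ["lane level BAMs"],
        "processes": ["alignment", "recalibration", "local realignment"]},
    3: {"data": ["previous release BAMs", "variant calls backup"],
        "processes": []},  # backup only, no processing listed
}


def level_for_process(process: str) -> int:
    """Return the storage level on which a named process is expected to run."""
    wanted = process.lower()
    for level, spec in STORAGE_LEVELS.items():
        if wanted in (name.lower() for name in spec["processes"]):
            return level
    raise KeyError(f"no storage level configured for process: {process}")


print(level_for_process("alignment"))
print(level_for_process("variant calling"))
```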


Data release + archiving: iRODS

Rule-Oriented Data management systems
• Open source – origins in particle physics world
• Most important feature of iRODS is the Rule Engine
• Akin to source control system

Customise own application level metadata
• e.g. run, lane, plex, sample, library...

Stores/searches key-value metadata on files
• List all files from UK10K studies:

    imeta -z seq qu -d study like 'UK10K_%'
    /seq/5363/5363_1.bam
    /seq/5363/5363_2.bam
    (.....and a whole lot more)

• Get metadata about a file:

    imeta ls -d /seq/6534/6534_3#7.bam sample
    attribute: sample
    value: QTL191953

Sanger production: BAM files from runs per lane per plex deposited
• BMC Bioinformatics 2011, 12:361

Recently adopted for UK10K internal data release and archiving
• Users use metadata queries to find their data
• Files can be part of multiple releases

[Figure: iRODS spanning storage servers nfs01, nfs02, nfs03 and nfs20, with an off-site copy.]

http://www.irods.org
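The metadata queries on this slide amount to filtering files by key-value attributes with SQL-style `like` wildcards. The sketch below imitates that behaviour against an in-memory dictionary so the matching rule is easy to see. It does not use or reproduce the iRODS API, and the catalogue paths and study names are hypothetical.

```python
import re
from typing import Dict, List


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate an SQL LIKE pattern (% = any run, _ = any one char) to a regex."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")
        elif char == "_":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.DOTALL)


def query(catalogue: Dict[str, Dict[str, str]], attribute: str, pattern: str) -> List[str]:
    """Return paths whose metadata value for `attribute` matches `pattern`."""
    regex = like_to_regex(pattern)
    return sorted(
        path for path, metadata in catalogue.items()
        if attribute in metadata and regex.fullmatch(metadata[attribute])
    )


# Hypothetical catalogue: paths, studies and samples are made up.
CATALOGUE = {
    "/seq/0001/0001_1.bam": {"study": "UK10K_EXAMPLE_A", "sample": "S1"},
    "/seq/0001/0001_2.bam": {"study": "UK10K_EXAMPLE_B", "sample": "S2"},
    "/seq/0002/0002_1.bam": {"study": "OTHER_STUDY", "sample": "S3"},
}

for path in query(CATALOGUE, "study", "UK10K_%"):
    print(path)
```

Because files are found by metadata rather than by path, the same file can belong to several releases without being copied.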


Compute Pipeline Management: VRPipe

VRPipe
• Managed and automated execution of sequences of arbitrary software against massive datasets across large compute clusters
• Error handling, optimal memory requests, batching of jobs, retrying failures, failure reporting, highly extendable, detailed job statistics

1000 Genomes Phase 2 processed through VRPipe
• Tracked ~1 million jobs
• Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
• bwa_aln_fastq: ~2443 days total serial wall time
• Mean memory: 941MB/job (max 5637)

2012
• Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
• Management front-ends
• Create distributable VM for cloud rollout

http://www.github.com/VertebrateResequencing/vr-pipe/wiki
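The job statistics above are easier to interpret per job. A small sketch of the arithmetic, treating "~1 million" as exactly 1,000,000 jobs (an approximation) and using only the wall times quoted on the slide:

```python
from datetime import timedelta

# Figures quoted on the slide for 1000 Genomes Phase 2.
total_wall = timedelta(days=9886, hours=3, minutes=43, seconds=25)
bwa_aln_wall = timedelta(days=2443)
tracked_jobs = 1_000_000   # "~1 million", taken as exact for the estimate

mean_job_minutes = total_wall.total_seconds() / tracked_jobs / 60
bwa_fraction = bwa_aln_wall / total_wall   # timedelta / timedelta gives a float

print(f"Mean serial wall time per job: {mean_job_minutes:.1f} minutes")
print(f"Share spent in bwa_aln_fastq:  {bwa_fraction:.0%}")
```

Many short jobs with one dominant step is the profile where batching and accurate memory requests pay off most on a shared cluster.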



Even more scale up in 2012 – HiSeq 2500

• Currently takes 1-2 weeks to sequence a human genome
• High depth human genomes in a single day – Illumina HiSeq 2500
• Caucasian family with a severe T-cell deficiency in affected sibling
• Single run on HiSeq 2500 by Illumina per individual

Sample     PF Yield (Gbp)   % Align   % ≥Q30   Mismatch R1 (%)   Mismatch R2 (%)   Run time (hrs)
Father     117.7            89        92.6     0.4               0.5               25.5
Mother     125.7            90.2      92.8     0.4               0.5               25.5
Affected   124.4            90.3      92.4     0.4               0.5               25.5
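To relate these yields to sequencing depth, the aligned yield can be divided by the genome size. The sketch below assumes a human genome of roughly 3.1 Gbp, a figure that is not on the slide, and ignores duplicates and uneven coverage, so the result is only an approximate upper bound.

```python
# Assumption (not from the slides): human genome size of about 3.1 Gbp.
GENOME_SIZE_GBP = 3.1

# (PF yield in Gbp, % aligned) per sample, from the HiSeq 2500 run table.
RUNS = {
    "Father": (117.7, 89.0),
    "Mother": (125.7, 90.2),
    "Affected": (124.4, 90.3),
}


def aligned_depth(yield_gbp: float, percent_aligned: float,
                  genome_size_gbp: float = GENOME_SIZE_GBP) -> float:
    """Approximate mean depth from aligned bases only."""
    return yield_gbp * (percent_aligned / 100.0) / genome_size_gbp


depths = {sample: aligned_depth(*values) for sample, values in RUNS.items()}
for sample, depth in depths.items():
    print(f"{sample}: ~{depth:.0f}x aligned depth")
```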


What does the data look like?


Upcoming Changes in 2012

We cannot keep all of the data
• 2007-2008: Keep everything including images from runs
• 2009: BAM/Fastq – all of the base quality information
• 2010-2011: Stripping original qualities and other unused tags
• 2012-: Current formats contain lots of repetition
  • Reference based compression
  • Reducing quality information e.g. quality binning or quality budgets
  • Potential formats: CRAM and/or Reduced BAM
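Quality binning reduces the number of distinct quality values so that the quality string compresses better, at the cost of some resolution. The sketch below shows the idea with a made-up binning scheme. The bin boundaries and representative values are hypothetical and are not taken from CRAM, Reduced BAM, or any vendor specification.

```python
import bisect
from typing import List

# Hypothetical bins: 0-9, 10-19, 20-29, 30-39, 40 and above.
BIN_UPPER_BOUNDS = [9, 19, 29, 39]
BIN_REPRESENTATIVES = [5, 15, 25, 35, 40]


def bin_quality(quality: int) -> int:
    """Map a Phred quality score to the representative value of its bin."""
    return BIN_REPRESENTATIVES[bisect.bisect_left(BIN_UPPER_BOUNDS, quality)]


def bin_qualities(qualities: List[int]) -> List[int]:
    """Apply quality binning to every base of a read."""
    return [bin_quality(q) for q in qualities]


original = [2, 9, 10, 33, 38, 41, 40, 27]
binned = bin_qualities(original)
print(binned)
print(f"distinct values: {len(set(original))} before, {len(set(binned))} after")
```

Fewer distinct symbols means lower entropy per base, which is where the compression gain comes from.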


CRAM Format

[Figure: CRAM models for compression and current CRAM performance (log-scale axis from 0.1 to 100), comparing untreated data with the CRAM lossless, substitutions/insertions, and combination models. An example read TGAGCTCTAAGTACC is shown with quality strings 329183050298757, -2---30---9---7 and 002020010022212; panel labels: Do nothing, Lossless, Quality lossy, Horizontal, Vertical.]

CRAM v0.6 released 13.2.12:
• Pairing information preservation regardless of distance
• Revised and improved lossless mode
• Option to preserve all unmapped reads
• Performance and bug fixes
• Arbitrary tags

http://www.ebi.ac.uk/ena/about/cram_toolkit
Source: Ewan Birney/Guy Cochrane, EBI
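The core idea of reference based compression is that an aligned read rarely needs its bases stored: a position, a length, and the few differences from the reference are enough. The sketch below illustrates this for substitutions only. It is a toy model, not the CRAM encoding itself, and the reference string is invented around the example read from the slide.

```python
from typing import List, Tuple

# (alignment start, read length, [(offset in read, read base), ...])
Encoded = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, position: int, read: str) -> Encoded:
    """Store a read as its position plus substitutions against the reference."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, base) for i, (ref_base, base) in enumerate(zip(window, read))
             if ref_base != base]
    return position, len(read), diffs


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read by patching the reference window with the substitutions."""
    position, length, diffs = encoded
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Hypothetical reference; the read is the example sequence from the slide.
reference = "ACTGAGCTCTTAGTACCGT"
read = "TGAGCTCTAAGTACC"

encoded = encode_read(reference, 2, read)
print(encoded)
print(decode_read(reference, encoded) == read)
```

A real format also has to handle insertions, deletions, clipping, unmapped reads and quality values, which is where the lossless and lossy modes listed above come in.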


URLs
• VRPipe: https://github.com/VertebrateResequencing/vr-pipe
• iRODS@Sanger: BMC Bioinformatics 2011, 12:361
• http://www.slideshare.net/thomaskeane

Any questions?

Richard Durbin

David Adams
