
GenomeVIP: A Genomics Analysis Pipeline for Cloud Computing with Germline and Somatic Calling on Amazon’s Cloud

R. Jay Mashl

October 20, 2014

Turnkey Variant Analysis Project

tvap.genome.wustl.edu

Provides a collection of analysis tools and computational frameworks for streamlined discovery and interpretation of genetic variants

• Multi-tool variant discovery

• Cloud computing

• Scalability

• Extensibility

Tools: Pindel, VarScan, BreakDancer, GenomeSTRiP

Platforms: local, cloud (AWS)

Genome Variant Investigation Portal

Poster #1678M (Monday)

Genome Variant Investigation Portal

• Web server and interface for germline and somatic variant-discovery tools

• Concurrent pipelines (SNV, indel, SV) with parallelization

• Launchable on local machines or on the cloud through Amazon Web Services (AWS)

• Download results from AWS via web browser
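As a rough illustration of the "concurrent pipelines" idea, the sketch below runs three independent callers side by side with Python's standard `concurrent.futures`. The function `run_pipeline` is a hypothetical placeholder, not GenomeVIP's code; the slides do not show how the portal schedules its jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(kind: str, region: str) -> str:
    """Hypothetical stand-in for launching one variant caller on a region."""
    return f"{kind} pipeline finished for {region}"


def run_all(region: str) -> dict:
    """Submit the SNV, indel, and SV pipelines at once and collect their results."""
    kinds = ["SNV", "indel", "SV"]
    with ThreadPoolExecutor(max_workers=len(kinds)) as pool:
        futures = {kind: pool.submit(run_pipeline, kind, region) for kind in kinds}
        # result() blocks until the corresponding pipeline has completed
        return {kind: future.result() for kind, future in futures.items()}


# Region string taken from the demo later in the talk
statuses = run_all("22:36130000-36700000")
for kind, status in statuses.items():
    print(kind, "->", status)
```

In a real deployment each placeholder would start an external tool, so a process pool or a cluster scheduler might be a better fit than threads.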

• VarScan: Heuristic/statistical calling of single nucleotide variants (SNVs)

• Pindel: Indel detection for paired reads based on local realignment

• BreakDancer: Structural variant (SV) detection for paired reads

• GenomeSTRiP (Harvard U.): Structural variant detection and genotyping

Biological Discoveries (selected)

Discovery & genotyping for structural variants in populations

• ~14,000 deletion polymorphisms with allelic states (1000G pilot)

• Nature Genetics 43, 269-276 (2011)

Comprehensive molecular portraits of human breast tumours

• Identified four main types by combining data from five platforms

• Nature 490, 61-70 (2012)

Genomic Landscape of Non-Small Cell Lung Cancer in Smokers and Never-Smokers

• Of patients with lung cancer, smokers found to have 10x more mutations than non-smokers

• Cell 150, 1121-34 (2012)

Clonal evolution in relapsed acute myeloid leukaemia

• “Cancer” consists of multiple variants; founding clone may give rise to relapse clone; subclones may survive therapy and mutate further

• Nature 481, 506-510 (2012)

Application to APOL1: Demo

• Representative samples from PUR population from 1000 Genomes

• Analyze within the range chr22 : 36-37 Mbp for known variants:

Sample    Region          Variant       Isoforms

HG01242   22:36,661,906   A / G         G1 (non-silent)

HG01101   22:36,662,041   AATAATT / A   G2 (Δ6)

HG01049   22:36,133,448   Δ 767 bp
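A quick sanity check on the demo targets: the sketch below records the three positions from the table and confirms that each falls inside the chr22 36-37 Mbp analysis window. The `Region` type and `contains` helper are illustrative names, not part of GenomeVIP.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def contains(region: Region, chrom: str, pos: int) -> bool:
    """True if (chrom, pos) lies inside the region."""
    return chrom == region.chrom and region.start <= pos <= region.end


# Analysis window from the slide: chr22, 36-37 Mbp
WINDOW = Region("22", 36_000_000, 37_000_000)

# Sample -> (chromosome, position), copied from the table
TARGETS = {
    "HG01242": ("22", 36_661_906),
    "HG01101": ("22", 36_662_041),
    "HG01049": ("22", 36_133_448),
}

for sample, (chrom, pos) in TARGETS.items():
    print(sample, contains(WINDOW, chrom, pos))
```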

Login

• Select AWS

• Click Next

Sample & Reference Selection

• Entering path: Copy the given URI. Click Retrieve.

• Click on all the PUR low_coverage items to transfer them to the Selected bams textbox.

• Select reference hs37d5.chr22.fa.

• Click Next.

Specify path & retrieve

Select samples

Select reference (hs37d5.chr22.fa)

SNV Detection: VarScan

• Check VarScan

• Select Germline

• Select SNVs only

• Select All (pooled) samples

• Select User-defined region and enter “22:36130000-36700000”

• Keep p-value: 0.99

• Set Output vcf: True

• Click Next.
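For readers who want to see roughly what these form settings correspond to on the command line, here is a sketch that assembles a `samtools mpileup` command piped into VarScan's `mpileup2snp`. This is an assumption about an equivalent manual invocation, not the command GenomeVIP actually runs; the slides do not show it. The BAM file names and the jar path are hypothetical.

```python
def build_varscan_snv_commands(bams, reference, region,
                               p_value=0.99, varscan_jar="VarScan.jar"):
    """Return (mpileup_cmd, varscan_cmd) as argument lists; nothing is executed."""
    # samtools mpileup: -f reference FASTA, -r restricts output to a region
    mpileup_cmd = ["samtools", "mpileup", "-f", reference, "-r", region, *bams]
    # VarScan mpileup2snp reads the pileup from standard input when no file is given
    varscan_cmd = ["java", "-jar", varscan_jar, "mpileup2snp",
                   "--p-value", str(p_value), "--output-vcf", "1"]
    return mpileup_cmd, varscan_cmd


mpileup_cmd, varscan_cmd = build_varscan_snv_commands(
    bams=["sample1.bam", "sample2.bam"],   # hypothetical file names
    reference="hs37d5.chr22.fa",
    region="22:36130000-36700000",
)
print(" ".join(mpileup_cmd), "|", " ".join(varscan_cmd))
```

Passing several BAM files to one `mpileup` call is only one possible reading of "All (pooled) samples".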


Indel Detection: Pindel

• Check Run Pindel

• Select All (pooled) samples

• Select User-defined region and enter “22:36130000-36700000”

• Click Next.


SV Detection: BreakDancer

• Check BreakDancer

• In Step 1, select All (pooled) samples

• In Step 3, select Intra (ITX) only, user-defined region and enter “22:36130000-36700000”

• Click Next.


SV Detection & Genotyping: GenomeSTRiP

1. Check Run GenomeSTRiP

2. Verify reference is hs37d5.chr22.fa

3. Select mask human_g1k_v37.mask.36.fasta

4. GC normalization: True, with cn2_mask_g1k_v37.fasta

5. Chromosome: User-defined with “22:36130000-36700000”

6. Variant size: 100 bp – 100 kbp.
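To make the size window in step 6 concrete, the sketch below tests whether a variant length falls between 100 bp and 100 kbp. Treating both bounds as inclusive is an assumption; the slides do not say how GenomeSTRiP handles the endpoints. The two example lengths are the SVLEN values reported on the Results slide.

```python
MIN_SIZE = 100        # 100 bp
MAX_SIZE = 100_000    # 100 kbp


def in_size_range(svlen: int, lo: int = MIN_SIZE, hi: int = MAX_SIZE) -> bool:
    """True if the absolute variant length lies in [lo, hi].

    Deletions carry a negative SVLEN in VCF, hence abs().
    """
    return lo <= abs(svlen) <= hi


print(in_size_range(-762))  # 762 bp deletion: inside the window
print(in_size_range(-6))    # 6 bp deletion: below the 100 bp minimum
```

Under this window the 762 bp deletion is in scope for GenomeSTRiP, while the 6 bp deletion is left to the indel caller.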


Amazon AWS Submission

• Jobs have been tested to finish within a few minutes

Select machine type

Where to send results

Validate & submit

Results

22 36133341 DEL_1 T <DEL> …SVLEN=-762;SVTYPE=DEL

22 36662041 . AATAATT A . PASS END=36662047;HOMLEN=4;HOMSEQ=ATAA;SVLEN=-6;SVTYPE=DEL;

22 36661906 . A G . PASS ADP=7;WT=1;HET=0;HOM=0;NC=2

22 36662041 . AATAATT A . PASS ADP=4;WT=0;HET=1;HOM=0;NC=2;
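The records above follow the VCF column layout (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). A minimal sketch for pulling them apart is shown below; the helper names are illustrative, and the code splits on any whitespace because the slide text does not preserve the tabs of a real VCF file.

```python
def parse_info(info: str) -> dict:
    """Turn 'KEY=VALUE;FLAG;...' into a dict; bare flags map to True."""
    fields = {}
    for item in info.strip(";").split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_record(line: str) -> dict:
    """Parse the first eight VCF columns of one whitespace-separated record."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.split()[:8]
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt, "qual": qual, "filter": flt, "info": parse_info(info),
    }


record = parse_record(
    "22 36662041 . AATAATT A . PASS "
    "END=36662047;HOMLEN=4;HOMSEQ=ATAA;SVLEN=-6;SVTYPE=DEL;"
)
print(record["pos"], record["ref"], record["alt"], record["info"]["SVTYPE"])
```

Values in the INFO dict are kept as strings; convert fields such as SVLEN with `int()` where a number is needed.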

http://tvap.genome.wustl.edu/

Poster #1678M (this afternoon)

Jay Mashl (rmashl@genome.wustl.edu)

Kai Ye (kye@genome.wustl.edu)

Li Ding (lding@genome.wustl.edu)

...and with thanks to the Ding Lab members

National Human Genome Research Institute

Alternate slides

Amazon AWS S3 Data Retrieval

• Links to actual files to be generated, along with merged VCF

• Participants will identify variants in the output

• (Left) Prepared results available, in case of technology problems

Click links to download