Presenter: Huy Vuong, PhD Department of Biomedical Informatics Vanderbilt University 5/3/2013 Detection of somatic mutations: A data mining and a computational approach


Page 1

Detection of somatic mutations: A data mining and a computational approach

Page 2

Somatic single nucleotide variants (sSNV)
• Play a major role in tumorigenesis and cancer development
• Aim 1: Literature mining
  • Catalogue of Somatic Mutations In Cancer (COSMIC): the most comprehensive catalogue today
• Aim 2: Tumor-specific mutations in tumor-normal pairs

Mutations in COSMIC
• V1 (2004): 10,647
• V60 (7/2012): 340,585
• V61 (9/2012): 405,271
• V62 (11/2012): 745,924

Page 3

Classes of somatic mutations
• Point mutation:
  • Coding
    • Silent
    • Missense
    • Nonsense
  • Noncoding (UTR, ncRNA, miRNA…)
    • Intronic
    • Intergenic
• Small scale mutation:
  • Small insertions
  • Small deletions
• Large scale mutation: rearrangements
  • Intrachromosomal
    • Deletion
    • Inversion
    • Duplication
  • Interchromosomal
    • Translocation
    • Insertion

Page 4

Aim 1: Mining COSMIC For Protein Domain Interaction

Page 5

History of COSMIC

The evolution of the cosmos started with the Big Bang! (http://en.wikipedia.org/wiki/Big_Bang)

Page 6

Yet another COSMIC
• History of the Catalogue Of Somatic Mutations In Cancer (Wellcome Trust Sanger Institute)
  • COSMIC V1 (4th February, 2004)
  • COSMIC V64 (26th March, 2013)

Comparison V1 vs. V64

            V1 (2004)   V64 (2013)
Genes       4           24,394
Mutations   10,647      913,166
Tumours     57,444      847,698

Page 7

Advantages and Disadvantages

Advantages:
• Bimonthly updates
• Manually curated data, with low-quality data removed
• Consistent vocabulary (histology and tissue)
• Each mutation maps to a single version of the gene (no alternative splicing)
• FREE availability!!!

Disadvantages:
• Curation bias
• Many positive results, few negative results
• Other quality issues: experimental error, missing mutations
• Interpretation of mutation frequency

Page 8

Typical workflow

[Figure: histogram and distribution panels]

Page 9

Specific aims
• Map somatic mutations (SM) in COSMIC to protein structural models
• Identify SM in the pocket region of proteins
• Use statistical analysis to score SM in the context of cancer (specificity, sensitivity)

Page 10

Dataset and preprocessing step
• Data were downloaded from COSMIC version 62 via the BioMart interface as a TSV file (http://cancer.sanger.ac.uk/biomart/martview/)
• R was used to clean the data (i.e., remove duplicates) and import them into a SQLite database
• The database contained 776,917 mutations and 15 variables:
  1. Gene.Name
  2. CDS.Mutation.Syntax
  3. AA.Mutation.Syntax
  4. Zygosity
  5. Primary.Site
  6. Primary.Histology
  7. In.Cancer.Census
  8. Tumour.Source
  9. Genomic.Coordinates.GRCh37
  10. CDS.Mutation.Type
  11. AA.Mutation.Type
  12. Somatic.status
  13. Validation.status
  14. Entrez.Gene.ID
  15. COSMIC.Sample.ID
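The slide does this step in R. As a rough illustration of the same idea (drop exact duplicate rows, then load into SQLite), here is a minimal Python sketch using only the standard library. The table name, the three-column toy input, and its values are made up for the example; the real export has the 15 variables listed above.

```python
import csv
import io
import sqlite3


def load_unique_rows(tsv_text, conn, table="cosmic_mutations"):
    """Drop exact duplicate rows from a TSV export and load the rest into SQLite.

    `table` is a hypothetical table name chosen for this sketch.
    Returns the number of rows inserted.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    # Column names such as Gene.Name contain dots, so they must be quoted.
    columns = ", ".join('"{}" TEXT'.format(name) for name in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))

    seen = set()
    unique_rows = []
    for row in reader:
        key = tuple(row)
        if len(key) != len(header) or key in seen:
            continue  # skip malformed rows and exact duplicates
        seen.add(key)
        unique_rows.append(key)

    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), unique_rows
    )
    conn.commit()
    return len(unique_rows)


# Toy input with one duplicated row (illustrative values only).
SAMPLE_TSV = (
    "Gene.Name\tAA.Mutation.Syntax\tPrimary.Site\n"
    "KRAS\tp.G12D\tlung\n"
    "KRAS\tp.G12D\tlung\n"
    "BRAF\tp.V600E\tskin\n"
)

conn = sqlite3.connect(":memory:")
n_loaded = load_unique_rows(SAMPLE_TSV, conn)
```

Treating a duplicate as "every column identical" is an assumption; the slide does not say which columns defined a duplicate.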

Page 11

Protein pocket region
• Li et al. developed an algorithm to identify functional pocket regions in proteins
• The vast majority of disease-associated SNPs are located in pockets (Tseng and Li, PNAS, 2011)

Page 12

A case study: KRAS

About 64% of SM in KRAS are located in the functional pocket region.

Yu et al. (Nature Biotechnology, 2012) also reported that about 65% of disease-associated in-frame mutations are located on the interaction surfaces of the proteins associated with the diseases.
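A percentage like the KRAS figure above comes down to counting how many mutated residues fall inside the set of pocket residues. The sketch below shows that counting step only. The residue numbers are placeholders invented for the example; they are not the KRAS pocket definition or the COSMIC data behind the slide.

```python
def fraction_in_pocket(mutated_positions, pocket_residues):
    """Fraction of mutation records whose residue position lies in the pocket.

    mutated_positions: residue numbers, one entry per mutation record
    pocket_residues:   residue numbers that make up the pocket region
    """
    if not mutated_positions:
        raise ValueError("need at least one mutation")
    pocket = set(pocket_residues)
    hits = sum(1 for pos in mutated_positions if pos in pocket)
    return hits / len(mutated_positions)


# Hypothetical inputs for illustration only.
toy_mutations = [12, 12, 13, 61, 146, 20, 33, 117]
toy_pocket = {12, 13, 61, 117, 146}

toy_fraction = fraction_in_pocket(toy_mutations, toy_pocket)
```

Counting per mutation record (so recurrent positions count more than once) is a design choice made here; counting unique positions would give a different number.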

Page 13

Aim 2: Tumor-specific mutations in tumor-normal pairs

Page 14

Outline
• Challenges in detecting somatic single nucleotide variants (sSNV)
• GATK pipeline for calling sSNV
• Installing and running MuTect
• MuTect output
• Summary

Page 15

Detecting sSNV in cancer: challenge #1

Many sSNV occur at very low frequency in the genome (0.1 to 100 mutations per megabase)

Slide adapted from Mike Lawrence, TCGA Annual Symposium
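To put the per-megabase range on this slide in perspective, the short calculation below converts it to an expected number of sSNV per genome and to a per-base rate. The genome size of 3,000 Mb is a round-number assumption, not a figure from the slides.

```python
def expected_mutations(rate_per_mb, region_size_mb):
    """Expected number of mutations in a region, given a rate per megabase."""
    return rate_per_mb * region_size_mb


GENOME_MB = 3000  # assumption: roughly the size of the human genome in megabases

low_count = expected_mutations(0.1, GENOME_MB)   # lower end of the slide's range
high_count = expected_mutations(100, GENOME_MB)  # upper end of the slide's range

# The same rates expressed per base: one megabase is 1e6 bases.
low_per_base = 0.1 / 1e6
high_per_base = 100 / 1e6

print(f"{low_count:.0f} to {high_count:.0f} expected sSNV per genome")
print(f"{low_per_base:.0e} to {high_per_base:.0e} mutations per base")
```

Even at the high end, only about one base in ten thousand is mutated under these assumptions, which is why sequencing errors can easily outnumber true somatic variants.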

Page 16

Detecting sSNV in cancer: challenge #2

Tumors are impure (i.e. contain normal contaminating cells) and heterogeneous (i.e. contain sub-clones)

[Figure panel: C. Tri-clonal tumor]

Slide adapted from Christopher Miller, TCGA Annual Symposium, and Elaine Mardis
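Impurity and sub-clones matter because both lower the fraction of reads that carry a variant. The sketch below uses a simple textbook-style approximation that is not taken from the slides: every cell is assumed diploid at the locus, and mutated cells carry the variant on one of two copies.

```python
def expected_allelic_fraction(purity, clone_fraction, mutant_copies=1, total_copies=2):
    """Expected fraction of reads carrying the variant at a mutated site.

    purity:         fraction of cells in the sample that are tumor cells
    clone_fraction: fraction of tumor cells that carry the mutation
    Assumes all cells have `total_copies` copies of the locus (no copy-number
    changes), which is a simplification made for this sketch.
    """
    if not (0.0 <= purity <= 1.0 and 0.0 <= clone_fraction <= 1.0):
        raise ValueError("purity and clone_fraction must be within [0, 1]")
    return purity * clone_fraction * mutant_copies / total_copies


# Hypothetical sample: 60% tumor cells.
clonal = expected_allelic_fraction(purity=0.6, clone_fraction=1.0)
# Same sample, but the mutation is in one of three equally sized sub-clones.
subclonal = expected_allelic_fraction(purity=0.6, clone_fraction=1 / 3)
```

Under these assumptions a heterozygous mutation confined to one sub-clone of a tri-clonal tumor shows up in only about 10% of reads, far below the 50% expected for a germline heterozygous site.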

Page 17

GATK pipeline

GATK Best Practices: http://www.broadinstitute.org/gatk/guide/topic?name=best-practices

Page 18

NGS: Resources
• SEQanswers (http://seqanswers.com/)
• SEQanswers software list (http://seqanswers.com/wiki/Software/list)
• Galaxy (https://main.g2.bx.psu.edu/)
• NGS Catalog (http://bioinfo.mc.vanderbilt.edu/NGS/)

Slide adapted from Peilin Jia, PhD

Page 19

Two types of error
• USER ERRORS:
  • Due to a wrong command line or incorrect user input files
  • Please do not post this error to the GATK forum
• RUNTIME ERRORS:
  • Due to the program code
  • Do post this error to the GATK forum (together with the trace file)

Page 20

USER ERROR

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.2-25-g2a68eab):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/vuongh/Lungevity_Project/GATK/bwa/13_karosorted_RG_MarkDup_Realigned_Recal.bam} is malformed: read starts with deletion. Cigar: 9D18M15I38M26S. Although the SAM spec technically permits such reads, this is often indicative of malformed files. If you are sure you want to use this file, re-run your analysis with the extra option: -rf BadCigar
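The error message above complains about the CIGAR string 9D18M15I38M26S, which begins with a deletion. The sketch below parses a CIGAR string and applies a simplified version of that check. The rule used here (first operation after any leading clipping is D) is an assumption made for illustration; it is not GATK's BadCigar implementation.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Rebuilding the string catches stray characters that findall would skip.
    if not ops or "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return ops


def starts_with_deletion(cigar):
    """True if the first operation after leading soft/hard clips is a deletion."""
    for _, op in parse_cigar(cigar):
        if op in "SH":
            continue  # clipping does not count as the start of the alignment
        return op == "D"
    return False


def read_length(cigar):
    """Number of read bases implied by the CIGAR (M, I, S, = and X consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


flagged = starts_with_deletion("9D18M15I38M26S")  # CIGAR from the error message
```

A scan like this over the BAM file can help locate the offending reads before deciding whether to rerun with -rf BadCigar, as the message suggests.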

Page 21

BEST OF RUNTIME ERROR

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.4-7-g5e89f01):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: START (0) > (-1) STOP -- this should never happen -- call Mauricio!

Page 22

MuTect: a highly sensitive and specific sSNV caller
• Distinct features
  • Focuses on identifying low allelic fraction mutations caused by tumor heterogeneity, normal contaminating cells, and sub-clones
  • Uses a Bayesian model with allelic fraction as a parameter, which yields high sensitivity
  • A carefully tuned, elaborate set of filters yields high specificity
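To make "a Bayesian model with allelic fraction as a parameter" more concrete, here is a simplified log-odds sketch. It assumes a per-read likelihood that mixes the allelic fraction f with the base error rate, and compares "variant at fraction f" against "no variant, errors only". This form is a reconstruction for illustration and should be checked against the cited Cibulskis et al. paper; it is not MuTect's code, and the function names are hypothetical.

```python
import math


def read_likelihood(base, ref, alt, err, f):
    """Assumed probability of observing `base` given allelic fraction f and error rate err."""
    if base == ref:
        return f * err / 3 + (1 - f) * (1 - err)
    if base == alt:
        return f * (1 - err) + (1 - f) * err / 3
    return err / 3  # a third allele can only come from a sequencing error


def tumor_lod(bases, errors, ref, alt):
    """log10 odds of 'variant at the observed allelic fraction' vs. 'reference only'."""
    if not bases or len(bases) != len(errors):
        raise ValueError("bases and errors must be non-empty and of equal length")
    f_hat = sum(1 for b in bases if b == alt) / len(bases)  # observed alt fraction
    log_variant = sum(
        math.log10(read_likelihood(b, ref, alt, e, f_hat)) for b, e in zip(bases, errors)
    )
    log_reference = sum(
        math.log10(read_likelihood(b, ref, alt, e, 0.0)) for b, e in zip(bases, errors)
    )
    return log_variant - log_reference


# Toy pileups: 30 reads at base quality Q30 (error rate 0.001).
Q30 = 0.001
strong = tumor_lod(["G"] * 20 + ["A"] * 10, [Q30] * 30, ref="G", alt="A")
weak = tumor_lod(["G"] * 29 + ["A"] * 1, [Q30] * 30, ref="G", alt="A")
```

With these toy pileups, ten supporting reads push the score well above the 6.3 cutoff that appears later in the deck, while a single supporting read stays below it.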

Page 23

Overview of the detection of a somatic point mutation using MuTect

[Figure: MuTect workflow; labeled stages: Bayesian model, Variant filter, Panel of normals filter]

Cibulskis, K. et al. Nat Biotechnology (2013). doi:10.1038/nbt.2514

Page 24

Benchmarking mutation-detection methods

Advantages:
• High sensitivity at low allelic fraction (f=0.1)
• High specificity achieved by filters

Cibulskis, K. et al. Nat Biotechnology (2013). doi:10.1038/nbt.2514

Page 25

Filter options
• Proximal gap
• Poor mapping
• Triallelic site
• Strand bias
• Clustered position
• Observed in control
• Panel of normal samples

[Figure: strand bias, good vs. bad examples; Jia et al. PLoS ONE 7(6): e38470]

Page 26

Installing MuTect
• Installation (Linux)
  • Version 1.1.4 is available for download at http://www.broadinstitute.org/cancer/cga/mutect_download (must register an account at Broad)
  • Can also be built from source, available for download at http://www.nature.com/nbt/journal/v31/n3/extref/nbt.2514-S3.zip

Page 27

Preparing input
• Resources:
  • COSMIC VCF file: use b37_cosmic_v54_120711.vcf
  • dbSNP VCF file: use dbsnp_132_b37.leftAligned.vcf.gz
  • Human reference fasta: downloaded from the GATK reference bundle; use Homo_sapiens_assembly19.fasta and its *.fai and *.dict files
• Inputs:
  • Tumor BAM file and matched normal BAM file from read alignment tool output (e.g. BWA, TopHat)
  • BAM files need to be sorted and indexed
  • Recommendation: apply local realignment around indels and mark PCR duplicates, following the GATK best practices for variant detection
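Missing index files are a common source of the user errors described earlier. The helper below is a hypothetical preflight check, not part of MuTect or GATK. It assumes the usual naming conventions: a .bai index next to each BAM, and .fai and .dict files next to the reference fasta.

```python
import tempfile
from pathlib import Path


def missing_companions(bam_path, reference_fasta):
    """Return the expected input files that are missing (hypothetical helper)."""
    bam = Path(bam_path)
    ref = Path(reference_fasta)
    missing = []

    if not bam.exists():
        missing.append(str(bam))
    # BAM indexes are commonly named either sample.bam.bai or sample.bai.
    bai_candidates = [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]
    if not any(p.exists() for p in bai_candidates):
        missing.append(str(bai_candidates[0]))

    if not ref.exists():
        missing.append(str(ref))
    fai = Path(str(ref) + ".fai")          # e.g. ref.fasta.fai
    seq_dict = ref.with_suffix(".dict")    # e.g. ref.dict
    for companion in (fai, seq_dict):
        if not companion.exists():
            missing.append(str(companion))
    return missing


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "tumor.bam").touch()
    (folder / "ref.fasta").touch()
    before = missing_companions(folder / "tumor.bam", folder / "ref.fasta")

    (folder / "tumor.bam.bai").touch()
    (folder / "ref.fasta.fai").touch()
    (folder / "ref.dict").touch()
    after = missing_companions(folder / "tumor.bam", folder / "ref.fasta")
```

The check only looks at file names; it does not verify that a BAM is actually sorted or that an index is up to date.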

Page 28

Running MuTect
• Command line with all default parameters

java -Xmx4g -jar /scratch/vuongh/mutect_latest/muTect-1.1.4.jar \
  --analysis_type MuTect \
  --reference_sequence /ref/Homo_sapiens_assembly19.fasta \
  -cosmic /ref/hg19_cosmic_v54_120711.vcf \
  -dbsnp /ref/dbsnp_132_b37.leftAligned.vcf \
  --input_file:normal /Huy-RNAseq/1/accepted_hits.sorted.RG.bam \
  --input_file:tumor /Huy-RNAseq/2/accepted_hits.sorted.RG.bam \
  --out /out/1_2_cal_stats.out \
  --vcf /out/1_2_mutation.vcf \
  -cov /out/1_2_coverage.wig.txt \
  --enable_extended_output

Notes:
• Put all resource files (COSMIC, dbSNP and reference fasta) in folder ref
• Normal BAM file and index in folder 1, tumor BAM file and index in folder 2
• Output call stats and VCF file of mutation candidates in folder out
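When many tumor-normal pairs have to be processed, the command above can be assembled programmatically. The sketch below builds the argument list using only the flags that appear on the slide. The function name, its parameters, and the output-naming scheme via out_prefix are choices made for this example.

```python
def mutect_command(jar, reference, cosmic, dbsnp, normal_bam, tumor_bam, out_prefix, heap="4g"):
    """Build the MuTect argument list for one tumor-normal pair (hypothetical helper).

    Flags are copied as they appear in the slide's command line.
    """
    return [
        "java", f"-Xmx{heap}", "-jar", jar,
        "--analysis_type", "MuTect",
        "--reference_sequence", reference,
        "-cosmic", cosmic,
        "-dbsnp", dbsnp,
        "--input_file:normal", normal_bam,
        "--input_file:tumor", tumor_bam,
        "--out", f"{out_prefix}_cal_stats.out",
        "--vcf", f"{out_prefix}_mutation.vcf",
        "-cov", f"{out_prefix}_coverage.wig.txt",
        "--enable_extended_output",
    ]


# Paths taken from the slide; folder 1 holds the normal sample, folder 2 the tumor.
cmd = mutect_command(
    jar="/scratch/vuongh/mutect_latest/muTect-1.1.4.jar",
    reference="/ref/Homo_sapiens_assembly19.fasta",
    cosmic="/ref/hg19_cosmic_v54_120711.vcf",
    dbsnp="/ref/dbsnp_132_b37.leftAligned.vcf",
    normal_bam="/Huy-RNAseq/1/accepted_hits.sorted.RG.bam",
    tumor_bam="/Huy-RNAseq/2/accepted_hits.sorted.RG.bam",
    out_prefix="/out/1_2",
)
```

The list can be passed to subprocess.run(cmd, check=True) to launch the job; keeping it as a list avoids shell-quoting problems with paths.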

Page 29

Result
• Test data: RNA-seq data from squamous cell lung cancer patients (tumor/normal pair)
• Total run time: 6 hours on 8 Intel Nehalem CPUs (2.4 GHz), processing 65.1 million reads per sample
• View the result with Excel

Page 30

Example of MuTect output

contig  position  ref_allele  alt_allele  t_lod_fstar  tumor_f   contaminant_lod  failure_reasons  judgement
1       14470     G           A           8.631487     0.272727  -0.096458        normal_lod,alt_allele_in_normal,poor_mapping_region_alternate_allele_mapq  REJECT
1       14542     A           G           4.993144     0.076923  -0.228097        fstar_tumor_lod,possible_contamination,normal_lod,alt_allele_in_normal  REJECT
1       14574     A           G           4.82618      0.071429  -0.245647        fstar_tumor_lod,possible_contamination  REJECT
1       14653     C           T           137.966026   0.714286  -0.429894        normal_lod,alt_allele_in_normal  REJECT
1       14673     G           C           5.07638      0.030769  2.317242         fstar_tumor_lod,possible_contamination,alt_allele_in_normal,poor_mapping_region_alternate_allele_mapq  REJECT
1       139393    G           T           8.97833      0.3       -0.087734                         KEEP
1       788867    C           T           7.335518     0.285714  -0.061414                         KEEP
1       1321326   C           G           7.495658     0.333333  -0.052641                         KEEP
1       1498692   T           C           6.681093     0.2       -0.087736                         KEEP
1       1498813   T           C           6.706235     0.166667  -0.105281                         KEEP

Keep: 1143 (0.5%); Reject: 213000 (99.5%)
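The keep/reject totals above can be recomputed from the call stats file without Excel. The sketch below assumes a tab-delimited file with a header row containing a judgement column, as in the table, and skips any comment lines starting with #. The three sample rows are copied from the slide.

```python
import csv
from collections import Counter


def judgement_counts(call_stats_text):
    """Count KEEP and REJECT calls in tab-delimited MuTect call stats text."""
    lines = [
        line for line in call_stats_text.splitlines()
        if line.strip() and not line.startswith("#")  # drop blank and comment lines
    ]
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["judgement"] for row in reader)


def percent(part, total):
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


HEADER = "\t".join([
    "contig", "position", "ref_allele", "alt_allele", "t_lod_fstar",
    "tumor_f", "contaminant_lod", "failure_reasons", "judgement",
])
SAMPLE = "\n".join([
    HEADER,
    "\t".join(["1", "14574", "A", "G", "4.82618", "0.071429", "-0.245647",
               "fstar_tumor_lod,possible_contamination", "REJECT"]),
    "\t".join(["1", "139393", "G", "T", "8.97833", "0.3", "-0.087734", "", "KEEP"]),
    "\t".join(["1", "788867", "C", "T", "7.335518", "0.285714", "-0.061414", "", "KEEP"]),
])

counts = judgement_counts(SAMPLE)

# Totals reported on the slide.
keep_pct = percent(1143, 1143 + 213000)
reject_pct = percent(213000, 1143 + 213000)
```

Filtering the same rows on judgement == "KEEP" gives the candidate list that is passed on to annotation.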

Page 31

Distribution of keep versus reject calls
• Most reject calls are high allelic fraction sSNV
• Keep most of the low allelic fraction sSNV
• Mono-clonal ???

[Figure: density plot with cutoff threshold = 6.3; x-axis: allelic fraction f; y-axis: density]

Page 32

Variant annotation (Annovar)

Effect  Variant annotation  Chr  Start  End  Ref  Alt
nonsynonymous SNV  CLSTN1:NM_014944:exon2:c.C163T:p.L55F,CLSTN1:NM_001009566:exon2:c.C163T:p.L55F,  1  9833381  9833381  G  A
stopgain SNV  MASP2:NM_006610:exon10:c.T1236A:p.C412X,  1  11090294  11090294  A  T
nonsynonymous SNV  VPS13D:NM_018156:exon63:c.G11985C:p.L3995F,VPS13D:NM_015378:exon64:c.G12060C:p.L4020F,  1  12475169  12475169  G  C
nonsynonymous SNV  DHRS3:NM_004753:exon6:c.G852C:p.E284D,  1  12628426  12628426  C  G
nonsynonymous SNV  RSC1A1:NM_006511:exon1:c.C1741T:p.L581F,  1  15988104  15988104  C  T
nonsynonymous SNV  RAP1GAP:NM_001145657:exon9:c.T297A:p.H99Q,RAP1GAP:NM_001145658:exon8:c.T489A:p.H163Q,RAP1GAP:NM_002885:exon8:c.T297A:p.H99Q,  1  21940577  21940577  A  T
stopgain SNV  HSPG2:NM_005529:exon41:c.C5053T:p.R1685X,  1  22186457  22186457  G  A
nonsynonymous SNV  RPL11:NM_000975:exon2:c.C7G:p.Q3E,  1  24019099  24019099  C  G
nonsynonymous SNV  RPL11:NM_000975:exon2:c.A8C:p.Q3P,  1  24019100  24019100  A  C
synonymous SNV  RPS6KA1:NM_002953:exon22:c.G2207A:p.X736X,RPS6KA1:NM_001006665:exon21:c.G2234A:p.X745X,  1  26900691  26900691  G  A

Display 10 out of 432 genes
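Each entry in the variant annotation column packs five colon-separated fields (gene, transcript, exon, cDNA change, protein change), with one entry per transcript separated by commas. The sketch below splits that column into records. The function name and the dictionary keys are chosen for this example.

```python
def parse_annotation(field):
    """Split an Annovar-style annotation string into one record per transcript."""
    records = []
    for entry in field.split(","):
        entry = entry.strip()
        if not entry:
            continue  # the column ends with a trailing comma
        parts = entry.split(":")
        if len(parts) != 5:
            raise ValueError(f"unexpected annotation entry: {entry!r}")
        gene, transcript, exon, cdna_change, protein_change = parts
        records.append({
            "gene": gene,
            "transcript": transcript,
            "exon": exon,
            "cdna_change": cdna_change,
            "protein_change": protein_change,
        })
    return records


# Annotation string copied from the table above.
vps13d = parse_annotation(
    "VPS13D:NM_018156:exon63:c.G11985C:p.L3995F,"
    "VPS13D:NM_015378:exon64:c.G12060C:p.L4020F,"
)
genes = {record["gene"] for record in vps13d}
```

Collecting the gene field across all KEEP calls is one way to arrive at a gene count like the one reported on this slide.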

Page 33

Summary
• MuTect is a highly sensitive and specific tool for calling somatic SNVs
• Designed to detect low allelic fraction somatic mutations present in as few as 10% of cancer cells
• Easy to install and run on all OS
• Works on all NGS data
• Limitations:
  • Computationally intensive
  • Can’t call indels

Page 34

THANK YOU