Variant calling and how to prioritize somatic mutations and inherited variants in NGS data: an advanced overview. Sophia Derdak, Data Analysis Team CNAG [email protected]

Upload: vall-dhebron-institute-of-research-vhir

Post on 14-Jul-2015


Variant calling and how to prioritize somatic mutations and inherited variants in NGS data:

an advanced overview

Sophia Derdak, Data Analysis Team CNAG

[email protected]

Overview

1. Call genomic variants

2. Identify somatic mutations in cancer (2-sample comparisons)

3. Identify variants for Mendelian disorders (trio analysis)

4. Prioritize candidates for follow-up from previously selected variants
- advanced technical filtering and IGV visual inspection
- snpEff functional annotations
- population frequencies

1. Call genomic variants

From alignments to single nucleotide variants

Workflow diagram:
- Input: .bam alignment file
- Additional inputs: + .fasta reference genome file, + study-specific information
- Steps: bam processing (remove duplicates, realignment), variant calling, annotations, filtering
- Output: .vcf variant call file
- Inspect alignments

Variant calling: applications

Identification of genetic differences in comparison to a reference

- Variants specific for a
  - strain
  - variety
  - cultivar
  - race
- Inheritance
- De novo variants
- Somatic variants
- Affected vs. control group

Variant calling: Variants in the diploid genome

Identification of genetic differences in comparison to a reference

reference:
TGGACCATCTGGTTGAGCATGTGGGGGTCAACTCCCACATTCCCAGGGAGCCCCCGG

The true diploid genome of the sample:
TGGACCATCTGGTTGAGCATGTGGGGGTCAACTTCCACATTCCCAGGGAGCCCCCGG
TGGACCATCTGGTTGAGCACGTGGGGGTCAACTTCCACATTCCCAGGGAGGCCCCGG

Genotypes (positions counted from the first base shown):
- ref/ref, 0/0: homozygous reference (all positions where both copies match the reference)
- ref/alt, 0/1: heterozygous (position 20, T>C)
- alt/alt, 1/1: homozygous alternative (position 34, C>T)
- ref/alt, 0/1: heterozygous (position 51, C>G)

Variant calling: Variants in the sequencing data

Identification of genetic differences in comparison to a reference

reference:
TGGACCATCTGGTTGAGCATGTGGGGGTCAACTCCCACATTCCCAGGGAGCCCCCGG

Aligned sequencing data derived from the sample:
TGGACCATCTGGTTGAGCATGTGGGGGTCAACTTCCACATTCCCAGGGAGCCCCCGG
TGGACCATCTGGTTGAGCACGTGGGGGTCAACTTCCACATTCCCAGGGAGCCCCCGG
TGGACCATCTGGTTGAGCACGTGGGGGTCAACTTCCACATTCCCAGGGAGCCCCCGG
TGGACCATCTGGTTGAGCATGTGGGGGTCAACTTCCACATTCCCAGGGAGCCCCCGG
TGGACCATCTGGTTGAGCATGTGGGGGTCAACTTCCACATTCCCAGGGAGCCCCCGG
TGGACCATCTGGTTGAGCACGTGGGGGTCAACTTCCACATTCCCAGGGAGGCCCCGG
TGGACCATCTGGTTGAGCATGTGGGGGTCAACTTCCACATTCCCAGGGAGCCCCCGG
TGGACCATCTGGTTGAGCACGTGGGGGTCAACTTCCACATTCCCAGGGAGCCCCCGG
TGGACCATCTGGTTGAGCATGTGGGGGTCAACTTCCACATTCCCAGGGAGGCCCCGG
TGGACCATCTGGTTGAGCACGTGGGGGTCAACTTCCACATTCCCAGGGAGCCCCCGG

Genotypes and alternative allele fractions in the reads:
- ref/ref, 0/0: homozygous reference, 0% alternative allele
- ref/alt, 0/1: heterozygous, 50% alternative allele
- alt/alt, 1/1: homozygous alternative, 100% alternative allele
- ???: 20% alternative allele
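The allele fractions on this slide can be recomputed by counting mismatches column by column. The sketch below is only an illustration: the ten reads are rebuilt from the reference by substituting the bases seen at positions 20, 34 and 51, and `naive_genotype` is a hypothetical labelling rule that mirrors the slide, not how any real caller assigns genotypes.

```python
REFERENCE = "TGGACCATCTGGTTGAGCATGTGGGGGTCAACTCCCACATTCCCAGGGAGCCCCCGG"


def substitute(seq, changes):
    """Return seq with the given 1-based positions replaced by new bases."""
    bases = list(seq)
    for pos, base in changes.items():
        bases[pos - 1] = base
    return "".join(bases)


# Bases at positions 20 and 51 in each of the ten reads on the slide.
# Position 34 carries a T in every read.
READ_ALLELES = ["TC", "CC", "CC", "TC", "TC", "CG", "TC", "CC", "TG", "CC"]
READS = [
    substitute(REFERENCE, {20: a20, 34: "T", 51: a51})
    for a20, a51 in READ_ALLELES
]


def alt_allele_fractions(reference, reads):
    """Map each 1-based position with a mismatch to its non-reference fraction."""
    fractions = {}
    for index, ref_base in enumerate(reference):
        column = [read[index] for read in reads]
        alt_count = sum(base != ref_base for base in column)
        if alt_count:
            fractions[index + 1] = alt_count / len(column)
    return fractions


def naive_genotype(fraction):
    """Hypothetical rule: only the three textbook fractions get a genotype."""
    return {0.0: "0/0", 0.5: "0/1", 1.0: "1/1"}.get(fraction, "???")


for position, fraction in alt_allele_fractions(REFERENCE, READS).items():
    print(position, fraction, naive_genotype(fraction))
```

Real callers replace the exact-fraction rule with a statistical model of sequencing error and coverage, which is why a position at 20% can be called differently by different tools.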

Bioinformatic tools for single nucleotide variant calling

- samtools + bcftools (Sanger Institute, UK, and Broad Institute, US)

- Genome Analysis Tool Kit (GATK) (Broad Institute, US)

- VarScan (Washington University)

- Platypus (Wellcome Trust Centre, UK)

- freebayes (Boston College, US)

Keep in mind that different tools use different algorithms and thresholds, so results may vary A LOT.

Single sample variant results

raw vcf file (“all variants”)
-> mostly experiment-independent technical and quality filtering (well-covered positions with confident alternative allele)
-> filtered vcf file (“good quality variants”)

CHR  POS        REF         ALT                   GT
1    148588972  G           C                     0/1
1    154284894  A           G                     0/1
1    203923829  A           G                     0/1
1    243329075  T           C                     0/1
2    102968362  T           C                     0/1
2    122096456  G           A                     0/1
2    242612151  C           T                     1/1
3    56591283   TAAGCAGGGG  TAAGCAGGGGGAAGCAGGGG  0/1
4    146297387  CAAAAAAAA   CAAAAAAAAAA           0/1
6    116263181  T           C                     1/1
8    96070181   T           C                     1/1
9    129831659  T           A                     1/1
9    131454120  C           T                     1/1
10   29834095   G           A                     1/1
11   18159254   A           G                     1/1
11   35274829   A           G                     0/1
12   21628320   C           T                     0/1
15   42619508   C           T                     1/1
15   75336729   A           G                     0/1
16   84035844   T           C                     0/1

CHR  chromosome
POS  position on the chromosome
REF  sequence in the reference genome
ALT  alternative sequence detected in the sample
GT   genotype in the (diploid) sample
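As a rough illustration of the technical filtering step, the sketch below keeps VCF records whose site quality and total depth exceed chosen cutoffs. It assumes tab-separated VCF lines with a numeric QUAL in column 6 and a `DP` key in the INFO column; the cutoffs (`min_qual`, `min_depth`) and the QUAL/DP values in the example lines are invented for the demonstration and are not taken from the slides.

```python
def parse_info(info):
    """Split a VCF INFO column such as 'DP=45;MQ=60' into a dict of strings."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def passes_site_filters(vcf_line, min_qual=30.0, min_depth=10):
    """Hypothetical site-level filter on QUAL (column 6) and INFO/DP (column 8)."""
    columns = vcf_line.rstrip("\n").split("\t")
    qual = float(columns[5])
    depth = int(parse_info(columns[7]).get("DP", 0))
    return qual >= min_qual and depth >= min_depth


# Positions come from the slide; QUAL and DP are made-up illustration values.
EXAMPLE_LINES = [
    "1\t148588972\t.\tG\tC\t221.0\t.\tDP=45\tGT\t0/1",
    "2\t242612151\t.\tC\tT\t12.5\t.\tDP=40\tGT\t1/1",
    "6\t116263181\t.\tT\tC\t180.0\t.\tDP=4\tGT\t1/1",
]

kept = [line for line in EXAMPLE_LINES if passes_site_filters(line)]
print(len(kept), "of", len(EXAMPLE_LINES), "records kept")
```

Suitable cutoffs depend on the caller and the experiment, so they should be chosen from the data rather than copied from this sketch.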

2. Identify somatic mutations in cancer (2-sample comparisons)

Somatic variants

Compare two samples of the same individual (e.g. tumor-normal)

vcf file (“good quality variants - all genotypes”)

CHR  POS        REF  ALT  GT_normal  AC_normal  GT_tumor  AC_tumor
1    109856843  G    T    1/1        5,48       0/1       13,34
1    1421916    T    C    0/0        37,1       0/1       44,7
1    144951959  T    C    0/1        65,17      0/0       70,11
1    145016035  A    C    0/1        92,28      0/0       144,23
1    145112485  C    T    0/1        361,91     0/0       457,98
2    91886070   A    G    0/1        9,2        0/0       48,6
3    111685397  A    G    0/1        7,6        0/0       12,2
5    116052289  G    A    0/0        11,0       0/1       14,3
6    159029254  G    C    0/1        14,7       0/0       21,1
6    160132580  T    C    0/0        40,3       0/1       60,14
6    33022307   T    C    0/0        14,1       0/1       8,4
7    102016578  T    A    0/1        46,7       0/0       73,1
8    12425753   C    T    0/0        11,1       0/1       8,3
11   1016704    C    A    0/0        28,2       0/1       34,8
12   11506582   C    T    0/0        26,2       0/1       28,6
17   16722266   C    T    0/0        12,0       0/1       13,4
19   36642977   C    T    0/1        9,2        0/0       34,0
19   55209249   A    G    0/0        30,2       0/1       24,5
20   29628070   T    C    0/0        13,0       0/1       22,4
21   10794052   G    A    0/0        14,0       0/1       8,4

CHR  chromosome
POS  position on the chromosome
REF  sequence in the reference genome
ALT  alternative sequence detected in the sample
GT   genotype in the (diploid) sample, per sample
AC   allele count, number of (ref, alt) bases, per sample

Compare two samples of the same individual (e.g. tumor-normal)

vcf file (“good quality variants - all genotypes”)
-> Definition of “somatic variant”, use sample purity information:
   - Select variants with genotype 0/0 in the normal and 0/1 in the tumor sample
   - Additionally, select alternative allele frequency thresholds for normal and tumor sample using AC
-> filtered vcf file (“somatic variants”)

CHR  POS        REF  ALT  GT_normal  AC_normal  GT_tumor  AC_tumor
1    1421916    T    C    0/0        37,1       0/1       44,7
1    179528803  A    C    0/0        53,0       0/1       53,16
1    59096853   C    T    0/0        19,1       0/1       21,7
2    132236963  C    T    0/0        11,0       0/1       21,5
2    166756497  T    A    0/0        20,1       0/1       23,8
3    53910122   G    A    0/0        28,0       0/1       29,7
7    151962062  A    G    0/0        12,0       0/1       19,5
9    136083801  A    G    0/0        10,0       0/1       16,5
11   89407177   C    T    0/0        15,0       0/1       31,6
12   129298780  G    A    0/0        19,1       0/1       30,6
15   20454042   A    T    0/0        22,1       0/1       24,5
15   23113683   T    C    0/0        12,0       0/1       4,6
17   15468718   G    A    0/0        11,0       0/1       16,5
17   15468728   T    C    0/0        11,0       0/1       19,5
17   36365191   A    T    0/0        11,0       0/1       18,4
17   66195635   T    G    0/0        12,0       0/1       12,5
19   43783125   C    A    0/0        10,0       0/1       24,5
19   43783146   A    T    0/0        13,0       0/1       29,8
20   29628070   T    C    0/0        13,0       0/1       22,4
21   11181025   G    A    0/0        39,0       0/1       36,10
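A minimal sketch of the somatic selection described on this slide: genotype 0/0 in the normal, 0/1 in the tumor, plus alternative allele frequency limits derived from AC. The example rows are taken from the tumor-normal table; the two thresholds (`max_normal_af`, `min_tumor_af`) are placeholder assumptions, since suitable values depend on sample purity and are not given in the slides.

```python
from typing import NamedTuple, Tuple


class PairedCall(NamedTuple):
    chrom: str
    pos: int
    gt_normal: str
    ac_normal: Tuple[int, int]  # (ref, alt) base counts in the normal
    gt_tumor: str
    ac_tumor: Tuple[int, int]   # (ref, alt) base counts in the tumor


def alt_fraction(allele_counts):
    """Alternative allele frequency from a (ref, alt) count pair."""
    ref, alt = allele_counts
    total = ref + alt
    return alt / total if total else 0.0


def is_somatic(call, max_normal_af=0.05, min_tumor_af=0.10):
    """Genotype pattern 0/0 -> 0/1 plus hypothetical frequency thresholds."""
    return (
        call.gt_normal == "0/0"
        and call.gt_tumor == "0/1"
        and alt_fraction(call.ac_normal) <= max_normal_af
        and alt_fraction(call.ac_tumor) >= min_tumor_af
    )


CALLS = [
    PairedCall("1", 109856843, "1/1", (5, 48), "0/1", (13, 34)),
    PairedCall("1", 1421916, "0/0", (37, 1), "0/1", (44, 7)),
    PairedCall("5", 116052289, "0/0", (11, 0), "0/1", (14, 3)),
    PairedCall("6", 160132580, "0/0", (40, 3), "0/1", (60, 14)),
    PairedCall("7", 102016578, "0/1", (46, 7), "0/0", (73, 1)),
]

somatic = [call for call in CALLS if is_somatic(call)]
for call in somatic:
    print(call.chrom, call.pos, round(alt_fraction(call.ac_tumor), 3))
```

With these placeholder thresholds, the call at 6:160132580 is dropped because 3 of 43 bases in the normal already carry the alternative allele, even though its genotypes match the 0/0 and 0/1 pattern.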

3. Identify variants for Mendelian disorders (trio analysis)

Inheritance
De novo variants

Compare three samples of a pedigree

vcf file (“good quality variants - all genotypes”)

CHR  POS        REF  ALT  GT_daughter  GT_father  GT_mother
1    158165661  T    C    1/1          1/1        1/1
1    16388565   C    G    1/1          1/1        1/1
3    69112092   C    T    0/1          0/1        0/1
4    247414     C    A    0/0          0/0        0/0
6    30782303   C    T    0/0          0/0        0/0
6    31239827   C    T    0/1          0/1        0/1
7    23810849   T    C    0/0          0/1        0/1
7    99235522   C    T    0/1          0/1        0/1
8    58152832   G    A    0/0          0/1        0/1
9    19573306   G    A    0/0          0/0        0/0
9    27217822   G    A    1/1          1/1        1/1
9    38604980   G    A    0/0          0/0        0/0
12   124799387  G    A    0/1          0/1        0/1
13   50500409   C    T    1/1          1/1        1/1
14   74349313   C    T    1/1          0/1        0/1
15   86076749   T    G    0/0          0/1        0/1
16   46385880   C    T    0/1          0/0        0/0
17   78092063   G    A    0/1          0/1        0/1
19   8370120    T    C    0/1          1/1        1/1
20   5905550    G    A    0/0          0/1        0/1

CHR  chromosome
POS  position on the chromosome
REF  sequence in the reference genome
ALT  alternative sequence detected in the sample
GT   genotype in the (diploid) sample, per sample

Compare three samples of a pedigree

vcf file (“good quality variants - all genotypes”)
-> Apply model of inheritance, e.g. autosomal recessive: select variants with genotype 0/1 in the parents and 1/1 in the daughter
-> filtered vcf file (“recessively inherited variants”)

CHR  POS        REF  ALT  GT_daughter  GT_father  GT_mother
1    200827638  A    G    1/1          0/1        0/1
1    22158157   A    G    1/1          0/1        0/1
2    171256597  A    C    1/1          0/1        0/1
2    208976955  A    C    1/1          0/1        0/1
4    48496368   A    G    1/1          0/1        0/1
7    14017007   C    T    1/1          0/1        0/1
8    143310815  G    A    1/1          0/1        0/1
8    41517860   G    A    1/1          0/1        0/1
11   47437403   C    T    1/1          0/1        0/1
11   59837097   C    T    1/1          0/1        0/1
12   9833628    C    T    1/1          0/1        0/1
13   36699762   G    A    1/1          0/1        0/1
13   52523808   C    T    1/1          0/1        0/1
14   38256944   T    C    1/1          0/1        0/1
15   79026001   C    A    1/1          0/1        0/1
16   10788129   G    T    1/1          0/1        0/1
16   1498197    A    G    1/1          0/1        0/1
19   49640002   G    T    1/1          0/1        0/1
20   10026357   T    C    1/1          0/1        0/1
22   23657980   G    A    1/1          0/1        0/1

Compare three samples of a pedigree

vcf file (“good quality variants - all genotypes”)
-> Apply model of inheritance, e.g. de novo: select variants with genotype 0/0 in the parents and 0/1 in the daughter
-> filtered vcf file (“de-novo variants”)

CHR  POS        REF  ALT  GT_daughter  GT_father  GT_mother
1    13365778   A    G    0/1          0/0        0/0
1    144676632  T    C    0/1          0/0        0/0
1    144853029  C    G    0/1          0/0        0/0
1    16891333   C    T    0/1          0/0        0/0
3    143697451  T    G    0/1          0/0        0/0
3    72311749   G    T    0/1          0/0        0/0
3    9057481    C    A    0/1          0/0        0/0
5    39002519   C    T    0/1          0/0        0/0
6    33060143   A    G    0/1          0/0        0/0
6    37845185   C    A    0/1          0/0        0/0
7    1586741    G    T    0/1          0/0        0/0
8    49987965   A    C    0/1          0/0        0/0
9    39888209   C    A    0/1          0/0        0/0
12   92562268   G    T    0/1          0/0        0/0
12   94034134   G    T    0/1          0/0        0/0
13   19042019   G    T    0/1          0/0        0/0
15   84855648   C    A    0/1          0/0        0/0
16   46427389   T    C    0/1          0/0        0/0
17   10550780   A    G    0/1          0/0        0/0
22   20643742   C    A    0/1          0/0        0/0
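The two inheritance models from the trio slides can be written as simple genotype patterns. The sketch below is one possible encoding, with example rows taken from the unfiltered trio table; the function name and the returned labels are illustrative choices, and the patterns cover only the two models named on the slides.

```python
def classify_trio(gt_daughter, gt_father, gt_mother):
    """Return the inheritance model a genotype triple fits, or None."""
    if (gt_daughter, gt_father, gt_mother) == ("1/1", "0/1", "0/1"):
        return "recessive"  # both parents carriers, daughter homozygous alt
    if (gt_daughter, gt_father, gt_mother) == ("0/1", "0/0", "0/0"):
        return "de_novo"    # alternative allele absent from both parents
    return None


# (CHR, POS, GT_daughter, GT_father, GT_mother) rows from the trio table.
TRIO_CALLS = [
    ("1", 158165661, "1/1", "1/1", "1/1"),
    ("7", 23810849, "0/0", "0/1", "0/1"),
    ("14", 74349313, "1/1", "0/1", "0/1"),
    ("16", 46385880, "0/1", "0/0", "0/0"),
    ("19", 8370120, "0/1", "1/1", "1/1"),
]

by_model = {}
for chrom, pos, daughter, father, mother in TRIO_CALLS:
    model = classify_trio(daughter, father, mother)
    if model is not None:
        by_model.setdefault(model, []).append((chrom, pos))

print(by_model)
```

Apparent de novo calls are especially sensitive to low coverage in a parent, so candidates selected this way still need the technical checks described in the next section.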

4. Prioritize candidates for follow-up from previously selected variants

Prioritize variants: advanced technical filtering

SNVs and indels and tail distance bias/read position bias:

[IGV screenshots of example positions, labelled: ReadPosRankSum = 1.635; ReadPosRankSum = -0.434; ReadPosRankSum = -9.805; “No reads spanning this region”]

- samtools: VDB field (proportion), PV4 field (p-value)
- GATK: ReadPosRankSum field (Z-score)

Prioritize variants: advanced technical filtering

Strand bias, base quality bias and single nucleotide variants (in C-rich regions)

[Two screenshots of aligned reads]

- samtools: PV4 field (p-value)
- GATK: FS field (Phred-scaled p-value)
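To make "Phred-scaled p-value" concrete, the sketch below runs a two-sided Fisher's exact test on a 2x2 table of reference/alternative reads by forward/reverse strand and converts the p-value to the Phred scale. It illustrates the idea behind a strand bias score and is not GATK's or samtools' exact implementation; the read counts are invented.

```python
from math import comb, log10


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def table_probability(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        p
        for p in map(table_probability, range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)  # tolerance for floating point ties
    )
    return min(p_value, 1.0)


def phred_scaled(p_value):
    """Convert a p-value to the Phred scale: -10 * log10(p)."""
    return max(0.0, -10 * log10(max(p_value, 1e-300)))


# Rows: ref reads, alt reads. Columns: forward strand, reverse strand.
balanced = phred_scaled(fisher_exact_two_sided(20, 20, 10, 10))
one_sided = phred_scaled(fisher_exact_two_sided(20, 20, 12, 0))
print(round(balanced, 2), round(one_sided, 2))
```

A score near zero means the alternative allele is seen on both strands in the same proportion as the reference allele; a high score, as for the second table where all alternative reads are on one strand, flags a likely artifact.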

Prioritize variants: advanced technical filtering

More filters:

- per-sample depth

- per-sample genotype quality
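Per-sample depth and genotype quality are stored in the sample columns of a VCF file, keyed by the FORMAT column. A minimal sketch, assuming the FORMAT keys `DP` and `GQ` are present; the cutoffs (`min_depth`, `min_gq`) are placeholders for illustration, not recommendations from the slides.

```python
def sample_fields(format_column, sample_column):
    """Pair FORMAT keys (e.g. 'GT:DP:GQ') with one sample's values."""
    return dict(zip(format_column.split(":"), sample_column.split(":")))


def sample_passes(format_column, sample_column, min_depth=8, min_gq=20):
    """Hypothetical per-sample filter on depth (DP) and genotype quality (GQ)."""
    fields = sample_fields(format_column, sample_column)
    try:
        depth = int(fields["DP"])
        genotype_quality = float(fields["GQ"])
    except (KeyError, ValueError):
        return False  # missing or '.' values are treated as failing
    return depth >= min_depth and genotype_quality >= min_gq


print(sample_passes("GT:DP:GQ", "0/1:25:99"))
print(sample_passes("GT:DP:GQ", "0/1:5:99"))
```

In a two-sample or trio comparison the filter would be applied to every sample at a position, because a low-depth 0/0 call in the normal or in a parent is weak evidence that the alternative allele is absent.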

Phred quality scores

Phred quality scores are logarithmically linked to error probabilities: Q = -10 * log10(P), where P is the probability that the call is wrong.

Fields in variant results using Phred-scaled values, e.g.:
- Overall position confidence (QUAL)
- Consensus quality (FQ)
- Per-sample genotype quality (GQ)
- Per-sample strand bias (SP)

(in fastq and bam files:)
- Base quality (original application)
- Mapping quality

http://www.ncbi.nlm.nih.gov/pubmed/9521921
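The standard Phred relation is Q = -10 * log10(P), with P the error probability. A short sketch of the conversion in both directions; the function names are illustrative.

```python
from math import log10


def phred_to_error_probability(quality):
    """Error probability for a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(probability):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * log10(probability)


for quality in (10, 20, 30, 40):
    print(quality, phred_to_error_probability(quality))
```

Each step of 10 on the Phred scale corresponds to a tenfold drop in error probability, so Q20 means a 1 in 100 chance of error and Q30 means 1 in 1000.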

Prioritize variants: snpEff functional annotations

vcf file (“good quality variants – selected genotypes”)

CHR  POS        REF        ALT          Effect                 Effect_Impact
5    138266546  G          A            UPSTREAM               MODIFIER
10   112361870  A          G            SYNONYMOUS_CODING      LOW
5    131819798  T          C            DOWNSTREAM             MODIFIER
7    148504854  AGACTT     AGACTTGACTT  DOWNSTREAM             MODIFIER
16   83658649   C          G            INTRON                 MODIFIER
3    105422844  C          T            UPSTREAM               MODIFIER
12   11904983   A          G            INTRON                 MODIFIER
19   33792631   C          A            DOWNSTREAM             MODIFIER
9    5081780    G          A            DOWNSTREAM             MODIFIER
17   15468718   G          A            DOWNSTREAM             MODIFIER
5    151054227  T          C            EXON                   MODIFIER
X    123308030  A          G            EXON                   MODIFIER
21   43531632   T          C            NON_SYNONYMOUS_CODING  MODERATE
7    139026152  T          G            INTRON                 MODIFIER
5    1294086    C          T            SYNONYMOUS_CODING      LOW
10   112349204  A          G            INTRON                 MODIFIER
10   89720907   T          G            INTRON                 MODIFIER
6    15497422   A          G            INTRON                 MODIFIER
11   118360787  TAAGTAAAG  TAAG         INTRON                 MODIFIER
6    15500999   G          A            UPSTREAM               MODIFIER

CHR            chromosome
POS            position on the chromosome
REF            sequence in the reference genome
ALT            alternative sequence detected in the sample
Effect         genomic annotation of the variant position
Effect_Impact  4 classes: HIGH, MODERATE, LOW, MODIFIER

snpEff tool and documentation are available at: http://snpeff.sourceforge.net/SnpEff_manual.html

Prioritize variants: snpEff functional annotations

vcf file (“good quality variants – selected genotypes”)
-> Select variants with protein-coding effect: select Effect_Impact HIGH and MODERATE
-> filtered vcf file (“coding variants”)

CHR  POS        REF  ALT  Effect                 Effect_Impact
6    80196848   C    T    NON_SYNONYMOUS_CODING  MODERATE
3    138289221  C    T    NON_SYNONYMOUS_CODING  MODERATE
3    137743508  A    T    NON_SYNONYMOUS_CODING  MODERATE
7    143826697  C    G    NON_SYNONYMOUS_CODING  MODERATE
22   45821909   C    T    NON_SYNONYMOUS_CODING  MODERATE
9    104335682  C    T    NON_SYNONYMOUS_CODING  MODERATE
22   22869123   T    C    NON_SYNONYMOUS_CODING  MODERATE
1    206516261  C    T    NON_SYNONYMOUS_CODING  MODERATE
5    37688010   A    G    SPLICE_SITE_ACCEPTOR   HIGH
12   6560473    A    G    NON_SYNONYMOUS_CODING  MODERATE
20   29632714   C    A    NON_SYNONYMOUS_CODING  MODERATE
21   43896143   C    T    NON_SYNONYMOUS_CODING  MODERATE
2    207656411  A    G    NON_SYNONYMOUS_CODING  MODERATE
9    116136198  C    T    NON_SYNONYMOUS_CODING  MODERATE
3    133666209  C    T    NON_SYNONYMOUS_CODING  MODERATE
3    73111809   A    G    NON_SYNONYMOUS_CODING  MODERATE
12   10999708   T    C    NON_SYNONYMOUS_CODING  MODERATE
18   30804756   C    T    NON_SYNONYMOUS_CODING  MODERATE
19   43382368   T    G    NON_SYNONYMOUS_CODING  MODERATE
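Selecting on Effect_Impact amounts to a membership test on the annotation column. A minimal sketch, assuming the annotations have already been extracted into (CHR, POS, Effect, Effect_Impact) tuples; the rows are taken from the snpEff tables on these slides and the function name is illustrative.

```python
ANNOTATED = [
    ("5", 138266546, "UPSTREAM", "MODIFIER"),
    ("10", 112361870, "SYNONYMOUS_CODING", "LOW"),
    ("21", 43531632, "NON_SYNONYMOUS_CODING", "MODERATE"),
    ("5", 37688010, "SPLICE_SITE_ACCEPTOR", "HIGH"),
    ("16", 83658649, "INTRON", "MODIFIER"),
]


def select_by_impact(records, impacts=("HIGH", "MODERATE")):
    """Keep records whose Effect_Impact (4th field) is in `impacts`."""
    wanted = set(impacts)
    return [record for record in records if record[3] in wanted]


coding = select_by_impact(ANNOTATED)
for chrom, pos, effect, impact in coding:
    print(chrom, pos, effect, impact)
```

Selecting on a specific Effect value such as NON_SYNONYMOUS_CODING follows the same pattern, with the test applied to the third field instead.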

Prioritize variants: snpEff functional annotations

vcf file (“good quality variants – selected genotypes”)
-> Select variants with protein-coding effect: select Effect NON_SYNONYMOUS_CODING
-> filtered vcf file (“non_syn_coding variants”)

CHR  POS        REF  ALT  Effect                 Effect_Impact  Amino_Acid_Change
19   58491627   T    C    NON_SYNONYMOUS_CODING  MODERATE       S141G;S141G;S51G
11   5701281    G    A    NON_SYNONYMOUS_CODING  MODERATE       H43Y;H43Y;H43Y;H43Y;H43Y;H43Y;H43Y;H43Y
19   282753     G    A    NON_SYNONYMOUS_CODING  MODERATE       A124V;A180V;A201V;A73V
11   6913244    A    G    NON_SYNONYMOUS_CODING  MODERATE       I163T
20   377204     G    A    NON_SYNONYMOUS_CODING  MODERATE       R316Q;R343Q
1    148343758  C    T    NON_SYNONYMOUS_CODING  MODERATE       R110K;R110K;R110K;R35K
14   22447251   G    A    NON_SYNONYMOUS_CODING  MODERATE       G78S
12   50746348   A    G    NON_SYNONYMOUS_CODING  MODERATE       F1423L;F1423L
16   19548116   G    A    NON_SYNONYMOUS_CODING  MODERATE       M375I;M375I;M375I
5    179201847  G    A    NON_SYNONYMOUS_CODING  MODERATE       S1007N
12   56030938   C    T    NON_SYNONYMOUS_CODING  MODERATE       P88L
6    116325142  C    T    NON_SYNONYMOUS_CODING  MODERATE       G122R
4    113352397  G    A    NON_SYNONYMOUS_CODING  MODERATE       G487D;G565D;G565D
10   99337572   G    A    NON_SYNONYMOUS_CODING  MODERATE       A35T;A35T;A62T;A62T

Check deleteriousness predictions...

Prioritize variants: population frequencies

vcf file (“good quality variants – selected genotypes”)

CHR  POS        REF  ALT  esp5400_ea
5    132039132  C    A    0.740302
7    21939089   C    A    0.346996
1    5926507    T    C    0.423655
7    21789442   T    C    0.668568
5    13701525   T    C    0.541026
8    17796383   C    T    0.802326
8    17804665   G    A    0.625955
17   79173184   T    C    0.973023
X    38158290   C    T    0.150947
7    21893993   G    T    0.683394
7    21939089   C    A    0.346996
7    21924014   A    G    0.852176
1    5935162    A    T    0.817696
8    17797409   A    G    0.773073
4    5750084    A    T    0.436325
7    37889786   A    G    0.456092
8    17868245   C    A    0.808847
7    21893993   G    T    0.683394
7    21939032   C    A    0.325633
16   4386814    T    C    0.965792

CHR         chromosome
POS         position on the chromosome
REF         sequence in the reference genome
ALT         alternative sequence detected in the sample
esp5400_ea  frequency in a European American population (Exome Variant Server)

Annotations downloaded from: http://evs.gs.washington.edu/EVS/

Prioritize variants: population frequencies

vcf file (“good quality variants – selected genotypes”)
-> Use prior knowledge of the frequency of the condition to be studied: sort population frequency in ascending order or select a threshold, including positions without frequency
-> filtered vcf file (“low frequency variants”)

CHR  POS       REF  ALT  esp5400_ea
11   61166306  G    A    [NA]
16   53634295  C    T    [NA]
7    37934147  A    T    0.006982
7    37934146  T    C    0.009262
15   73029774  T    C    0.121899
X    38158290  C    T    0.150947
16   53672355  C    T    0.300399
4    5747072   C    T    0.310114
7    21939032  C    A    0.325633
7    21939089  C    A    0.346996
6    51613177  C    T    0.360541
17   79170576  C    T    0.392786
6    51640751  C    G    0.394834
1    5925371   G    A    0.412117
1    5926507   T    C    0.423655
4    5750084   A    T    0.436325
7    37889786  A    G    0.456092
7    21695416  A    G    0.486364
6    51935747  C    T    0.494466
6    51913240  G    A    0.510541
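The frequency step can be sketched as a sort with an optional cutoff in which positions without a known frequency are kept and listed first. The rows below come from the frequency tables on these slides; the `[NA]` marker is read as "no frequency available", and the 0.2 cutoff in the usage line is an arbitrary illustration value.

```python
def parse_frequency(value):
    """Return the frequency as a float, or None for the '[NA]' marker."""
    return None if value == "[NA]" else float(value)


def rare_first(records, max_frequency=None):
    """Sort (chrom, pos, freq) rows ascending, unknown frequencies first.

    If max_frequency is given, rows above it are dropped; rows without a
    frequency are always kept.
    """
    parsed = [(chrom, pos, parse_frequency(freq)) for chrom, pos, freq in records]
    if max_frequency is not None:
        parsed = [r for r in parsed if r[2] is None or r[2] <= max_frequency]
    return sorted(parsed, key=lambda r: (r[2] is not None, r[2] or 0.0))


ROWS = [
    ("5", 132039132, "0.740302"),
    ("11", 61166306, "[NA]"),
    ("7", 37934147, "0.006982"),
    ("X", 38158290, "0.150947"),
    ("15", 73029774, "0.121899"),
]

for chrom, pos, freq in rare_first(ROWS, max_frequency=0.2):
    print(chrom, pos, freq)
```

Keeping the unknown-frequency rows matters because a variant absent from the population database may be exactly the rare candidate being looked for.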

View .bam files in the Integrative Genomics Viewer

This open source tool is available at: https://www.broadinstitute.org/igv/

Summary and recommendations

Use a combination of advanced technical filtering, functional annotations, population frequency and visual inspection of variant positions.

Additional information may be taken from:
- clinical information (e.g. aberrant karyotype)
- previous genetic studies (linkage regions)
- the literature (candidate genes)
- dbSNP (especially for markers outside coding regions)
- intersection or union of many samples for the same condition
- conservation and deleteriousness predictors... coming up now!