ngs, cancer and bioinformatics - université...

71
NGS, Cancer and Bioinformatics Exome-sequencing followed by Variant Calling. 29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Upload: others

Post on 06-May-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

NGS, Cancer and BioinformaticsExome-sequencing followed by Variant Calling.

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Page 2: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Overview of exome analysis

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Galaxy Workflow

10

Reads

(Fastq)

Reference Genome

(Fasta)

Conversion

to Galaxy

Format

---------

Groomer

Quality

Control

---------

FastQC

Mapping

---------

Bowtie2

PCR duplicates

Marking

---------

MarkDup

Preprocess GATK

part 1

---------

Local realignment

around indels

Preprocess GATK

part 2

---------

Base Quality Score

Recalibration

Target

Intersection

---------

Intersect Bam

Target

regions

(bed)

Aligned and preprocessed

reads (BAM)

---------- Marked PCR duplicates

- Intersected on target regions

- Realigned around indels

- Recalibrated

7 - 9 AVRIL 2014FORMATION “NGS &CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 3: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Public dataset

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Accessible online on SRA (Sequence Read Archive): ERA148528

Exome sequencing of 2 samples: tumor (lung cancer) and blood(normal sample)

Publication : Ys et al., Genome Res. 2012 Mar;22(3):436-45

• 100bp paired-end reads, Illumina HiSeq 2000• Mean depth higher for the tumor sample (~100X) than for the normal sample (~30X) to detect somatic variant with a low allelicfrequency• Aligned Exome size: ~15 Go tumor; ~7 Go blood• Complete analysis processing time: ~20h

Need to restrict the analysis to a few regions in order to limit the processing time (~112kb)

Page 4: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Select libraries on Galaxy

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. Open your web browser and go to ”https://galaxy.gustaveroussy.fr/galaxyprod”

2. In the top menu, click on « Shared Data » then « Data librairies »

3. Click on [FORMATION] Input Data then « EXOME »4. Select « tumor_R1.fastq » ; « tumor_R2.fastq » ;

« exome_regions.bed » ; « known_sites_regions.vcf » then click on « Go »

Select Librairies on Galaxy

12

1. Open your web browser and go to « http://galaxy.sb-roscoff.fr »

2. In the top menu, click on « Shared Data » then « Data librairies »

3. Click on «canceropole-tp-input »

4. Select « tumor_R1.fastq » ; « tumor_R2.fastq » ; « exome_regions.bed » ;

« known_sites_regions.vcf » then click on « Go ».

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Page 5: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

FASTQ format conversion

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. Rename your history to « Tumor » by clicking on « Unnamedhistory ».2. In the left panel, click on the « search tools » textbox and enter « FASTQ Groomer » and then click on it to convert both your FASTQ into FASTQ Sanger Format.3. Click on « Execute » to launch the conversion.

FASTQ format conversion

13

1. Rename your history to « Tumor » by clicking on « Unnamed history »

2. In the left panel, click on « FASTQ Groomer » under the NGS: QC and

manipulation section to convert both your FASTQ into FASTQ Sanger Format

3. Click on « Execute » to launch the conversion

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Page 6: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

FASTQC : FASTQ Quality Control

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. In the left panel, click on the « search tools » textbox and enter « FASTQC: Read QC »2. Select the FASTQ Groomer dataset and click on « Execute »; repeatfor both reads

The result of FASTQC is an html page that you can view by clicking on the eye

FASTQC : FASTQ Quality Control

14

1. In the left panel, click on « FASTQC: Read QC » under the NGS: QC and

manipulation section

2. Select the FASTQ Groomer dataset and click on « Execute »; repeat for both reads

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

The result of FASTQC is an html page that you can view by clicking on the eye

GENERAL TIP : RENAME YOUR HISTORY ITEMS FREQUENTLY TO BE MORE EXPLICIT THAN « on data xxx » !

Page 7: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

FastQC Metrics

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Look at the different metrics for both reads• Problem: the per base sequence quality of the Read2 are quitelow towards the end

Solution:Trim the 25bp from the 3’ end of the reads Higher confidence in the

sequenced information

Page 8: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

FASTQ Trimmer

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. Use « FASTQ Trimmer » to cut off 25bp from 3’ end of the Groomed Reads (use the « search tools » object to find the tool)2. Run « FASTQC » on the trimmed reads

1. Use « FASTQ Trimmer » to cut off 25bp from 3’ end of t he Groomed Read2 (use

the « search tools » object to find the tool)

2. Run « FASTQC» on the trimmed reads

FASTQ Trimmer

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

16

1. Use « FASTQ Trimmer » to cut off 25bp from 3’ end of t he Groomed Read2 (use

the « search tools » object to find the tool)

2. Run « FASTQC» on the trimmed reads

FASTQ Trimmer

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

16

Page 9: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

If bad-quality sequences/bases distribution more complex

Use a more « elaborated » trimming step

12 - 14 novembre 2014 Formation NGS & Cancer - Analyses RNA-Seq

Page 10: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Overview of exome analysis

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Galaxy Workflow

10

Reads

(Fastq)

Reference Genome

(Fasta)

Conversion

to Galaxy

Format

---------

Groomer

Quality

Control

---------

FastQC

Mapping

---------

Bowtie2

PCR duplicates

Marking

---------

MarkDup

Preprocess GATK

part 1

---------

Local realignment

around indels

Preprocess GATK

part 2

---------

Base Quality Score

Recalibration

Target

Intersection

---------

Intersect Bam

Target

regions

(bed)

Aligned and preprocessed

reads (BAM)

---------- Marked PCR duplicates

- Intersected on target regions

- Realigned around indels

- Recalibrated

7 - 9 AVRIL 2014FORMATION “NGS &CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 11: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Mapping with Bowtie2

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. Use « Bowtie2 » from the « Mapping » section to align reads on the hg19 genome1. Use « Bowtie2 » from the « Mapping » section to align reads on the hg19 genome

Mapping with Bowtie2

24

Preset option:

combination of parameters

designed to have a good tradeoff

between speed, sensitivity,

accuracy

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

24

1. Use « Bowtie2 » from the « Mapping » section to align reads on the hg19 genome

Mapping with Bowtie2

24

Preset option:

combination of parameters

designed to have a good tradeoff

between speed, sensitivity,

accuracy

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

24

1. Use « Bowtie2 » from the « Mapping » section to align reads on the hg19 genome

Mapping with Bowtie2

24

Preset option:

combination of parameters

designed to have a good tradeoff

between speed, sensitivity,

accuracy

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

24

Page 12: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

SAM/BAM aligned format

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• SAM Format: aligned format, human readable

SAM/BAM aligned format

@SQ SN:chr12 LN:133851895

@RG ID:Sample_ID LB:Sample_Library PL:ILLUMINA SM:Sample_Name PU:Platform_Unit

ERR166338.1 99 chr12 82670685 23 101M = 82670850 266

GCCCCTGGGGATGTTTTGCACCAAGCCACTGTCTCCAGCTGG

BBC@GIIHGCFCIEHEAIEIFFGEONDNJFINIONHNGJNNNNKNJN

RG:Z:Sample_ID XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:100 XA:Z

• BAM Format: Binary SAM Format (not human readable but compressed = smaller)

• SAM Format: aligned format, human readable

Read name Flag Chr 5’ pos MAPQ Cigar paired 5’ pos of

the mate Insert size

sequence

Base quality

tags

Group affiliation

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

25

• BAM Format: Binary SAM Format (not human readable but compressed = smaller)

Page 13: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Mapping statistics (directly from the mapper)

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Page 14: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Mapping statistics

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Use « Flagstat » from « Samtools » to see some mapping statistics

Mapping Statistics

• Use « Flagstat » from « Samtools » to see some mapping statistics

% of mapped reads

Properly paired reads:

- 0<= Insert size <= Max size

- Reads on same chromosome

- Reads facing each other

- Both reads are mapped

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

26

Page 15: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Removing duplicates (not for targetedsequencing)

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Duplicates reads: different reads having the same sequence caused by PCR amplication during sequencing library preparation• The removal of the duplicates depends on the application (not suitable for sequencing on small target)

Removing Duplicates

• Duplicates reads: different reads having the same sequence caused by PCR

amplication during sequencing library preparation

• The removal of the duplicates depends on the application (not suitable for sequencing

on small target)

PCRdup

removal

• Galaxy: Use “Mark D

u

plicates reads” from “NGS:Picard” to mark duplicates

• Galaxy: Run “Flagstat” on the output BAM to se e th e nu mber of PC R du plicates

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

27

• Galaxy: Use “Mark Duplicates reads” from “NGS:Picard” to mark duplicates

• Galaxy: Run “Flagstat” on the output BAM to see the number of PCR duplicates

Page 16: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Target intersection

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Use « Intersect BAM alignments with intervals » from« NGS:Bedtools » to keep only the reads mapped on the targetedregions Smaller BAM size The targeted regions must be in BED format (4 columns : chr ;

start ; end ; name)

Target intersection

• Use « Intersect BAM alignments with intervals » from « NGS:Bedtools » to keep only

the reads mapped on the targeted regions

Smaller BAM size

The targeted regions must be in BED format (4 columns : chr ; start ; end ; name)

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

28

Page 17: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Group association

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Use « Add or Replace Groups » from « NGS:Picard » to associate a sample ID and a sequencing technology to the reads

Mandatory for some tools (GATK) or in multi-sample analysis

Group Association

• Use « Add or Replace Groups » from « NGS:Picard » to associate a sample ID

and a sequencing technology to the reads

Mandatory for some tools (GATK) or in multi-sample analysis

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

29

Page 18: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Overview of exome analysis

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Next session

30

Reads

(Fastq)

Reference Genome

(Fasta)

Conversion

to Galaxy

Format

---------

Groomer

Quality

Control

---------

FastQC

Mapping

---------

Bowtie2

PCR duplicates

Marking

---------

MarkDup

Preprocess GATK

part 1

---------

Local realignment

around indels

Preprocess GATK

part 2

---------

Base Quality Score

Recalibration

Target

Intersection

---------

Intersect Bam

Target

regions

(bed)

Aligned and preprocessed

reads (BAM)

---------- Marked PCR duplicates

- Intersected on target regions

- Realigned around indels

- Recalibrated

7 - 9 AVRIL 2014FORMATION “NGS & CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 19: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Variant calling pre-processing : GATK part 1

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Preprocess GATK: part 1

4

Reads

(Fastq)

Reference Genome

(Fasta)

Conversion

to Galaxy

Format

---------

Groomer

Quality

Control

---------

FastQC

Mapping

---------

Bowtie2

PCR duplicates

Marking

---------

MarkDup

Preprocess GATK

part 1

---------

Local realignment

around indels

Preprocess GATK

part 2

---------

Base Quality Score

Recalibration

Target

Intersection

---------

Intersect Bam

Target

regions

(bed)

Aligned and preprocessed

reads (BAM)

---------- Marked PCR duplicates

- Intersected on target regions

- Realigned around indels

- Recalibrated

7 - 9 AVRIL 2014FORMATION “NGS &CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 20: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Why realign around indels ?

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Small Insertion/deletion (Indels) in reads (especially near the ends) can trick the mappers into mis-aligning with mismatches Alignment scoring – cheaper to introduce multiple Single

Nucleotide Variants (SNVs) than an indel: induce a lot of false positive SNVs

• These artifactual mismatches can harm base quality recalibrationand variant detection

• Realignment around indels helps improve the accuracy of severalof the downstream processing steps

Page 21: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Local realignment identifies most parsimoniousalignment along all reads at a problematic locus

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. Find the best alternate consensus sequence that, together withthe reference, best fits the reads in a pile

6 Local realignment identifies most parsimonious

alignment along all reads at a problematic locus

1. Find the best alternate consensus sequence that, together with the reference,

best fits the reads in a pile

2. The score for an alternate consensus is the total sum of the quality scores of

mismatching bases

3. If the score of the best alternate consensus is sufficiently better than the original

alignments, then we accept the proposed realignment of the reads

Realigning

determines

which is better

consistent with the reference

consistent with a 3bp insertion

3 adjacent

SNPs

Local realignment identifies most parsimonious alignment along all reads at a problematic locus

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

2. The score for an alternate consensus is the total sum of the qualityscores of mismatching bases3. If the score of the best alternate consensus is sufficiently betterthan the original alignments, then we accept the proposedrealignment of the reads

Page 22: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Three types of realignment targets

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Known sites: Common polymorphisms: dbSNP, 1000Genomes

• Indels seen in original alignments (in CIGAR, indicated by I for Insertion or D for Deletion)

• Sites where evidences suggest a hidden indel (SNVsabundance)

Page 23: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Known siteshttps://www.broadinstitute.org/gatk/guide/tagged?tag=knownsites

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Why are they important?

Each tool uses known sites differently, but what is common to all is that they use them to help distinguish true variants from false positives, which is very important to how these tools work. If you don't provide known sites, the statistical analysis of the data will be skewed, which candramatically affect the sensitivity and reliability of the results.

In the variant calling pipeline, the only tools that do not strictly require known sites are UnifiedGenotyper and HaplotypeCaller.

2. Recommended sets of known sites per tool

Page 24: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Local realignment around indels

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

After Before SNVs

Deletion

SNVs

Deletion

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

Local realignment around indels

8

Page 25: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Local realignment around indels

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

9

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

Local realignment around indels

Page 26: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Local realignment around indels

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

10

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

Local realignment around indels

Page 27: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Indel realignment steps/tools

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. Identify what regionsneed to be realigned RealignerTargetCreator

+ known sites

2. Perform the actualrealignment (BAM output) IndelRealigner

1. Identify what regions need

to be realigned

RealignerTargetCreator

2. Perform the actual

realignment (BAM output)

IndelRealigner

11

Indel realignment steps/tools

+ known sites

Intervals

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

1. Identify what regions need

to be realigned

RealignerTargetCreator

2. Perform the actual

realignment (BAM output)

IndelRealigner

11

Indel realignment steps/tools

+ known sites

Intervals

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

Intervals

Page 28: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Galaxy: Realigner Target Creator

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Use « Realigner Target Creator » from « GATK Tools » to detect intervals in need of local realignment

12

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

Galaxy: Realigner Target Creator • Use « Realigner Target Creator » from « GATK Tools » to detect intervals in need of

local realignment

Add new binding

for reference-

ordered datas

Choose

advanced

GATK options

Add new operate

on Genomic

Intervals

Page 29: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Galaxy: Indel Realigner

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Use « Indel Realigner » from « GATK Tools » to apply local realignment

13

7 - 9 AVRIL 2014

Galaxy: Indel Realigner

• Use « Indel Realigner » from « GATK Tools » to apply local realignment

Add new binding

for reference-

ordered datas

Choose

advanced

GATK options

Add new operate

on Genomic

Intervals

Page 30: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Preprocess GATK: part 2

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Preprocess GATK: part 2

14

Reads

(Fastq)

Reference Genome

(Fasta)

Conversion

to Galaxy

Format

---------

Groomer

Quality

Control

---------

FastQC

Mapping

---------

Bowtie2

PCR duplicates

Marking

---------

MarkDup

Preprocess GATK

part 1

---------

Local realignment

around indels

Preprocess GATK

part 2

---------

Base Quality Score

Recalibration

Target

Intersection

---------

Intersect Bam

Target

regions

(bed)

Aligned and preprocessed

reads (BAM)

---------- Marked PCR duplicates

- Intersected on target regions

- Realigned around indels

- Recalibrated

7 - 9 AVRIL 2014FORMATION “NGS & CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 31: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Why recalibrate base qualities ?

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

15

Real data is messy so properly estimating the evidence is critical

Why recalibrate base qualities?

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

Real data is messy so properly estimating the evidence is critical

Page 32: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

The quality scores issued by sequencers are inaccurate and biased

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Quality scores are critical for all downstream analysis• Systematic biases are a major contributor to bad calls• Example of sequence context bias in the reported qualities:

16

The quality scores issued by sequencers are inaccurate and biased

• Quality scores are critical for all downstream analysis

• Systematic biases are a major contributor to bad calls

• Example of sequence context bias in the reported qualities:

before after

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

The quality scores issued by sequencers are inaccurate and biased

Page 33: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Evidences of error covariates

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Analyze covariation among several features of a base, e.g.:

• Reported quality score

• Position within the read (machine cycle)

• Preceding and current nucleotides (chemistry effect)

• Sequencing technology...

• Adjust the quality score associated to each sequenced base to bemore accurate Remove systematic biases

Page 34: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

How the covariates are analyzed ?

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Keep track of the number of observations and the number of times it was an error as a function of various covariates:

• Typically stratify the data by lane, by original quality score, by machine cycle and sequencing context• Databases of known variants are used to discount most of the real genetic variation present in the sample• All other differences are assumed to be errors• Having done indel realignment first reduces noise

18

How the covariates are analyzed?

• Keep track of the number of observations and the number of times it was

an error as a function of various covariates:

• Typically stratify the data by lane, by original quality score, by

machine cycle and sequencing context

• Databases of known variants are used to discount most of the real

genetic variation present in the sample

• All other differences are assumed to be errors

• Having done indel realignment first reduces noise

Phred-scaled

Quality score

#mismatches+1

#bases+2

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

How the covariates are analyzed?

https://www.broadinstitute.org/gatk/guide/article?id=44

Page 35: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

BQSR steps/tools

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. Model the error modes and recalibrate qualities Base Recalibrator

19

1. Model the error modes and

recalibrate qualities

Count Covariates

2. Perform the actual

recalibration (BAM output)

Table Recalibration

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

BQSR steps/tools

Covariates 2. Perform the actualrecalibration (BAM output) Print Reads

19

1. Model the error modes and

recalibrate qualities

Count Covariates

2. Perform the actual

recalibration (BAM output)

Table Recalibration

FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES” 7 - 9 AVRIL 2014

BQSR steps/tools

Covariates

Co

variates

Page 36: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Galaxy: Base Recalibrator

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Use « Base Recalibrator » from « GATK Tools » to recalibrate base quality scores

Add new binding for reference-

ordered datas

Chooseadvanced

GATK optionsAdd new

operate on Genomicintervals

Page 37: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Galaxy: Print Reads

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Page 38: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Final results

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Final Results

22

Reads

(Fastq)

Reference Genome

(Fasta)

Conversion

to Galaxy

Format

---------

Groomer

Quality

Control

---------

FastQC

Mapping

---------

Bowtie2

PCR duplicates

Marking

---------

MarkDup

Preprocess GATK

part 1

---------

Local realignment

around indels

Preprocess GATK

part 2

---------

Base Quality Score

Recalibration

Target

Intersection

---------

Intersect Bam

Target

regions

(bed)

Aligned and preprocessed

reads (BAM)

---------- Marked PCR duplicates

- Intersected on target regions

- Realigned around indels

- Recalibrated

7 - 9 AVRIL 2014FORMATION “NGS & CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 39: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Next step: somatic variant calling

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Next Step: somatic variant calling

Variant Calling

---------

VarScan

Somatic

Variant

Annotation

-----

Annovar

Mpileup

Tumeur

---------

Samtools

Mpileup

Mpileup

Normal

---------

Samtools

Mpileup

Aligned and

preprocessed Reads

(BAM)

Aligned and

preprocessed Reads

(BAM)

Tumor

Normal

Variant

Selection

-----

Select

23

7 - 9 AVRIL 2014FORMATION “NGS &CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 40: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Using workflow in Galaxy: Repeat the same stepsfor the blood sample

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Repeat the same steps for the blood sample

Variant Calling

---------

VarScan

Somatic

Variant

Annotation

-----

Annovar

Mpileup

Tumeur

---------

Samtools

Mpileup

Mpileup

Normal

---------

Samtools

Mpileup

Aligned and

preprocessed Reads

(BAM)

Aligned and

preprocessed Reads

(BAM)

Tumor

Normal

Variant

Selection

-----

Select

4

7 - 9 AVRIL 2014FORMATION “NGS &CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Repeat process

Page 41: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Repeat all these steps in a few click…

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Repeat all these steps in  a

 few  click…

5

Reads

(Fastq)

Reference Genome

(Fasta)

Conversion

to Galaxy

Format

---------

Groomer

Quality

Control

---------

FastQC

Mapping

---------

Bowtie2

PCR duplicates

Marking

---------

MarkDup

Preprocess GATK

part 1

---------

Local realignment

around indels

Preprocess GATK

part 2

---------

Base Quality Score

Recalibration

Target

Intersection

---------

Intersect Bam

Target

regions

(bed)

Aligned and preprocessed

reads (BAM)

---------

- Marked PCR duplicates

- Intersected on target regions

- Realigned around indels

- Recalibrated

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Page 42: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Select libraries on Galaxy

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. In the top menu, click on « Shared Data » then « Data librairies »2. Click on «[FORMATION] Input Data» then « EXOME »3. Select « normal_R1.fastq » ; « normal_R2.fastq » ; « exome_regions.bed » ; « known_sites_regions.vcf »4. Select « Import to Histories » then click on Go

Select Librairies on Galaxy

6

1. In the top menu, click on « Shared Data » then « Data librairies »

2. Click on «canceropole-tp-input »

3. Select « normal_R1.fastq » ; « normal_R2.fastq » ; « exome_regions.bed » ;

« known_sites_regions.vcf »

4. Select « Import to Histories » then click on Go

5. Write a new history name and click on

« import library datasets »

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Select Librairies on Galaxy

6

1. In the top menu, click on « Shared Data » then « Data librairies »

2. Click on «canceropole-tp-input »

3. Select « normal_R1.fastq » ; « normal_R2.fastq » ; « exome_regions.bed » ;

« known_sites_regions.vcf »

4. Select « Import to Histories » then click on Go

5. Write a new history name and click on

« import library datasets »

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

5. Write a new history nameand click on « import librarydatasets »

Page 43: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Extract workflow from history

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. In the top menu, click on « AnalyzeData » to return to the main frame2. In the « history » panel, click on the topside wheel then on « ExtractWorkflow »3. Write a name for your workflow thenclick on « Create Workflow »

Extract workflow from history

7

1. In the top menu, click on « Analyze Data » to return to

the main frame

2. In the « history » panel, click on the topside wheel then

on « Extract Workflow »

3. Write a name for your workflow then click on « Create

Workflow »

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Extract workflow from history

7

1. In the top menu, click on « Analyze Data » to return to

the main frame

2. In the « history » panel, click on the topside wheel then

on « Extract Workflow »

3. Write a name for your workflow then click on « Create

Workflow »

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Page 44: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Edit your workflow

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. In the top menu, click on « Workflow » to return to the main frame2. Click on your new workflow then select «Edit»3. Identify and rename « Input dataset » boxes corresponding to R1/R2/BED/VCF4. Each box represent a tool set with parametersthat you can modify by clicking on it

Click on the « Add or Replace Groups » box and change the read group ID and name (you can also choose to set them atruntime)

Check the input of the « Intersect BAM» box: it should be the BAM output from « Mark Duplicates »

5. Save your edited worfklow by clicking on the wheel then « save »

Edit your worklow

8

1. In the top menu, click on « Workflow » to return

to the main frame

2. Click on your new workflow then select « Edit »

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

3. Each box represent a tool set with parameters that

you can modify by clicking on it

Click on the « Add or Replace Groups » box and

change the read group ID and name (you can also

choose to set them at runtime)

Check the input of the « Intersect BAM» box: it

should be the BAM output from « Mark Duplicates »

and from « flagstat » 4. Save your edited worfklow by clicking on the wheel

then « save »

Set at

runtime

Edit your worklow

8

1. In the top menu, click on « Workflow » to return

to the main frame

2. Click on your new workflow then select « Edit »

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

3. Each box represent a tool set with parameters that

you can modify by clicking on it

Click on the « Add or Replace Groups » box and

change the read group ID and name (you can also

choose to set them at runtime)

Check the input of the « Intersect BAM» box: it

should be the BAM output from « Mark Duplicates »

and from « flagstat » 4. Save your edited worfklow by clicking on the wheel

then « save »

Set at

runtime

Page 45: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Run your workflow on the blood sample

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. A workflow can only be runned on data presentin the current history Click on the wheel in the top of your history

panel and select « saved histories » Click on the « Normal » history and click on

« switch » Go back to the « Workflow » page (top

panel) then click on your workflow then on « Run »

2. Check that all your input files are correct (step1: bed, step2: vcf, step3: R1, step4: R2)3. Click on « Run workflow » at the bottom of the page

Run your workflow on the blood sample

9

1. A workflow can only be runned on data present in the current

history

Click on the wheel in the top of your history panel and

select « saved histories »

Click on the « Normal » history and click on « switch »

Go back to the « Workflow » page (top panel) then click

on your workflow then on « Run »

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

2. Check that all your input files are correct (step1: bed, step2: vcf, step3: R1, step4: R2)

3. Click on « Run workflow » at the bottom of the page

Run your workflow on the blood sample

9

1. A workflow can only be runned on data present in the current

history

Click on the wheel in the top of your history panel and

select « saved histories »

Click on the « Normal » history and click on « switch »

Go back to the « Workflow » page (top panel) then click

on your workflow then on « Run »

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

2. Check that all your input files are correct (step1: bed, step2: vcf, step3: R1, step4: R2)

3. Click on « Run workflow » at the bottom of the page

Page 46: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Let Galaxy work for you!

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Let Galaxy work for you!

10

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Let Galaxy work for you!

10

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Let Galaxy work for you!

10

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Page 47: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Next Step

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Next Step

Variant Calling

---------

VarScan

Somatic

Variant

Annotation

-----

Annovar

Mpileup

Tumeur

---------

Samtools

Mpileup

Mpileup

Normal

---------

Samtools

Mpileup

Aligned and

preprocessed Reads

(BAM)

Aligned and

preprocessed Reads

(BAM)

Tumor

Normal

Variant

Selection

-----

Select

11

7 - 9 AVRIL 2014FORMATION “NGS & CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 48: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Somatic Variant Calling with VarScan2

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Somatic Variant Calling with Varscan

Variant Calling

---------

VarScan

Somatic

Variant

Annotation

-----

Annovar

Mpileup

Tumeur

---------

Samtools

Mpileup

Mpileup

Normal

---------

Samtools

Mpileup

Aligned and

preprocessed Reads

(BAM)

Aligned and

preprocessed Reads

(BAM)

Tumor

Normal

Variant

Selection

-----

Select

4

7 - 9 AVRIL 2014FORMATION “NGS & CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 49: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Variant Calling

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Factors to consider when calling a SNVs:– Base call qualities of each supporting base (base quality)– Proximity to small indels, or homopolymer run– Mapping qualities of the reads supporting the SNP– Sequencing depth: >=30x for constit ; >=100 for tumor– SNVs position within the reads: Higher error rate at the

reads ends– Look at strand bias (SNVs supported by only one strand are

more likely to be artifactual)– Allelic frequency: Tumor cellularity will reduce the % of an

heterozygous variant• Higher stringency when calling indels (and sanger validation oftenneeded)

Page 50: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Depth of Coverage

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Depth of Coverage = number of reads supporting one position ex: 1X, 5X, 100X... >1000X

NGS - Applications and Analysis

Depth of Coverage = number of reads supporting one positions

ex: 1X, 5X, 100X… >1000X

Reference Genome

7X 5X brin + 2X brin -

Calling Confidence

Reference Base

SNV Sequencing Error

Aligned Reads

2X 17X 9X brin + 8X brin -

100% = SNV

Homozygote

50% 50%

= SNV Heterozygote

--- +++

SNV and sequence context (errors)

7

Depth of Coverage

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Page 51: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

VarScan2

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Mutation caller written in Java (no installation required) working withPileup files of Targeted, Exome, and Whole-Genome sequencing data (DNAseq or RNAseq)

• Multi-platforms: Illumina, SOLiD, Life/PGM, Roche/454

• Detection of different kinds of Germline SNVs/Indels (classical mode): Variants in individual samples Multi-sample variants shared or private in multi-sample datasets

• VarScan specificity is to be able to work with Tumor/Normal pairs (somatic mode):

Somatic and germline mutation, LOH events in tumor-normal pairs Somatic copy number alterations (CNAs) in tumor-normal exome

data

Page 52: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

VarScan2

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Most published variant callers use Bayesian statistics (a probabilistic framework) to detect variants and assess confidence in them (e.g.: GATK)

• VarScan uses a robust heuristic/statistic approach to call variantsthat meet desired thresholds for read depth, base quality, variant allele frequency, and statistical significance

• In Stead et al. (2013), they compared 3 different somatic callers : MuTect, Strelka, VarScan2 VarScan2 performed best overall with sequencing depths of 100x,

250x, 500x and 1000x required to accurately identify variantspresent at 10%, 5%, 2.5% and 1% respectively

Page 53: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Common history

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. In the wheel from the history panel, select « Copy datasets »2. Select the preprocessed BAM from Normal and Tumor histories and « exome_regions.bed »

Common history

1. In the wheel from the history panel, select « Copy datasets »

2. Select the preprocessed BAM from Normal and Tumor histories

and « exome_regions.bed »

10

Common history

1. In the wheel from the history panel, select « Copy datasets »

2. Select the preprocessed BAM from Normal and Tumor histories

and « exome_regions.bed »

10

Common history

1. In the wheel from the history panel, select « Copy datasets »

2. Select the preprocessed BAM from Normal and Tumor histories

and « exome_regions.bed »

10

Common history

1. In the wheel from the history panel, select « Copy datasets »

2. Select the preprocessed BAM from Normal and Tumor histories

and « exome_regions.bed »

10

Common history

1. In the wheel from the history panel, select « Copy datasets »

2. Select the preprocessed BAM from Normal and Tumor histories

and « exome_regions.bed »

10

Page 54: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Common history

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. Edit each BAM attributes by clicking on the little pen2. To add more clarity, rename your BAM in « Normal.bam » and « Tumor.bam » then « Save » the changes

Common history

1. Edit each BAM attributes by clicking on the little pen

2. To add more clarity, rename your BAM in « Normal.bam » and « Tumor.bam »

then « Save » the changes

11

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Common history

1. Edit each BAM attributes by clicking on the little pen

2. To add more clarity, rename your BAM in « Normal.bam » and « Tumor.bam »

then « Save » the changes

11

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Common history

1. Edit each BAM attributes by clicking on the little pen

2. To add more clarity, rename your BAM in « Normal.bam » and « Tumor.bam »

then « Save » the changes

11

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Page 55: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Somatic variant calling with VarScan

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Somatic Variant Calling with Varscan

Variant Calling

---------

VarScan

Somatic

Variant

Annotation

-----

Annovar

Mpileup

Tumeur

---------

Samtools

Mpileup

Mpileup

Normal

---------

Samtools

Mpileup

Aligned and

preprocessed Reads

(BAM)

Aligned and

preprocessed Reads

(BAM)

Tumor

Normal

Variant

Selection

-----

Select

12

7 - 9 AVRIL 2014FORMATION “NGS &CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 56: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Mpileup

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. Use « Mpileup » from « NGS:Samtools » to create pileup files (repeatfor Tumor and normal samples)

Mpileup

1. Use « Mpileup » from « NGS:Samtools » to create pileup files (repeat for Tumor

and normal samples) Anomalous read pairs

are due to the

restriction of the exome

to a region

13

Mpileup

1. Use « Mpileup » from « NGS:Samtools » to create pileup files (repeat for Tumor

and normal samples) Anomalous read pairs

are due to the

restriction of the exome

to a region

13

Anomalousread pairs aredue to therestriction ofthe exome to aregion

Page 57: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Pileup format

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Describes the base-pair information at each position

Pileup format

• Describes the base-pair information at each position

Reference base

Number of reads

covering the site

(total depth)

Read bases:

. / , = match on forward/reverse strand

ACGTN / acgtn = mismatch on forward/reverse strand

`-\+[0-9]+[ACGTNacgtn]+‘ indicates an indel

chr12 112888238 A 108 .$,$.,,.,,..,,.,.,..,.,,.,.,,.,..,,..,,.,.,.,.,,.,.,.,,,.,.

,.,.,,,.,.,.,.,.,,.,.,.,.,,,,.,,.,.,..,,,,,.,.,,,,^F,

=4 =??????@??@?@@ @=@ ??@ ? @? ?

??< ? ??@??????? ? @??? ? ??@??

A???@@ ?@@???AB????= ? @

@@??@@?@ A 00

chr12 112888239 C 108

.$t,.,,.T,tT,.,T.,.,t.tTtt.tTT,t.T,tTt.tT,T,,.t

TtTttt.,.,Tt.ttt.,T,.,.tT,,T,T,.tT,,t,TttTtT,T.

,tt,,T,T,tttt^F.^F,

936 78??6??6 45<875? ??? ?@6

@6???<?=6 66= ? ???? 6=7??=???<8@7=7=?

77?7?8 ??78?7????? 8 <8??88 9?8 ?0048

Base qualities

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

14

Page 58: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Somatic variant calling with VarScan

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Somatic Variant Calling with Varscan

Variant Calling

---------

VarScan

Somatic

Variant

Annotation

-----

Annovar

Mpileup

Tumeur

---------

Samtools

Mpileup

Mpileup

Normal

---------

Samtools

Mpileup

Aligned and

preprocessed Reads

(BAM)

Aligned and

preprocessed Reads

(BAM)

Tumor

Normal

Variant

Selection

-----

Select

15

7 - 9 AVRIL 2014FORMATION “NGS & CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 59: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

VarScan somatic

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

1. Use « VarScan somatic » from « Varscan » to detectvariant• Min-var-freq: minimal allelic frequency to call a variant (10% here)• Min-coverage: minimum coverage to call a variant (in normal and tumor and combined)• Tumor and normal purity: cellularity of your sample

2 output files: SNVs & Indels in VCF format

VarScan Somatic

1. Use « VarScan somatic » from « Varscan » to

detect variant

• Min-var-freq: minimal allelic frequency to call a variant

(10% here)

• Min-coverage: minimum coverage to call a variant (in

normal and tumor and combined)

• Tumor and normal purity: cellularity of your sample

2 output files: SNVs & Indels in VCF format

16

VarScan Somatic

1. Use « VarScan somatic » from « Varscan » to

detect variant

• Min-var-freq: minimal allelic frequency to call a variant

(10% here)

• Min-coverage: minimum coverage to call a variant (in

normal and tumor and combined)

• Tumor and normal purity: cellularity of your sample

2 output files: SNVs & Indels in VCF format

16

Page 60: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

VarScan VCF format

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• 2 types: VCF (specific to VarScan) Tabulated (available only for VarScan in classical mode)

• VarScan VCF format: classic VCF header (#) but specific variant lines

GT=Genotype (1/1: Homozygous ; 0/1 : Heterozygous) / GQ= Genotype Quality

SS= Somatic Status (0=ref; 1=Germline ; 2=Somatic; 3=LOH ; 5= Unknown)

DP= Quality Read Depth of bases with Phred score >= BAPQ

RD= Depth of reference-supporting bases

AD= Depth of variant-supporting bases

FREQ= Variant allele frequency

DP4= Ref/FWD , Ref/REV, Alt/FWD, Alt/REV

VarScan VCF Format

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR

chr12 250239 . A G 20 PASS

DP=115;SOMATIC;SS=2;

SSC=21;GPV=1; SPV=6.3E-

3

GT:GQ:

DP:RD:AD:

FREQ:DP4

0/0:.:52:48:0:0%

:15,33,0,0

0/1:.:63:50:8:13.79%

:19,31,3,5

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

• VarScan VCF format: classic VCF header (#) but specific variant lines

• 2 types:

• VCF (specific to VarScan)

• Tabulated (available only for VarScan in classical mode)

17

Page 61: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

VarScan Tabulated Format

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

VarScan Tabulated Format

Chrom Position Ref Cons Reads1 Reads2 VarFreq Strands

1

Strands

2 Qual1 Qual2 Pvalue

Map

Qual1

Map

Qual2 R1 + R1 - R2 + Rs2 - Alt

chr12 113348849 C Y 31 30 49.18% 2 2 27 27 0.98 1 1 19 12 25 5 T

chr12 113354329 G R 72 2 2.70% 2 2 31 26 0.98 1 1 48 24 1 1 A

chr12 113357193 G A 2 72 97.30% 1 2 28 24 0.98 1 1 2 0 45 27 A

chr12 113357209 G A 0 77 100% 0 2 0 29 0.98 0 1 0 0 51 26 A

Cons : Consensus Genotype of Variant Called (IUPAC code):

M -> A or C Y -> C or T D -> A or G or T W -> A or T V -> A or C or G

R -> A or G K -> G or T B -> C or G or T S -> C or G H -> A or C or T

18

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

VarScan Tabulated Format

Chrom Position Ref Cons Reads1 Reads2 VarFreq Strands

1

Strands

2 Qual1 Qual2 Pvalue

Map

Qual1

Map

Qual2 R1 + R1 - R2 + Rs2 - Alt

chr12 113348849 C Y 31 30 49.18% 2 2 27 27 0.98 1 1 19 12 25 5 T

chr12 113354329 G R 72 2 2.70% 2 2 31 26 0.98 1 1 48 24 1 1 A

chr12 113357193 G A 2 72 97.30% 1 2 28 24 0.98 1 1 2 0 45 27 A

chr12 113357209 G A 0 77 100% 0 2 0 29 0.98 0 1 0 0 51 26 A

Cons : Consensus Genotype of Variant Called (IUPAC code):

M -> A or C Y -> C or T D -> A or G or T W -> A or T V -> A or C or G

R -> A or G K -> G or T B -> C or G or T S -> C or G H -> A or C or T

18

7 - 9 AVRIL 2014 FORMATION  “NGS  &  CA NCER  :  ANALYSE  DE  VARIANTS  GÉNOMIQUES”

Page 62: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Variant Annotation

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Variant Annotation

Variant Calling

---------

VarScan

Somatic

Variant

Annotation

-----

Annovar

Mpileup

Tumeur

---------

Samtools

Mpileup

Mpileup

Normal

---------

Samtools

Mpileup

Aligned and

preprocessed Reads

(BAM)

Aligned and

preprocessed Reads

(BAM)

Tumor

Normal

Variant

Selection

-----

Select

19

7 - 9 AVRIL 2014FORMATION “NGS & CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 63: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Different types of SNVs

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• SNVs and short indels are the most frequent events: Intergenic Intronic cis-regulatory splice sites frameshift or not synonymous or not begnin or damaging etc...

• Example of SNV one want to pinpoint: non-synonymous + highly deleterious + somatically acquired

Page 64: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Resources dedicated to human geneticvariations

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• dbSNP and 1000-genomes Population-scale DNA polymorphisms

• COSMIC Catalogue Of Somatic Mutations In Cancer

• Non synonymous SNVs predictions SIFT, Polyphen2 (damaging impact)... PhyloP, GERP++

(conservation)

ANNOVAR• Tools to annotate genetic variations

Page 65: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Annovar

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Use « Annovar » to annotate SNVs and IndelsMulti sample VCF (contains Tumor & normal samples)

RefGene: Gene & Function & AminoAcidChange (HGVS format: c.A155G ; p.Lys45Arg)

1000g2012apr_all: Minor Allele Frequency for all ethnies

ESP6500: Exome Sequencing Project Ljb_all : predictions (SIFT, Polyphen2, LRT,

MutationTaster, PhyloP, GERP++)

Tabulated file

Annovar

Use « Annovar » to annotate SNVs and Indels

Multi sample VCF (contains Tumor & normal samples)

• RefGene: Gene & Function & AminoAcid Change

(HGVS format: c.A155G ; p.Lys45Arg)

• 1000g2012apr_all: Minor Allele Frequency for all

ethnies

• ESP6500: Exome Sequencing Project

• Ljb_all : predictions (SIFT, Polyphen2, LRT,

MutationTaster, PhyloP, GERP++)

Tabulated file

22

Annovar

Use « Annovar » to annotate SNVs and Indels

Multi sample VCF (contains Tumor & normal samples)

• RefGene: Gene & Function & AminoAcid Change

(HGVS format: c.A155G ; p.Lys45Arg)

• 1000g2012apr_all: Minor Allele Frequency for all

ethnies

• ESP6500: Exome Sequencing Project

• Ljb_all : predictions (SIFT, Polyphen2, LRT,

MutationTaster, PhyloP, GERP++)

Tabulated file

22

Page 66: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Variant selection

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Variant Selection

Variant Calling

---------

VarScan

Somatic

Variant

Annotation

-----

Annovar

Mpileup

Tumeur

---------

Samtools

Mpileup

Mpileup

Normal

---------

Samtools

Mpileup

Aligned and

preprocessed Reads

(BAM)

Aligned and

preprocessed Reads

(BAM)

Tumor

Normal

Variant

Selection

-----

Select

23

7 - 9 AVRIL 2014FORMATION “NGS &CA NCER : ANALYSE DE VARIANTS GÉNOMIQUES”

Page 67: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Select variant predicted as somatic

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

• Use the « Select » tools from « Filter and Sort » to select only linesmatching the pattern « SOMATIC »

Select variant predicted as somatic

• Use the « Select » tools from « Filter and Sort » to select only lines matching the

pattern « SOMATIC »

24

Select variant predicted as somatic

• Use the « Select » tools from « Filter and Sort » to select only lines matching the

pattern « SOMATIC »

24

Page 68: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Page 69: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Annexe 1 : Frequently mutated genes in WES

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Fuentes Fajardo KV, Adams D; NISC Comparative SequencingProgram, Mason CE, Sincan M, Tifft C, Toro C, Boerkoel CF, Gahl W, Markello T. Detectingfalse-positive signals in exome sequencing. Hum Mutat. 2012 Apr;33(4):609-13.doi: 10.1002/humu.22033. Epub 2012 Mar 5. PubMed PMID: 22294350; PubMed Central PMCID: PMC3302978.

Potentially false-positives = 2157 GENES !!! (ex : MUCxx, HLA-xxx,

Page 70: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Annexe 2 : without « normal » sample ?

29 janvier 2015 Formation NGS & Cancer - Analyses Exome

Page 71: NGS, Cancer and Bioinformatics - Université Paris-Saclayrssf.i2bc.paris-saclay.fr/transfert/IFSBM/IFSBM_TP... · 2017. 5. 24. · 29 janvier 2015 Formation NGS & Cancer - Analyses

Annexe 3 : how to visualize variants ?

29 janvier 2015 Formation NGS & Cancer - Analyses Exome