introduction to ngs variant calling analysis (ueb-uat bioinformatics course - session 2.3 - vhir,...


DESCRIPTION

Course: Bioinformatics for Biomedical Research (2014). Session: 2.3 - Introduction to NGS Variant Calling Analysis. Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.

TRANSCRIPT

Page 1: Introduction to NGS Variant Calling Analysis (UEB-UAT Bioinformatics Course - Session 2.3 - VHIR, Barcelona)

Hospital Universitari Vall d’Hebron

Institut de Recerca - VHIR

Institut d’Investigació Sanitària de l’Instituto de Salud Carlos III (ISCIII)

Bioinformàtica per la Recerca Biomèdica

http://ueb.vhir.org/2014BRB

Ferran Briansó [email protected]

15/05/2014

INTRODUCTION TO NGS VARIANT CALLING ANALYSIS

PRESENTATION OUTLINE

1. NGS WORKFLOW OVERVIEW

2. WET LAB STEPS

3. IMPORTANT SEQUENCING CONCEPTS

4. NGS ANALYSIS WORKFLOW

    1. Primary analysis: de-multiplexing, QC

    2. Secondary analysis: read mapping and variant calling

    3. Tertiary analysis: annotation, filtering...

5. VISUALIZATION

6. COMMON PIPELINES AND FORMATS

7. CONCLUSIONS

NGS WORKFLOW OVERVIEW

Extracted from Dr Kassahn's publicly shared slides (2013)

LIBRARY PREPARATION

Select target: hybridization-based capture or PCR

Add adapters: contain binding sequences, barcodes and primer sequences

Amplify material

A) Fragment DNA

B) End-repair

C) A-tailing, adapter ligation and PCR

D) Final library contains:

• sample insert

• indices (barcodes)

• flowcell binding sequences

• primer binding sequences


TEMPLATE PREPARATION

Attachment of library, e.g. to Illumina flowcell

Amplification of library molecules, e.g. bridge amplification

BRIDGE AMPLIFICATION

SEQUENCING

Sequencing-by-Synthesis

Detection by:

• Illumina: fluorescence

• Ion Torrent: pH

• Roche 454: pyrophosphate release, detected as light

SEQUENCING-BY-SYNTHESIS (ILLUMINA)

IMPORTANT SEQUENCING CONCEPTS

Barcoding/Indexing: allows multiplexing of different samples

Single-end vs paired-end sequencing

Coverage: average number of reads covering each target position

Quality scores (Q score): logarithmic scale, Q = -10·log10(P[wrong base call])
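Because quality scores are logarithmic, a small change in Q is a large change in error probability. The sketch below converts between the two and decodes a quality string. It assumes the Phred+33 ASCII encoding used by current Illumina FASTQ files; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in qual]


# Each step of 10 in Q divides the error probability by 10.
for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error in {round(1 / phred_to_error_prob(q))} bases")
```

For instance, Q20 corresponds to one expected error in 100 bases and Q30 to one in 1000.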

NGS DATA ANALYSIS WORKFLOW

DE-MULTIPLEXING (BARCODE SPLITTING)

FASTQ FORMAT

see en.wikipedia.org/wiki/FASTQ_format
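As a rough illustration of the format, each FASTQ record spans four lines: an "@" header with the read ID, the sequence, a "+" separator, and one quality character per base. The reader below is a minimal sketch that assumes unwrapped (single-line) sequences; `FastqRecord` and `parse_fastq` are names made up for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, qual)


example = ["@read1\n", "GATTACA\n", "+\n", "IIII5!!\n"]
for record in parse_fastq(example):
    print(record.read_id, record.sequence, record.quality)
```

The same function accepts an open file handle, since a text file iterates line by line.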

SEQUENCE QUALITY: FastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Details of the output: https://docs.google.com/document/pub?id=16GwPmwYW7o_r-ZUgCu8-oSBBY1gC97TfTTinGDk98Ws

NGS DATA ANALYSIS WORKFLOW

READ MAPPING (BASIC ALIGNMENT)

Comparison against reference genome (! not assembly !)

Many aligners (short reads, longer reads, RNA-Seq...). Examples: BWA, Bowtie

SAM/BAM files

BURROWS-WHEELER ALIGNMENT TOOL (BWA)

Popular tool for genomic sequence data (not RNA-Seq!)

Challenge: compare billions of short sequence reads (.fastq file) against the human genome (3 Gb)

Uses the Burrows-Wheeler Transform to “index” the human genome, allowing fast, memory-efficient string matching between sequence reads and the reference genome

Li & Durbin 2009, Bioinformatics
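To give a feel for how a Burrows-Wheeler index supports string matching, here is a toy sketch: it builds the transform by sorting rotations and counts exact pattern matches by backward search. This is not BWA's implementation, which uses compressed index structures and handles mismatches and gaps; the naive construction here only suits short strings, and the function names are assumptions for illustration.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact matches of pattern in the original text by backward search."""
    # first_pos[c]: row of the first sorted rotation that starts with c
    first_pos = {}
    for row, c in enumerate(sorted(bwt_str)):
        first_pos.setdefault(c, row)
    lo, hi = 0, len(bwt_str)  # half-open range of matching rows
    for c in reversed(pattern):
        if c not in first_pos:
            return 0
        # Naive rank query; a real index precomputes these counts.
        lo = first_pos[c] + bwt_str[:lo].count(c)
        hi = first_pos[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


reference = "GATTACAGATTACA"
index = bwt(reference)
print(index, count_occurrences(index, "ATTA"))
```

The search touches the pattern one character at a time, right to left, and never scans the reference itself, which is what makes the approach attractive for a 3 Gb genome.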

SAM/BAM FILES

see http://samtools.sourceforge.net/SAMv1.pdf

@ Header (information regarding reference genome, alignment method...)

1) Read ID (QNAME)

2) Bitwise FLAG (first/second read in pair, both reads mapped...)

3) Reference sequence name (RNAME)

4) Position (POS, coordinate)

5) Mapping quality (MAPQ = -10·log10 P[wrong mapping position])

6) CIGAR (describes alignment: matches, skipped regions, insertions...)

7) Reference sequence name of the mate (RNEXT)

8) Position of the mate (PNEXT)

9) Template length (TLEN)

10) Read sequence (SEQ)

11) QUAL (base qualities in FASTQ encoding, '*' if not available)

...
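The eleven mandatory fields can be read with a few lines of code. The sketch below splits one tab-separated alignment line, decodes the bitwise FLAG and the CIGAR string, and converts MAPQ to a probability. The flag bit values follow the SAM specification; the record class, the short flag names and the function names are illustrative assumptions. In practice a library such as pysam would normally be used instead.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


# Bit values as defined in the SAM specification; the labels are shorthand.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory fields of one alignment line (optional tags ignored)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


def decode_flag(flag: int) -> list[str]:
    """List the names of all bits set in a FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def mapq_to_error_prob(mapq: int) -> float:
    """Probability that the mapping position is wrong, from MAPQ."""
    return 10 ** (-mapq / 10)


line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
rec = parse_sam_line(line)
print(decode_flag(rec.flag), parse_cigar(rec.cigar), mapq_to_error_prob(rec.mapq))
```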

VARIANT CALLING

Identify sequence variants

Distinguish signal vs noise

VCF files

Examples: SAMtools, SNVMix

SEQUENCE VARIANTS

Differences from the reference

SEQUENCE VARIANTS

Sanger: is it real??

NGS: read count

Provides confidence (statistics!)

Sensitivity is a tunable parameter (dependent on coverage)
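One simple way to see how read counts provide statistical confidence: ask how likely it is that sequencing errors alone would produce the observed number of variant-supporting reads. The sketch below uses a binomial model with a single, uniform per-read error rate, which is a strong simplification (real callers use per-base qualities); the default rate and the function name are assumptions for illustration.

```python
from math import comb


def prob_at_least_k_errors(depth: int, k: int, error_rate: float = 0.01) -> float:
    """P(at least k of `depth` reads show the variant purely by error),
    assuming independent reads and one uniform error rate."""
    return sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k, depth + 1)
    )


# The same 3 supporting reads are more convincing at low depth than at high depth.
for depth in (20, 100, 500):
    print(depth, prob_at_least_k_errors(depth, 3))
```

The threshold on the number of supporting reads therefore has to scale with coverage, which is why sensitivity is coverage-dependent.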

VARIANT CALLING: GATK

Genome Analysis Toolkit (Broad Institute)

• Initially developed for 1000 Genomes Project

• Single or multiple sample analysis (cohort)

• Popular tool for germline variant calling

• Evaluates probability of genotype given read data

see http://www.broadinstitute.org/gatk/ and McKenna et al. Genome Research 2010

SOMATIC VARIANT CALLING

Somatic mutations can occur at low frequency (<10%) due to:

• Tumor heterogeneity (multiple clones)

• Low tumor purity (normal cells present in the tumor sample)

Requires different thresholds than germline variant calling when evaluating signal vs noise

Trade-off between sensitivity (ability to detect mutations) and specificity (ability to avoid false positives)

Nature Reviews Cancer 12, 323-334 (May 2012)
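A back-of-the-envelope calculation shows how purity and heterogeneity push somatic allele fractions below 10%. The sketch assumes a heterozygous mutation in a copy-number-neutral diploid region and binomial sampling of reads; the function names and example numbers are illustrative, not taken from the slides.

```python
from math import comb


def expected_vaf(purity: float, clone_fraction: float = 1.0) -> float:
    """Expected variant allele fraction of a heterozygous somatic mutation.

    purity: fraction of tumor cells in the sample.
    clone_fraction: fraction of tumor cells carrying the mutation.
    Assumes two copies of the locus in every cell.
    """
    return purity * clone_fraction / 2


def detection_probability(depth: int, vaf: float, min_alt_reads: int) -> float:
    """P(at least min_alt_reads variant reads) under binomial sampling."""
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1 - p_fewer


vaf = expected_vaf(purity=0.4, clone_fraction=0.5)  # 40% pure, half the clones
for depth in (30, 100, 500):
    print(depth, round(detection_probability(depth, vaf, min_alt_reads=5), 3))
```

Lowering `min_alt_reads` raises sensitivity but admits more sequencing-error calls, which is the trade-off noted above.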

INDEL DETECTION

Small insertions/deletions

The trouble with mapping approaches

modified from Heng Li (Broad Institute)


RE-ALIGNMENT

Re-align reads considering the multi-read context and previously known SNPs and indels...

adapted from Andreas Schreiber

EVALUATING VARIANT QUALITY

TAKING INTO ACCOUNT:

• Coverage at position

• Number of independent reads supporting variant

• Observed allele fraction vs expected (somatic / germline)

• Strand bias

• Base qualities at variant position

• Mapping qualities of reads supporting variant

• Variant position within reads (near ends or at centre)
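Strand bias, one of the criteria listed, is commonly assessed with Fisher's exact test on a 2x2 table of reference/variant reads by forward/reverse strand. The sketch below is a plain-Python two-sided test written for illustration; the function name and example counts are assumptions, and a library routine such as `scipy.stats.fisher_exact` would normally be preferred.

```python
from math import comb


def strand_bias_p_value(ref_fwd: int, ref_rev: int,
                        alt_fwd: int, alt_rev: int) -> float:
    """Two-sided Fisher's exact test p-value for a 2x2 strand table."""
    row_ref, row_alt = ref_fwd + ref_rev, alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    total = row_ref + row_alt

    def table_prob(x: int) -> float:
        # Hypergeometric probability of x forward reads among reference reads.
        return comb(row_ref, x) * comb(row_alt, col_fwd - x) / comb(total, col_fwd)

    p_observed = table_prob(ref_fwd)
    lo, hi = max(0, col_fwd - row_alt), min(row_ref, col_fwd)
    # Sum all tables at least as unlikely as the observed one.
    p = sum(
        table_prob(x) for x in range(lo, hi + 1)
        if table_prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p)


print(strand_bias_p_value(10, 10, 5, 5))   # variant reads balanced across strands
print(strand_bias_p_value(10, 10, 10, 0))  # variant reads on one strand only
```

A small p-value suggests the variant reads come disproportionately from one strand, a frequent signature of artefacts.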

VCF FILES

Variant Call Format

Standard for reporting variants from NGS

Describes metadata of analysis and variant calls

Text file format (open in Text Editor or Excel)

!!! Not a MS Office vCard !!!

see http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
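Since VCF is plain tab-separated text, a data line can be inspected with very little code. The sketch below parses the eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) plus optional per-sample columns; the dictionary keys and function names are choices made for this example, and the sample line follows the layout shown in the VCF 4.1 specification.

```python
def parse_info(info: str) -> dict:
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    result = {}
    if info == ".":
        return result
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        result[key] = value if sep else True  # flags have no '=value' part
    return result


def parse_vcf_line(line: str) -> dict:
    """Parse one VCF data line (not a '#' header line)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    record = {
        "chrom": f[0], "pos": int(f[1]), "id": f[2], "ref": f[3],
        "alt": f[4].split(","),
        "qual": None if f[5] == "." else float(f[5]),
        "filter": f[6], "info": parse_info(f[7]), "samples": [],
    }
    if len(f) > 9:
        keys = f[8].split(":")  # FORMAT column names the per-sample fields
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in f[9:]]
    return record


line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51"
print(parse_vcf_line(line))
```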


NGS DATA ANALYSIS WORKFLOW

VARIANT ANNOTATION

Provide biological & clinical context

Identify disease-causing mutations (among 1000s of variants)

ANNOTATION OVERVIEW

VARIANT FILTERING AND PRIORITIZATION

PURPOSE:

• Identify pathogenic or disease-associated mutation(s)

• Reduce candidate variants to a reportable set

COMMON STEPS:

• Remove poor quality variant calls

• Remove common polymorphisms

• Prioritize variants with high functional impact

• Compare against known disease genes

• Consider mode of inheritance (autosomal recessive, X-linked...)

• Consider segregation in family (where multiple samples available)
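The first four common steps can be sketched as a filter-then-rank function over annotated variants. Everything specific here is an assumption for illustration: the dictionary keys, the quality and population-frequency thresholds, and the example gene names. The impact categories mirror those reported by snpEff. Inheritance and segregation checks are left out because they need family data.

```python
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def prioritize(variants: list[dict], min_qual: float = 30.0,
               max_pop_af: float = 0.01,
               disease_genes: frozenset = frozenset()) -> list[dict]:
    """Drop low-quality and common variants, then rank the remainder."""
    kept = [
        v for v in variants
        if v["qual"] >= min_qual        # remove poor quality calls
        and v["pop_af"] <= max_pop_af   # remove common polymorphisms
    ]
    # Known disease genes first, then by predicted functional impact.
    return sorted(
        kept,
        key=lambda v: (v["gene"] not in disease_genes,
                       IMPACT_RANK.get(v["impact"], len(IMPACT_RANK))),
    )


candidates = [
    {"gene": "GENE_A", "qual": 55.0, "pop_af": 0.0002, "impact": "MODERATE"},
    {"gene": "GENE_B", "qual": 12.0, "pop_af": 0.0001, "impact": "HIGH"},
    {"gene": "GENE_C", "qual": 80.0, "pop_af": 0.2500, "impact": "HIGH"},
    {"gene": "GENE_D", "qual": 70.0, "pop_af": 0.0010, "impact": "HIGH"},
]
for v in prioritize(candidates, disease_genes=frozenset({"GENE_A"})):
    print(v["gene"], v["impact"])
```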

NGS DATA ANALYSIS WORKFLOW

VISUALIZATION: IGV (or Genome Browser, Circos...)

provided by Katherine Pillman

COMMON PIPELINE

bcl2fastq (Illumina)

FastQC (open-source)

Exomes (HiSeq): BWA (open-source), GATK (Broad)

Gene panels (MiSeq, PGM): MiSeq Reporter (Illumina), Torrent Suite (Ion Torrent)

Custom scripts and third-party tools (Annovar, snpEff, PolyPhen, SIFT...)

Commercial annotation software (GeneticistAssistant, VariantStudio...)

COMMON DATA FORMATS

.bcl

.fastq

.BAM

.VCF

.csv

.txt

.xls

.html ...

CONCLUSIONS

NGS data - the new currency of (molecular) biology

Broad applications (ecology, evolution, ag sciences, medical research and clinical diagnostics...).

Rapidly evolving (sequencing technologies, library preparation methods, analysis approaches, software).

Different tools, pipelines and parameter settings give different results (more standards are needed).

Bioinformatics pipelines typically combine vendor software, third-party tools and custom scripts.

Requires skills in scripting, Linux/Unix, HPC.

Requires advanced hardware (not always available).

Understanding the data type (single-end, paired-end, RNA-Seq) is important for successful analysis.