Bioinformatics and genome analysis · docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo...

Bioinformatics and Genome Analysis

Academic year 2015/2016

Pierpaolo Maisano Delser
e-mail: pmaisanodelser@mnhn.fr

Background

• Bachelor's degree (Laurea Triennale): Biological Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;

• Master's degree (Laurea Specialistica): Biomolecular and Cellular Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;

• PhD in Genetics, University of Leicester, Prof. Mark A. Jobling;

• Post-doctoral fellow, EPHE-MNHN, Paris, Dr. Stefano Mona.


Cusco, March 2009

Muséum national d'Histoire naturelle ‐ Paris

Practical information

• Theory + practice;

• Software and tools;

• Files;

• Slides on the website;

• New topics / topics already covered.

Syllabus

• Next-generation sequencing (NGS): how, when, why?

• An example of NGS data management and analysis:
  • data type;
  • files and formats;
  • software;
  • interpretation of the results;
  • error estimation;
  • when to stop?

• Applications and/or projects on different organisms.

NGS: how, when, why?

[Figure: overview of NGS applications and analysis steps. DNA-seq: whole genome; capture (exome/custom/cancer); amplicon sequencing. RNA-seq: whole transcriptome; amplicon sequencing (fixed/custom). Steps shown: sequencing, unaligned reads QC, reads trimming, mapping to a reference genome or de novo assembly, mapping refinement, mapping QC / assembly QC, filtering, validation.]

Question: when?

Answer: when it makes sense!

• A 400 bp amplicon in 100 individuals? → Sanger sequencing

• 50 amplicons in 100 individuals? → NGS + target capture

• Gene conversion, repeated elements, recombination breakpoints? → NGS + Sanger sequencing

Question: why?

Answer: your idea for a project!

An example of NGS data management and analysis

Sequencing platforms:

• Nanopore MinION/GridION

• Pacific Biosciences (PacBio)

• Ion Torrent PGM/Proton

• Roche 454

• Illumina MiSeq/HiSeq


Planning an NGS experiment, one decision at a time:

• project

• project: application (whole genomes? exomes? target capture? amplicon sequencing?)

• project: application: aim (SNPs, indels, repeated elements, CNVs…)

• project: application: aim: coverage (SNPs, indels, repeated elements, CNVs…)

Project:

• Saccharomyces cerevisiae;

• Genome: 16 chromosomes, ~12.5Mb, ~6200 genes;

• Whole genome sequencing;

• Illumina platform;

• Paired‐end reads, 1 library, 2 lanes.
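Since coverage is one of the planning decisions listed above, a rough estimate can help. The sketch below uses the standard approximation coverage = (number of reads × read length) / genome size, with the ~12.5 Mb genome size from this project. The read count and read length are made-up illustration values, not figures from the course data.

```python
def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth of coverage, C = N * L / G."""
    return n_reads * read_length / genome_size


GENOME_SIZE = 12_500_000  # ~12.5 Mb (S. cerevisiae, as quoted in the project slide)

# Hypothetical run: 2 lanes x 1,000,000 read pairs, i.e. 4,000,000 reads of 100 bp.
n_reads = 2 * 1_000_000 * 2
print(expected_coverage(n_reads, 100, GENOME_SIZE))  # 32.0
```

This is an average over the whole genome; actual depth varies from region to region.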

Single-end (SE) or paired-end (PE) sequencing.

fragment              ========================================
fragment + adaptors   ~~~========================================~~~
SE read               --------->
PE reads              R1--------->                    <---------R2
unknown gap                       ....................

Pipeline overview: from raw reads (.fastq) to ready-to-use variants (.vcf)

1. FASTQ quality control + trimming
   • adapters? low-quality bases?
   • input: raw reads (.fastq)

2. Alignment to a reference genome
   • close reference? time limited?
   • distant reference? → stampy
   • output: aligned reads (.sam/.bam)

3. BAM refinement
   • duplicate removal (picard)
   • local realignment (GATK)
   • base recalibration (GATK)

4. BAM check and visualization
   • duplicate metrics (picard)
   • flagstat (samtools)
   • coverage distribution (GATK)

5. Variant calling
   • SNPs/indels
   • single/multi-sample
   • samtools
   • output: raw variants (.vcf)

6. Variant filtering and validation
   • in silico vs in vitro validation
   • vcftools
   • variant score recalibration (big datasets, known SNPs/indels)
   • output: ready-to-use variants (.vcf)

File formats:

.fa/.fasta      sequences
.fastq          read data
.sam (.sai)     mapped reads
.bam (.bai)     mapped reads (binary)
.vcf            variant information

@IL29_4505:7:24:8932:6562#2/1
TAACGGTGGGTGAGTGGTAGTAAGTAGAGGGATGGATGGTGGTTCGGAGTGGTATGGTTGAATGGGACAGGGTAACGAGTGGAGAGTAGGGTAATGGAGGGTAAGTTC
+
CDDCDDABBBABABABB@BCACBDABCBBAB@BBCABBBABB?CBCCABABBABBBBABA?ACBAAAAA?BB;BCAABA7AA?B?A??AAA>?A:AA?AA?%?AA@=9

raw reads (.fastq)

Open the file with a text editor:

gedit s-6-1.fastq

Terminal: more s-6-1.fastq OR head s-6-1.fastq

Fields of the header line @IL29_4505:7:24:8932:6562#2/1:

• IL29_4505: instrument ID;
• 24: tile;
• 8932:6562: coordinates of the cluster;
• #2: index number;
• /1: first mate in the pair (paired-end reads).

The fourth line of each record holds the quality values for each nucleotide.
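A FASTQ record can also be read programmatically. The following is a minimal sketch, assuming exactly four lines per record and the older Illumina header layout instrument:lane:tile:x:y#index/mate. The function names and the shortened five-base example record are invented for illustration; they are not part of the course material.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def parse_header(header: str) -> dict:
    """Split a header of the assumed form instrument:lane:tile:x:y#index/mate."""
    location, mate = header.rsplit("/", 1)
    location, index = location.rsplit("#", 1)
    instrument, lane, tile, x, y = location.split(":")
    return {"instrument": instrument, "lane": int(lane), "tile": int(tile),
            "x": int(x), "y": int(y), "index": index, "mate": int(mate)}


# Shortened, hypothetical record using the header shown on the slide.
example = io.StringIO("@IL29_4505:7:24:8932:6562#2/1\nTAACG\n+\nCDDCD\n")
for record in read_fastq(example):
    print(parse_header(record.header), len(record.sequence))
```

Newer Illumina headers use a different layout, so `parse_header` would need adapting for them.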

Quality values for each nucleotide (base quality score)

ASCII characters, from lowest (33) to highest (126):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Illumina 1.8+: Phred+33, raw reads typically in the range (0, 41), i.e. characters "!" to "J".

Phred-scale value:

$Q = -10 \log_{10} P \;\Rightarrow\; P = 10^{-Q/10}$

Phred quality score (Q)   Probability of incorrect base call (P)   Base call accuracy
10                        1 in 10                                  90%
20                        1 in 100                                 99%
30                        1 in 1000                                99.9%
40                        1 in 10000                               99.99%
50                        1 in 100000                              99.999%
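The conversion from quality characters to error probabilities can be sketched in a few lines. This assumes Phred+33 encoding, as described above for Illumina 1.8+. The function names are illustrative.

```python
def decode_qualities(quality_string: str, offset: int = 33) -> list:
    """Convert ASCII quality characters to Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_string]


def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


# A few characters taken from the quality line of the example read.
for q in decode_qualities("CDD%9"):
    print(q, phred_to_error_prob(q))
```

Here "C" decodes to Q34 and "%" to Q4, so a single low-quality character marks a base that is far more likely to be wrong than right-hand neighbours with high scores.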

raw reads (.fastq)

• Move into folder lane2;

• Open s-7-1.fastq:
  gedit s-7-1.fastq
  Terminal: more s-7-1.fastq OR head s-7-1.fastq

• Do s-6-1.fastq and s-7-1.fastq come from two different lanes?


1‐ Fastq quality control + trimming

FastQC: quality control of the raw data coming out of the sequencer

• Evaluation of the quality of the generated data;

• Basic summary statistics of the raw data;

• Several modules to evaluate different features (e.g. adapters, base quality, etc.);

• Feedback (green, orange, red): do not rely on it blindly; think about what each result means!

Per base sequence quality: warning

What can we do to improve the quality at the end of the reads?

Read trimming: removal of low-quality bases from the 3' ends of the reads.

[Figure: FastQC per-base quality plot; the position bins highlighted are 90-94 bp and 95-99 bp.]
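The idea of 3' quality trimming can be illustrated with a deliberately simple rule: cut bases from the 3' end until one reaches a minimum quality. This is a sketch of the principle only, not the algorithm of any particular trimming tool. The threshold of Q20 and the Phred+33 offset are assumptions.

```python
def trim_3prime(sequence: str, quality: str, min_q: int = 20, offset: int = 33):
    """Drop bases from the 3' end until one has quality >= min_q.

    Returns the trimmed (sequence, quality) pair.
    """
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# Hypothetical read: the last two bases have qualities 2 ('#') and 0 ('!').
print(trim_3prime("ACGTACGT", "IIIII5#!"))  # ('ACGTAC', 'IIIII5')
```

Real trimmers usually apply a sliding window or a running-sum criterion, so that a single good base near the end does not stop the trimming early.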

Per sequence quality score: pass

Sequence length: pass

Adapter removal

[Figure: FastQC modules flagged as "failed" or "warning".]

Overrepresented sequences

Removal of overrepresented sequences (PCR primers).
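To see what "overrepresented" means in practice, one can count identical read sequences and report those above a chosen fraction of the total. The sketch below is an assumption-laden illustration: the function name and the threshold are hypothetical, and it does not reproduce how FastQC itself samples or flags sequences.

```python
from collections import Counter


def overrepresented(sequences, min_fraction=0.001):
    """Return (sequence, fraction) pairs whose share of all reads is >= min_fraction."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return [(seq, n / total)
            for seq, n in counts.most_common()
            if n / total >= min_fraction]


# Hypothetical toy input: one sequence makes up 75% of the reads.
reads = ["AAAA", "AAAA", "AAAA", "CCCC"]
print(overrepresented(reads, min_fraction=0.5))  # [('AAAA', 0.75)]
```

Sequences reported this way can then be compared against known adapters or primers before deciding whether to remove them.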

FastQC references:

• Software website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

• Manual: https://insidedna.me/tool_page_assets/pdf_manual/fastqc.pdf


2- Alignment to a reference genome

Alignment: the process of determining the most likely location within the genome for an observed DNA read.

[Figure: raw reads aligned to the reference genome.]

Trade-off: speed vs sensitivity. The higher the accuracy, the longer the alignment run.

Two classes of methods:

Burrows-Wheeler
• fast
• less robust at high divergence from the reference genome
• e.g. bwa

Hashing
• slow (needs more memory)
• robust at high divergence from the reference genome
• e.g. stampy

The shorter the read, the harder it is to find its location in the genome.

Large amounts of data: computationally challenging in terms of memory and speed.
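The hashing approach can be illustrated with a toy k-mer index: every k-mer of the reference is stored in a hash table with its positions, and the first k-mer of a read is looked up as a seed. This is a teaching sketch only, not the algorithm used by stampy or any real aligner. The reference string and the value of k are made up.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read: str, index: dict, k: int) -> list:
    """Seed lookup only: positions where the read's first k-mer occurs."""
    return index.get(read[:k], [])


reference = "ACGTTGCAACGTTGGA"  # hypothetical reference with a repeated "ACGTTG"
index = build_kmer_index(reference, k=4)
print(candidate_positions("ACGTTGG", index, k=4))  # [0, 8]
```

The index holds one entry per reference position, which is why hashing methods need more memory. The seed "ACGT" also hits two positions here, so even a toy example shows how repeats produce several candidate locations for one read.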

What if there are several possible places to align your sequencing read?

This may be due to:
• repeated elements in the genome
• low-complexity sequences
• reference errors and gaps

Mapping quality (MQ) is a Phred-scaled score of the quality of the alignment:
• high MQ: a single match
• low MQ: the probability of mapping to different locations is high, but there are no perfect multiple matches
• MQ0: a perfect multiple match
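Because MQ is Phred-scaled, it can be read as the probability that the reported position is wrong. The sketch below converts a probability into an MQ value. The cap of 60 is an assumption chosen for illustration, since each aligner defines its own maximum.

```python
import math


def error_prob_to_mapq(p_wrong: float, cap: int = 60) -> int:
    """MQ = -10 * log10(P(wrong position)), rounded and capped (cap is an assumption)."""
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


# Two equally good locations: the chosen one is wrong half of the time.
print(error_prob_to_mapq(0.5))    # 3
# A placement that is wrong once in a thousand.
print(error_prob_to_mapq(0.001))  # 30
```

A read with several perfect matches has no basis for preferring one location, which is why such reads end up at or near MQ0.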

[Figure series: a reference sequence carrying two copies of a repeated element (Element 1, Element 2), with reads from Sample_1 aligned to it.

• Perfect multiple matches → MQ0. Not a perfect match → low MQ.

• Panels labelled "1 copy" and "2 copies" compare the number of copies in the reference sequence and in Sample_1. Reads placed on the wrong copy produce false heterozygous calls, visible as a cluster of heterozygotes.

• Example: AluSg7.]

2- Alignment to a reference genome: reference sequence

Create the index of the reference genome (for bwa, samtools and picard).

bwa index: this is an FM-index, specific to the algorithm behind this aligner.

bwa index -a is Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa

index .fai

samtools faidx Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa

The index file stores records of sequence identifier, length, the offset of the first sequence character in the file, the number of characters per line, and the number of bytes per line.

[Figure: the .fai index file, with annotations marking "50 characters" and "60 characters".]
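The five .fai fields described above (identifier, length, offset of the first sequence character, characters per line, bytes per line) can be computed directly from a FASTA file. The following is a minimal sketch for illustration, assuming every sequence line except the last has the same length. It is not a replacement for samtools faidx, and the two tiny sequences are made up.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def fai_records(fasta: BinaryIO) -> Iterator[Tuple[str, int, int, int, int]]:
    """Yield (name, length, offset, linebases, linewidth) for each sequence."""
    name = None
    length = seq_offset = linebases = linewidth = 0
    offset = 0  # byte position of the start of the current line
    for raw in fasta:
        if raw.startswith(b">"):
            if name is not None:
                yield name, length, seq_offset, linebases, linewidth
            name = raw[1:].split()[0].decode("ascii")
            length = linebases = linewidth = 0
            seq_offset = offset + len(raw)  # first base follows the header line
        elif name is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if linebases == 0:  # first sequence line defines the line layout
                linebases, linewidth = bases, len(raw)
            length += bases
        offset += len(raw)
    if name is not None:
        yield name, length, seq_offset, linebases, linewidth


# Hypothetical two-sequence FASTA with 6 bases (7 bytes) per line.
example = io.BytesIO(b">chrI test\nACGTAC\nGTACGT\nAC\n>chrII\nGGCC\n")
for record in fai_records(example):
    print(*record, sep="\t")
# chrI    14  11  6  7
# chrII   4   35  4  5
```

The difference between the last two columns is the line terminator: one byte for Unix line endings, two for Windows ones. With these numbers a tool can jump straight to any base without reading the whole file.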

Create the dictionary of the reference genome (for samtools, gatk and picard).

dictionary .dict: list of the contigs included in the fasta file of the reference genome

java -jar picard.jar CreateSequenceDictionary REFERENCE=Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa OUTPUT=Saccharomyces_cerevisiae.EF4.68.dna.toplevel.dict

Keep the index and dictionary files in the same directory as the reference file!

[Figure: the .dict file, with annotations for each entry: sequence name, sequence length, path, MD5 checksum.]
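The four annotated fields correspond to the SN, LN, UR and M5 tags of a SAM-header @SQ line. A sketch of how one such line could be assembled is shown below. The tag order and the MD5 being taken over the upper-cased sequence follow the SAM specification as I understand it; the exact set of tags Picard writes may differ, and the sequence and path are invented.

```python
import hashlib


def sq_line(name: str, sequence: str, uri: str) -> str:
    """Build an @SQ line: name (SN), length (LN), MD5 checksum (M5), path (UR)."""
    md5 = hashlib.md5(sequence.upper().encode("ascii")).hexdigest()
    return f"@SQ\tSN:{name}\tLN:{len(sequence)}\tM5:{md5}\tUR:{uri}"


# Hypothetical contig and path, for illustration only.
print(sq_line("I", "acgtacgt", "file:/path/to/reference.fa"))
```

The checksum lets downstream tools verify that a BAM file was aligned against exactly the same reference sequence, independent of file name or letter case.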

2‐ Alignment to a reference genome: mapping with bwa‐mem

Three different algorithms:

1. BWA-backtrack: for Illumina reads up to 100 bp;

2. BWA-SW: long-read support, split alignment;

3. BWA-MEM: long-read support, split alignment, faster, more accurate.

• paired‐end alignment (lane1);

• it uses the reference genome (.fa) and the reads (.fastq) to create a SAM file;

• Option to mark shorter split hits as secondary (not supplementary).

[Figure: split read (Karakoc E et al., 2012).]

bwa mem [options] [RefSeq] [lane1_fastq1] [lane1_fastq2] > lane1.sam
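When the same command has to be run for several lanes, it can be convenient to build it in a script. The sketch below assembles a bwa mem command line using the -t (threads) and -M (mark shorter split hits as secondary) options. The lane1 folder layout and the mate file name s-6-2.fastq are assumptions, since only s-6-1.fastq is named in the slides.

```python
import shlex


def bwa_mem_command(reference, fastq1, fastq2, threads=1, mark_secondary=True):
    """Return the argument list for a paired-end bwa mem run."""
    cmd = ["bwa", "mem", "-t", str(threads)]
    if mark_secondary:
        cmd.append("-M")  # mark shorter split hits as secondary
    cmd += [reference, fastq1, fastq2]
    return cmd


cmd = bwa_mem_command(
    "Saccharomyces_cerevisiae.EF4.68.dna.toplevel.fa",
    "lane1/s-6-1.fastq",   # first mates
    "lane1/s-6-2.fastq",   # second mates (hypothetical file name)
    threads=2,
)
print(shlex.join(cmd), "> lane1.sam")
```

To execute it rather than print it, the list can be passed to `subprocess.run(cmd, stdout=open("lane1.sam", "w"), check=True)`, provided bwa is installed and the reference has been indexed.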
