Workshop NGS data analysis - 1


Upload: mate-ongenaert

Post on 08-May-2015


DESCRIPTION

Workshop 1 - NGS data analysis - Introduction - Sequence flow and analysis - Data formats throughout sequencing data analysis

TRANSCRIPT

Page 1: Workshop NGS data analysis - 1

Sequencing data analysis: Workshop – part 1 / main principles and data formats

Outline

Introduction

Sequencing flow

Main data formats throughout this flow

Maté Ongenaert

Page 2: Workshop NGS data analysis - 1

Introduction: Sequencing technology

The real cost of sequencing

Page 3: Workshop NGS data analysis - 1

Introduction: Sequencing technology

The real cost of sequencing

Question:

- What fraction of the cost of an NGS study goes to:
(1) Sample collection and experimental design
(2) Sequencing itself
(3) Data reduction and management
(4) Downstream analysis

Is this a surreal question? Not at all: imagine you are writing a grant proposal for an NGS ChIP-seq experiment with 24 samples.

- You would need 3 HiSeq 2000 lanes that cost you 8000 €
- Sample preparation costs 1000 €
- Other costs: 1000 €

Do you ever include analysis costs? Personnel, infrastructure, …

Page 4: Workshop NGS data analysis - 1

Introduction: Sequencing technology

The real cost of sequencing

Page 5: Workshop NGS data analysis - 1

Introduction: Sequencing technology

Page 6: Workshop NGS data analysis - 1

Introduction: Sequencing technology

Page 7: Workshop NGS data analysis - 1

Introduction: Sequencing technology

Page 8: Workshop NGS data analysis - 1

Introduction: Sequencing technology

Page 9: Workshop NGS data analysis - 1

Introduction: Sequencing technology

Page 10: Workshop NGS data analysis - 1

Sequencing data analysis: Workshop – part 1 / main principles and data formats

Outline

Introduction

Sequencing flow

Main data formats throughout this flow

Maté Ongenaert

Page 11: Workshop NGS data analysis - 1

Sequencing flow: Steps in sequencing experiments

Data analysis

Raw machine reads… What’s next?

Preprocessing (machine/technology dependent)
- Adaptors, indexes, conversions, …

Reads with associated qualities (universal)
- FASTQ
- QC check

Depending on the application (generally applicable)
- ‘De novo’ assembly of a genome (bacterial genomes, …)
- Mapping to a reference genome, giving mapped reads: SAM/BAM/…

High-level analysis (application specific)
- SNP calling
- Peak calling

Page 12: Workshop NGS data analysis - 1

Sequencing flow: Steps in sequencing experiments

Page 13: Workshop NGS data analysis - 1

Sequencing data analysis: Workshop – part 1 / main principles and data formats

Outline

Introduction

Sequencing flow

Main data formats throughout this flow

Maté Ongenaert

Page 14: Workshop NGS data analysis - 1

Sequencing flow: Steps in sequencing experiments

Main data formats:
- Raw reads
- Mapped reads
- Application dependent: ChIP-seq peaks, SNPs: their location and their characteristics

> Intended for: visualization / further analysis (by humans or computers) / reduction ??

Page 15: Workshop NGS data analysis - 1

Sequencing data formats: Raw reads

Raw sequence reads:

- Represent the sequence ~ FASTA

>SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

- Extension: represent the quality, per base ~ FASTQ (Q for quality)

@SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

- OK, the strange characters on the last line of the FASTQ record encode the quality of the corresponding base… But what is the decoding scheme? (Nerd alert ahead!)

- We want to represent quality scores ~ Phred scores
- Q = -10 log10(P), with P the probability that the base was called in error

Phred quality scores are logarithmically linked to error probabilities:

Phred quality score   Probability of incorrect base call   Base call accuracy
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
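The relation between Phred score and error probability can be checked with a few lines of Python. This is only a minimal sketch; the function names are illustrative and not part of any sequencing tool.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P): return P, the chance that the base call is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(P) to an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the table: score, error probability, accuracy.
for q in (20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f} %")
```

Every 10 points on the Phred scale divide the error probability by 10, which is why a difference between Q20 and Q30 matters more than it looks.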

Page 16: Workshop NGS data analysis - 1

Sequencing data formats: Raw reads

- Phred scores thus typically have 2 digits, but you want a single character per base so that the quality line stays aligned with the sequence line… What would a nerd do? Use ASCII as a lookup table, of course! One character ~ one decimal number

Page 17: Workshop NGS data analysis - 1

Sequencing data formats: Raw reads

@SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

- OK, so the character 5 is actually ASCII code 53… But the printable characters only start at 33… So 5 actually stands for 53 - 33 = 20 as Phred quality…
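The ASCII lookup described above can be sketched in Python as follows, assuming the Phred+33 offset used in this example. The function name `decode_qualities` is made up for illustration.

```python
from typing import List


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Turn each quality character into its Phred score: ASCII code minus the offset."""
    return [ord(ch) - offset for ch in quality]


quality_line = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = decode_qualities(quality_line)

# One score per base: the first character '!' and the last character '5'.
print(scores[0], scores[-1])
```

Passing `offset=64` gives the decoding for the older Illumina scheme; with the wrong offset the scores come out shifted by 31.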

Page 18: Workshop NGS data analysis - 1

Sequencing data formats: Raw reads

@SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Example of the identifier line for Illumina data (non-multiplexed):
@machine_id:lane:tile:x:y#multiplex/pair
@HWUSI-EAS100R:6:73:941:1973#0/1
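As an illustration, the identifier line shown above can be split into its parts with a regular expression. This sketch assumes the older layout of the example (index after `#`, pair member after `/`); newer Illumina pipelines write a different identifier layout that this pattern does not cover. The class and function names are hypothetical.

```python
import re
from typing import NamedTuple


class IlluminaId(NamedTuple):
    machine_id: str
    lane: int
    tile: int
    x: int
    y: int
    multiplex: str  # index of a multiplexed sample, "0" when not multiplexed
    pair: int       # 1 or 2, the member of a pair


# machine_id:lane:tile:x:y#multiplex/pair, with an optional leading '@'
_ID_PATTERN = re.compile(r"^@?([^:]+):(\d+):(\d+):(\d+):(\d+)#([^/]+)/(\d+)$")


def parse_illumina_id(header: str) -> IlluminaId:
    """Split an identifier line of the form shown on the slide into named fields."""
    match = _ID_PATTERN.match(header.strip())
    if match is None:
        raise ValueError(f"unexpected identifier layout: {header!r}")
    machine, lane, tile, x, y, multiplex, pair = match.groups()
    return IlluminaId(machine, int(lane), int(tile), int(x), int(y), multiplex, int(pair))


print(parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```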

- Sanger: Phred+33
- Illumina 1.3+: Phred+64
- Illumina 1.5+: Phred+64
- Illumina 1.8+: Phred+33
- SOLiD: Sanger (Phred+33)

Check your instrument and its version. FastQC will give you a hint about which scoring scheme is probably used.

File extensions: .fastq / .fq
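A rough way to guess the offset yourself is to look at the range of characters in the quality lines. The sketch below is a heuristic under stated assumptions, not what FastQC does internally: characters below ASCII 59 can only occur with the +33 offset, and characters above 74 (Phred 41 at offset 33) are taken as a sign of +64. The thresholds and the function name are assumptions made for this example.

```python
from typing import Iterable, Optional


def guess_phred_offset(quality_lines: Iterable[str]) -> Optional[int]:
    """Return 33 or 64 when the characters make the offset clear, else None."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    if min(codes) < 59:
        # Too low for any +64 scheme (59 leaves room for old Solexa-scaled scores).
        return 33
    if max(codes) > 74:
        # Higher than Phred 41 at offset 33: assumed to be a +64 encoding.
        return 64
    return None  # ambiguous: look at more reads or check the instrument version


example = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
print(guess_phred_offset([example]))
```

A handful of reads can be ambiguous, so in practice you would feed in the quality lines of many reads before trusting the answer.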

Page 19: Workshop NGS data analysis - 1

Sequencing data formats: Raw reads

- Special: SRA files from the NCBI/EBI Sequence Read Archive
- Contains raw sequence data from (GEO) studies for all kinds of instruments and platforms
- Exercise: we have submitted NGS (MBD-seq) data for 8 NB cell lines to GEO and the raw data to SRA; find the SRA files. How would you obtain our originally submitted FASTQ files? (HINT: SRA Toolkit)
- Exercise (caution: nerd alert): working in the terminal… Retrieve the FASTQ file from the SRA file and perform a FastQC analysis
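One possible route for the second exercise, sketched in Python so the commands can be inspected before they are run. It assumes that `fastq-dump` from the SRA Toolkit and `fastqc` are installed and on the PATH, that the output directory already exists, and that the run yields a file named `<accession>_1.fastq`. The accession `SRR0000000` is a placeholder, not one of the workshop runs.

```python
import shlex
from typing import List


def sra_to_fastqc_commands(accession: str, out_dir: str = "fastq") -> List[List[str]]:
    """Build the two command lines: SRA run -> FASTQ file(s) -> FastQC report."""
    fastq_file = f"{out_dir}/{accession}_1.fastq"  # assumed output name, see lead-in
    return [
        # Convert the SRA run to FASTQ; --split-files writes one file per read of a pair.
        ["fastq-dump", "--split-files", "--outdir", out_dir, accession],
        # Run the quality check on the resulting FASTQ file.
        ["fastqc", "--outdir", out_dir, fastq_file],
    ]


# Print the commands; nothing is executed here.
for command in sra_to_fastqc_commands("SRR0000000"):
    print(" ".join(shlex.quote(arg) for arg in command))
```

To actually execute them, each list can be passed to `subprocess.run(command, check=True)` on a machine where both tools are available.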

Page 20: Workshop NGS data analysis - 1

Linux… for human beings? The terminal

What they show in ‘The Matrix’ is a real Linux terminal and real commands…

Page 21: Workshop NGS data analysis - 1

Linux… for human beings? The terminal

Page 22: Workshop NGS data analysis - 1

Linux… for human beings? The terminal

Server: ***********
Port: *****
Login: *********
Password: *********

You will not see anything on screen while you type the password…

Page 23: Workshop NGS data analysis - 1

Linux… for human beings? The terminal

You are interactively logged in now! This means that everything you type is sent to the server and executed.

+ Fast, no eye candy
+ Easy to develop a command-line interface

- Not so intuitive
- Steep learning curve
- High nerd level

You may have to type bash to see a prompt that starts with student@mellfire:/home/student

Where are you?
/ is the root
/home is the folder with user documents

Page 24: Workshop NGS data analysis - 1

Linux… for human beings? The terminal

cd: Change directory; cd .. goes up one level, cd ../../.. goes up three levels

mkdir: Make directory (a directory is a folder)

cp: Copy

mv: Move

ls (-ahl): List all contents of a folder (DOS: dir)

rm: Remove (DOS: del)

man: Manual (q to quit man)

Page 25: Workshop NGS data analysis - 1

Linux… for human beings? The terminal

vi: Text editor (:q! to exit vi without saving)

head and tail: Show the first / last lines of a text file

top: Table of processes

who and whoami: who lists the users that are logged in; whoami prints your own user name

Page 26: Workshop NGS data analysis - 1

Linux… for human beings? The terminal

Page 27: Workshop NGS data analysis - 1

Sequencing data formats: Mapped reads

- Mapping: ‘align’ these raw reads to a reference genome
- Single-end or paired-end data?
- How would you align a short read to the reference?
- Old-school: Smith-Waterman, BLAST, BLAT, …
- Now: mapping tools for short reads that use intelligent indexing and allow mismatches

Table: short-read mapping programs compared by algorithm and other features
- Programs: SOAP [51], MAQ [54], Mosaik, Eland, SSAHA2 [61], Bowtie [67], BWA [69], BWA-SW [69], SOAP2 [70]
- Algorithm columns: hash table (hash reference, hash reads), suffix tree (suffix tree, enhanced suffix array, FM-index), merge sorting
- Other feature columns: colorspace, 454, quality, paired end, long reads, bisulfite

Page 28: Workshop NGS data analysis - 1

Sequencing data formats: Mapped reads

- Most commonly used worldwide and in our lab as well: BWA and Bowtie, both using the Burrows-Wheeler transform and FM-indexes
- Optimized for short NGS reads (from about 30 bp to about 200 bp)
- Versions exist for longer reads (such as 454): Bowtie2 and BWA-SW

- What would a file describing mapped reads contain?
- Position: chr / start / stop
- Sequence: read / reference
- Mismatches / indels vs. the reference
- Quality information

- A few years ago, each tool had its own output format (Bowtie, …)
- Now moving to a common file format: SAM / BAM (Sequence Alignment/Map)

Page 29: Workshop NGS data analysis - 1

Sequencing data formats: Mapped reads

- Now moving to a common file format: SAM / BAM (Sequence Alignment/Map)

DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION
# QNAME: template name
# FLAG: bitwise flag
# RNAME: reference name
# POS: mapping position
# MAPQ: mapping quality
# CIGAR: CIGAR string
# RNEXT: reference name of the mate/next fragment
# PNEXT: position of the mate/next fragment
# TLEN: observed template length
# SEQ: fragment sequence
# QUAL: ASCII of Phred-scale base quality+33

# Headers
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
# Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
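To make the 11 fields concrete, here is a minimal Python sketch that parses one alignment line, decodes the FLAG bits and uses the CIGAR string to find where the read ends on the reference. It assumes tab-separated fields, as in real SAM files (the slide shows them separated by spaces). The function names are illustrative, and for real data a dedicated library is the safer choice.

```python
import re
from typing import Dict, List, Tuple

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

# A subset of the FLAG bits; each bit switches one property on.
FLAG_BITS = {1: "paired", 2: "proper pair", 4: "unmapped", 8: "mate unmapped",
             16: "reverse strand", 32: "mate reverse strand",
             64: "first in pair", 128: "second in pair"}


def parse_sam_line(line: str) -> Dict[str, object]:
    """Map the 11 mandatory columns to their names; extra columns go to TAGS."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 11:
        raise ValueError("an alignment line needs at least 11 tab-separated fields")
    record: Dict[str, object] = {
        name: (int(value) if name in INT_FIELDS else value)
        for name, value in zip(SAM_FIELDS, parts)
    }
    record["TAGS"] = parts[11:]
    return record


def decode_flag(flag: int) -> List[str]:
    """List the properties whose bit is set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as 8M2I4M1D3M into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


line = "r001\t163\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
record = parse_sam_line(line)
end = int(record["POS"]) + reference_span(str(record["CIGAR"])) - 1
print(record["QNAME"], decode_flag(int(record["FLAG"])), record["POS"], end)
```

Insertions (I) and soft clips (S) are part of the read but not of the reference, which is why they are left out of the span.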

Page 30: Workshop NGS data analysis - 1

Sequencing data formats: Mapped reads

- BAM: binary, compressed version of SAM: not human readable, but it can be indexed so that other tools / visualisation / … get fast access to any region

- Exercise: view a BAM file in IGV

Page 31: Workshop NGS data analysis - 1

Sequencing data formats: Other formats

- BED files (location / annotation / scores): Browser Extensible Data
Used for mapping / annotation / peak locations / … (binary version: bigBed)

FIELDS USED:
# chr
# start
# end
# name
# score
# strand

track name=pairedReads description="Clone Paired Reads" useScore=1
#chr start end name score strand
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
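Reading such a BED file takes only a few lines of Python. The sketch below assumes the six columns listed above and skips track, browser and comment lines. One detail that is easy to miss: BED start positions are 0-based and the end is exclusive, so the length of a feature is simply end minus start. The names `BedFeature` and `parse_bed` are made up for this example.

```python
from typing import List, NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int   # 0-based
    end: int     # exclusive
    name: str
    score: int
    strand: str

    @property
    def length(self) -> int:
        return self.end - self.start


def parse_bed(text: str) -> List[BedFeature]:
    """Parse 6-column BED text, ignoring track/browser/comment lines."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name, score, strand = line.split()[:6]
        features.append(BedFeature(chrom, int(start), int(end), name, int(score), strand))
    return features


bed_text = """track name=pairedReads description="Clone Paired Reads" useScore=1
#chr start end name score strand
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
"""
for feature in parse_bed(bed_text):
    print(feature.name, feature.chrom, feature.length, feature.strand)
```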

- BEDGraph files (location, combined with score)
Used to represent peak scores

track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start end score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50

Page 32: Workshop NGS data analysis - 1

Sequencing data formats: Other formats

- WIG files (location / annotation / scores): wiggle
Used to visualize or summarize data, in most cases count data or normalized count data (RPKM). Binary version: bigWig (often used in GEO for ChIP-seq peaks)

browser position chr19:59304200-59310700
browser hide all
# 150 base wide bar graph at arbitrarily spaced positions,
# threshold line drawn at y=11.76
# autoScale off viewing range set to [0:25]
# priority = 10 positions this as the first graph
track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
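The variableStep layout becomes clearer when it is expanded into explicit intervals. The sketch below converts variableStep data into bedGraph-style rows. It relies on the convention that wiggle positions are 1-based while bedGraph starts are 0-based with an exclusive end, and it handles variableStep only (not fixedStep). The function name is an assumption made for this example.

```python
from typing import List, Tuple


def variable_step_to_bedgraph(text: str) -> List[Tuple[str, int, int, float]]:
    """Expand variableStep wiggle data into (chrom, start, end, value) rows."""
    chrom = None
    span = 1
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "browser", "track")):
            continue
        if line.startswith("fixedStep"):
            raise ValueError("this sketch only handles variableStep data")
        if line.startswith("variableStep"):
            # Declaration line, e.g. "variableStep chrom=chr19 span=150"
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line found before a variableStep declaration")
        position, value = line.split()
        start = int(position) - 1  # 1-based wiggle position -> 0-based start
        rows.append((chrom, start, start + span, float(value)))
    return rows


wig_text = """track type=wiggle_0 name="variableStep" description="variableStep format"
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
"""
for row in variable_step_to_bedgraph(wig_text):
    print(*row)
```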

Page 33: Workshop NGS data analysis - 1

Sequencing data formats: Other formats

- GFF format (General Feature Format)
Used for annotation of genetic / genomic features, such as all coding genes in Ensembl
Often used in downstream analysis to assign annotation to regions / peaks / …

FIELDS USED:
# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature, for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single item)

track name=regulatory description="TeleGene(tm) Regulatory Regions"
#chr source feature start end score str fr group
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2
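The nine GFF fields can be read with a small Python sketch like the one below. Unlike BED, GFF coordinates are 1-based and include the end position, so the length of a feature is end minus start plus one. Real GFF files are tab-separated; splitting on any whitespace here is a simplification that works for the example lines above, and the names used are hypothetical.

```python
from typing import NamedTuple


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int   # 1-based
    end: int     # inclusive
    score: str   # kept as text because it may be '.'
    strand: str
    frame: str
    group: str

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def parse_gff_line(line: str) -> GffFeature:
    """Split one GFF line into its nine fields (the group field may contain spaces)."""
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("a GFF line needs 9 fields")
    seqname, source, feature, start, end, score, strand, frame, group = fields
    return GffFeature(seqname, source, feature, int(start), int(end),
                      score, strand, frame, group)


enhancer = parse_gff_line("chr22 TeleGene enhancer 1000000 1001000 500 + . touch1")
print(enhancer.feature, enhancer.length, enhancer.group)
```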

Page 34: Workshop NGS data analysis - 1

Sequencing data formats: Other formats

- VCF format (Variant Call Format)
For SNP representation

Page 35: Workshop NGS data analysis - 1

Sequencing data formats: Other formats

- http://genome.ucsc.edu/FAQ/FAQformat.html

- UCSC browser data formats, including the most commonly used and widely accepted formats

- In addition, ENCODE data formats (narrowPeak / broadPeak)
