coverage sequence formats - karolinska...



Post on 26-May-2020


Page 1: Coverage Sequence Formats - Karolinska Institutetsandberg.cmb.ki.se/.../NextGenSeq_Bioinformatics.pdf · Recently introduced new formats for e#cient viewing of large data sets: -

Bioinformatics in next generation sequencing projects

Rickard Sandberg
Assistant Professor
Department of Cell and Molecular Biology
Karolinska Institutet

March 2010

Tuesday, March 9, 2010

Overview of Analyses Procedure

Primary Analyses:

1. Image analysis and basecalling: raw image (TB) to sequence format (Gb)

2. Mapping (assembly): sequence format (Gb) to mapped reads or assembled contigs (Gb)

3. Application-specific analyses: ChIP-Seq peaks, SNP, metagenomic, expression levels

Visualization

Coverage

File formats

Methods

Available programs

Available data


Sequence Formats

Fasta file:
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC

Fastq file:
@SRR001666.1
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

SOLiD, csfasta file:
>1_39_146_F3
T22100200202311030112002022222002021
>1_39_194_F3
T11022322003020303320012223122202221

SOLiD, QV file:
>1_39_146_F3
14 6 21 27 5 18 6 15 22 27 18 17 14 18 26 15 24 19 18 18 8 20 17 12 20 6 14 13 23 6 11 12 7 13 4
>1_39_194_F3
26 27 16 27 23 22 23 25 22 10 5 21 4 17 20 26 26 17 25 27 23 25 14 24 26 4 4 4 4 4 4 4 4 4 14
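As a rough illustration of the Fastq layout above (header, sequence, "+" line, quality string), the sketch below reads four-line records from a text stream. The names FastqRecord and read_fastq are made up for this example, and it assumes each record occupies exactly four lines, which holds for most short-read files but is not guaranteed by the format.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line): stop reading
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# The SRR001666.1 read from the slide, written as a small in-memory file.
example = io.StringIO(
    "@SRR001666.1\n"
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC\n"
    "+SRR001666.1\n"
    + "I" * 30 + "9IG9IC\n"
)
records = list(read_fastq(example))
print(records[0].name, len(records[0].sequence))
```

For real files, replacing the StringIO object with an opened file handle should be enough; compressed input would need gzip.open in text mode.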

Phred Quality Score, Q

Each base call has an estimate of the probability of being wrong (error probability, p)

Q = -10 * log10(p)

Phred Quality Score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90 %
20                     1 in 100                              99 %
30                     1 in 1000                             99.9 %
40                     1 in 10000                            99.99 %
50                     1 in 100000                           99.999 %
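The relation Q = -10 * log10(p) can be checked numerically. The short sketch below converts in both directions and prints the rows of the table above; the function names are illustrative only.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Invert Q = -10 * log10(p) to get the error probability p."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Compute Q = -10 * log10(p) for an error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q={q}: 1 in {round(1 / p)}, accuracy {100 * (1 - p):.3f} %")
```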

FastQ encodings

• Sanger FastQ: Phred score from 0-93 using the ASCII characters 33-126

• Solexa/Illumina (1.3+ pipeline): Phred score from 0-62 using the ASCII characters 64-126

• Solexa (older pipelines): Solexa score from -5 to 62 using the ASCII characters 59-126

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126

S - Sanger: Phred+33, 41 values (0, 40)
I - Illumina 1.3: Phred+64, 41 values (0, 40)
X - Solexa: Solexa+64, 68 values (-5, 62)
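The offsets in the legend (Phred+33, Phred+64, Solexa+64) translate directly into code: a score is the character's ASCII code minus the offset. The sketch below is a minimal illustration with made-up names (OFFSETS, decode_quality). The Solexa-to-Phred conversion is not on the slide; it follows from the usual definition of the Solexa score as -10 * log10(p / (1 - p)) and is included here as an assumption.

```python
import math
from typing import List

# ASCII offsets for the three encodings in the legend.
OFFSETS = {"sanger": 33, "illumina1.3": 64, "solexa": 64}


def decode_quality(quality: str, encoding: str = "sanger") -> List[int]:
    """Turn a quality string into integer scores by subtracting the offset."""
    offset = OFFSETS[encoding]
    return [ord(ch) - offset for ch in quality]


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to a Phred score (assumed odds-based definition)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# Tail of the quality string from the Fastq example, read as Sanger (Phred+33).
print(decode_quality("II9IG9IC", "sanger"))
print(round(solexa_to_phred(0), 2))
```

Decoding a file with the wrong offset shifts every score by 31, so it is worth checking the range of characters in a file before choosing an encoding.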


FastQ example

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC


Assembly

Velvet and SOAPdenovo

De novo genomic assemblers specially designed for short read sequencing technologies

Nature 2009

de Bruijn graph
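To make the de Bruijn graph idea concrete, the sketch below builds one from a couple of toy reads: every k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This only illustrates graph construction; it is not how Velvet or SOAPdenovo are implemented, and it leaves out error correction, reverse complements and graph simplification. The function name and the reads are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            if suffix not in graph[prefix]:  # store each edge only once
                graph[prefix].append(suffix)
    return dict(graph)


# Two overlapping toy reads; their shared k-mer GGCG collapses into one edge.
graph = build_de_bruijn_graph(["ATGGCG", "GGCGTG"], k=4)
for node, successors in graph.items():
    print(node, "->", ", ".join(successors))
```

Walking the edges from ATG onward spells out the merged sequence ATGGCGTG, which is the intuition behind assembling contigs from paths in the graph.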


Mapping of reads

Task: Map millions of short sequences (25-100 nt) onto a genome (3,000 Mbp) or transcriptome

Mismatches (sequencing errors and SNPs)

Unique / Repetitive matches

Indels (Normal variation, CNVs)

Large rearrangements (translocations)

BLAST and BLAT were not designed for these tasks

Commonly used programs

Program      Approach                             Comments
Bowtie       Burrows-Wheeler Transform (BWT)      Illumina, SOLiD, fast
MAQ          Spaced seed indexing                 Illumina, (SOLiD), SNPs
BWA          Burrows-Wheeler Transform (BWT)      Illumina, fast
Novoalign    Needleman-Wunsch alignment           Illumina, indels, not free
ZOOM         Designed spaced seeds                Illumina, fast, indels, not free
SOAP2        Burrows-Wheeler Transform (BWT)

Mappers from Illumina (ELAND) and SOLiD (mapreads)
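Several of the mappers listed rely on the Burrows-Wheeler Transform. The sketch below shows the transform itself in its simplest form, by sorting all rotations of the text. It is a teaching example only: real mappers build a compressed index on top of the BWT and never materialise the rotations, and the function names here are not taken from any of those tools.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator character")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("GATTACA")
print(transformed)
print(inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what makes compact, searchable genome indexes possible.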

Figures: MAQ, bowtie

Which one to use?

Fast, well maintained: bowtie

Perhaps the most accurate: novoalign

Paired reads

Picking the right alignment

2 mismatches Exact match

Bowtie reports the “best” alignment it comes across, but this isn’t always the right one. To do a better job, we want paired-end reads.

Mapped Alignments

Formats for storing alignments should include:

genomic coordinates

mismatches, insertion, deletions etc.

quality information


Samtools

Sequence Alignment Map (SAM)

Generic Alignment format

Supports long and short reads

Human readable, flexible and compact

Emerging standard

http://samtools.sourceforge.net/

Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

SAM Example

@HD VN:1.0
@SQ SN:chr20 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 \
    AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 \
    ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2
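An alignment line such as the first read above can be split into its eleven mandatory fields plus optional tags. The sketch below assumes tab-separated input, as in real SAM files (the slide shows spaces), and uses field names from the current SAM specification, which may differ slightly from the names used in 2010. The SamAlignment class and helper functions are illustrative, not part of samtools.

```python
from typing import Dict, List, NamedTuple, Union


class SamAlignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, Union[int, str]]


# Meaning of the lower flag bits.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def parse_sam_line(line: str) -> SamAlignment:
    """Parse one tab-separated SAM alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags: Dict[str, Union[int, str]] = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return SamAlignment(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


def describe_flag(flag: int) -> List[str]:
    """List the names of the flag bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


line = "\t".join([
    "read_28833_29006_6945", "99", "chr20", "28833", "20", "10M1D25M",
    "=", "28993", "195",
    "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG",
    "<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<",
    "NM:i:1", "RG:Z:L1",
])
alignment = parse_sam_line(line)
print(alignment.rname, alignment.pos, alignment.cigar, alignment.tags)
print(describe_flag(alignment.flag))
```

Flag 99 decodes to a properly paired first read whose mate is on the reverse strand, and flag 147 in the second line is the matching second read on the reverse strand.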

CIGAR Format

M, match/mismatch

I, insertion

D, deletion

S, softclip

...

Ref:   GCATTCAGATGCAGTACGC
Read:    ccTCAG--GCAGTAgtg

Pos: 5

CIGAR: 2S4M2D6M3S
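A CIGAR string is a run-length list of operations, so it can be checked against the alignment it describes. The read drawn above has 15 bases and skips two reference bases (the two dashes), which corresponds to the string 2S4M2D6M3S used in this sketch. The operation letters beyond M, I, D and S come from the SAM specification; the function names are invented for the example.

```python
import re
from typing import List, Tuple

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    tokens = CIGAR_TOKEN.findall(cigar)
    if "".join(length + op for length, op in tokens) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in tokens]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar: str) -> int:
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


cigar = "2S4M2D6M3S"
print(parse_cigar(cigar))
print(reference_span(cigar), read_length(cigar))
```

With Pos 5 and a reference span of 12, the aligned part of the read covers reference positions 5 to 16; the soft-clipped bases at both ends are kept in the read but not counted on the reference.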


BAM

Binary representation of SAM

Compressed using BGZF library

Reduces file storage to ~25%

Samtools

Library and software package (C, Java)

Creating, sorting, indexing SAM & BAM

Visualizing alignments on the command line

SNP calling

Short indel detection

BED format

chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

http://genome.ucsc.edu/FAQ/FAQformat.html

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000
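The zero-based start and excluded end described above are a common source of off-by-one errors. The sketch below, with invented names BedInterval and parse_bed3, shows how the length and the equivalent 1-based inclusive coordinates follow from the three required fields.

```python
from typing import NamedTuple, Tuple


class BedInterval(NamedTuple):
    chrom: str
    chrom_start: int  # 0-based, included
    chrom_end: int    # 0-based, not included

    @property
    def length(self) -> int:
        """Number of bases covered; no +1 is needed for half-open intervals."""
        return self.chrom_end - self.chrom_start

    def to_one_based(self) -> Tuple[int, int]:
        """Same interval as 1-based, inclusive (first, last) coordinates."""
        return self.chrom_start + 1, self.chrom_end


def parse_bed3(line: str) -> BedInterval:
    """Read the first three whitespace-separated fields of a BED line."""
    chrom, start, end = line.split()[:3]
    return BedInterval(chrom, int(start), int(end))


interval = parse_bed3("chr22 1000 5000")
print(interval.length, interval.to_one_based())

first_hundred = BedInterval("chr1", 0, 100)
print(first_hundred.length, first_hundred.to_one_based())
```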

BED continued

strand - Defines the strand - either '+' or '-'.

thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).

thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).

itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.

blockCount - The number of blocks (exons) in the BED line.

blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.

blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
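Because blockStarts are relative to chromStart, the genomic coordinates of each block have to be computed. The sketch below does this for the cloneB line above, assuming a full 12-field BED line; bed_blocks is a name chosen for the example.

```python
from typing import List, Tuple


def bed_blocks(line: str) -> List[Tuple[int, int]]:
    """Return the (start, end) genomic coordinates of each block in a BED12 line."""
    fields = line.split()
    if len(fields) < 12:
        raise ValueError("block fields need a 12-column BED line")
    chrom_start = int(fields[1])
    block_count = int(fields[9])
    # Trailing commas are allowed, so empty items are dropped.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockSizes and blockStarts must match blockCount")
    return [
        (chrom_start + start, chrom_start + start + size)
        for start, size in zip(starts, sizes)
    ]


blocks = bed_blocks("chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601")
print(blocks)
```

The last block ends exactly at chromEnd (6000), which is a useful sanity check when writing such lines by hand.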

WIG format

Wiggle format (WIG) allows the display of continuous-valued data in a track format


WIG File Example

Variable step:

variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5

is equivalent to:

variableStep chrom=chr2 span=5
300701 12.5

Fixed step:

fixedStep chrom=chr3 start=400601 step=100
11
22
33
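The equivalence between the two variableStep forms can be verified by expanding each declaration into one value per base. The sketch below is a simplified reader, not a complete WIG parser: it handles only the variableStep and fixedStep lines shown here, skips track and comment lines, and treats positions as 1-based, as in the UCSC description of the format. The name expand_wiggle is made up for the example.

```python
from typing import Iterable, Iterator, Tuple


def expand_wiggle(lines: Iterable[str]) -> Iterator[Tuple[str, int, float]]:
    """Yield (chrom, position, value) for every base covered by the data lines."""
    mode = None
    chrom = ""
    span = 1
    next_start = 0
    step = 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "#")):
            continue
        if line.startswith(("variableStep", "fixedStep")):
            parts = line.split()
            mode = parts[0]
            params = dict(item.split("=", 1) for item in parts[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                next_start = int(params["start"])
                step = int(params.get("step", 1))
            continue
        if mode == "variableStep":
            position_text, value_text = line.split()
            first, value = int(position_text), float(value_text)
        elif mode == "fixedStep":
            first, value = next_start, float(line)
            next_start += step
        else:
            raise ValueError("data line found before a declaration line")
        for position in range(first, first + span):
            yield chrom, position, value


per_base = list(expand_wiggle([
    "variableStep chrom=chr2",
    "300701 12.5", "300702 12.5", "300703 12.5", "300704 12.5", "300705 12.5",
]))
with_span = list(expand_wiggle(["variableStep chrom=chr2 span=5", "300701 12.5"]))
fixed = list(expand_wiggle(["fixedStep chrom=chr3 start=400601 step=100", "11", "22", "33"]))
print(per_base == with_span)
print(fixed)
```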


File Format Information

http://genome.ucsc.edu/FAQ/FAQformat.html


Galaxy

Enabling genome-wide tools for biologists without programming:

- FastQ conversions

- Mapping

- Samtools

http://galaxy.psu.edu/

Visualization

Integrative Genomics Viewer, IGV (Broad Inst.)

Custom tracks at UCSC Genome Browser

UCSC Genome Browser

Recently introduced new formats for efficient viewing of large data sets:

- BedGraph

- BigWig

Add as custom tracks

Integrative Genomics Viewer (IGV)

Imports many of the formats mentioned (SAM, BAM, BED etc.)

Excellent for visualization of RNA-sequencing or ChIP-sequencing data

Can also download and visualize prepared data from their server

Examples


Data Repositories

Short Read Archive (fastq)
http://www.ncbi.nlm.nih.gov/sra

Gene Expression Omnibus (bed, wig, fastq)
http://www.ncbi.nlm.nih.gov/geo/

Processed Data on author websites (bed, wig)

SEQAnswers, an active forum for discussions on next-generation sequencing methods and bioinformatics

http://seqanswers.com/

Summary

Data: SRA or GEO

Process data: Galaxy

Visualization: IGV or UCSC GB
