coverage sequence formats - karolinska...

Bioinformatics in next generation sequencing projects

Rickard Sandberg
Assistant Professor
Department of Cell and Molecular Biology
Karolinska Institutet

March 2010

Tuesday, March 9, 2010

Overview of Analyses Procedure

Raw Image (TB)
Sequence format (Gb)
Mapped reads or assembled contigs (Gb)
ChIP-Seq peaks, SNP, Metagenomic, Expression levels

Primary Analyses:
1. Image analysis, Basecalling
2. Mapping (Assembly)
3. Application-specific analyses

Visualization


Coverage

File formats

Methods

Available programs

Available data


Sequence Formats

Fasta file:
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC

Fastq file:
@SRR001666.1
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

SOLiD, csfasta file:
>1_39_146_F3
T22100200202311030112002022222002021
>1_39_194_F3
T11022322003020303320012223122202221

SOLiD, QV file:
>1_39_146_F3
14 6 21 27 5 18 6 15 22 27 18 17 14 18 26 15 24 19 18 18 8 20 17 12 20 6 14 13 23 6 11 12 7 13 4
>1_39_194_F3
26 27 16 27 23 22 23 25 22 10 5 21 4 17 20 26 26 17 25 27 23 25 14 24 26 4 4 4 4 4 4 4 4 4 14


Phred Quality Score, Q

Each base call has an estimate of the probability of being wrong (error probability, p)

Q = -10 * log10(p)

Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %
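The relation Q = -10 * log10(p) can be checked numerically. The following is a minimal sketch; the function names are illustrative and not part of any tool mentioned in these slides.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability p for a Phred score Q, from p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score Q = -10 * log10(p) for an error probability p."""
    return -10 * math.log10(p)


# Reproduce the rows of the table: score, error probability, accuracy.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: p = {p:g}, accuracy = {100 * (1 - p):g} %")
```

The two functions are inverses of each other, so converting a score to a probability and back should return the original score up to floating-point rounding.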


FastQ encodings

• Sanger FastQ: Phred score from 0-93 using the ASCII characters 33-126

• Solexa/Illumina (1.3+ pipeline): Phred score from 0-62 using the ASCII characters 64-126

• Solexa (older pipelines): Solexa score from -5 to 62 using the ASCII characters 59-126

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126

S - Sanger        Phred+33,  41 values (0, 40)
I - Illumina 1.3  Phred+64,  41 values (0, 40)
X - Solexa        Solexa+64, 68 values (-5, 62)
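Decoding a quality string amounts to subtracting the encoding offset from each character's ASCII code. The sketch below assumes the offsets listed above (33 for Sanger, 64 for Illumina 1.3 and Solexa). The Solexa-to-Phred conversion is not given on the slide; it is the commonly cited formula Q_phred = 10 * log10(10^(Q_solexa/10) + 1) and is included here as an assumption.

```python
import math

SANGER_OFFSET = 33    # Sanger FastQ: Phred+33
ILLUMINA_OFFSET = 64  # Illumina 1.3 (Phred+64) and older Solexa (Solexa+64)


def decode_quality(qual: str, offset: int) -> list:
    """Turn a FastQ quality string into a list of integer scores."""
    return [ord(char) - offset for char in qual]


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to a Phred score (assumed standard formula)."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# 'I' is ASCII 73: score 40 when read as Sanger, score 9 when read as +64.
print(decode_quality("I9", SANGER_OFFSET))
print(decode_quality("I", ILLUMINA_OFFSET))
print(round(solexa_to_phred(-5), 2))
```

Reading a file with the wrong offset shifts every score by 31, which is why the encoding has to be known (or guessed from the range of characters present) before any quality filtering.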


FastQ example

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
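A FastQ record like the one above has four parts: an "@" header, the sequence, a "+" separator line, and one quality character per base. The following is a minimal parsing sketch that assumes each record occupies exactly four lines (no line wrapping); `parse_fastq` is an illustrative name, not a function from any package mentioned here.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from four-line FastQ records."""
    clean = [line.rstrip("\n") for line in lines if line.strip()]
    if len(clean) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(clean), 4):
        header, seq, plus, qual = clean[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


# The record shown on the slide (30 'I' characters followed by '9IG9IC').
example = [
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC",
    "+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    "I" * 30 + "9IG9IC",
]

for name, seq, qual in parse_fastq(example):
    print(name.split()[0], len(seq), qual[-6:])
```

This version reads all lines into memory for simplicity; for files with millions of reads, a streaming reader would be the more practical choice.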




Assembly

Velvet and SOAPdenovo

de novo genomic assemblers specially designed for short-read sequencing technologies

Nature 2009

de Bruijn graph
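To illustrate the idea of a de Bruijn graph: every k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases, so overlapping reads share nodes. This is only a toy sketch under that definition. It is not how Velvet or SOAPdenovo are implemented, and it ignores sequencing errors, reverse complements, and graph simplification.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


# Two hypothetical reads that overlap by four bases.
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Following the edges from "AC" spells out ACGTCA, the sequence both reads were drawn from, which is the intuition behind assembling contigs as paths through the graph.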


Mapping of reads

Task: Map millions of short sequences (25-100 nt) onto a genome (3 000 Mbp) or transcriptome

Mismatches (sequencing errors and SNPs)

Unique / Repetitive matches

Indels (Normal variation, CNVs)

Large rearrangements (translocations)

BLAST and BLAT were not designed for these tasks


Commonly used programs

Program     Approach                          Comments
Bowtie      Burrows-Wheeler Transform (BWT)   Illumina, SOLiD, fast
MAQ         Spaced seed indexing              Illumina, (SOLiD), SNPs
BWA         Burrows-Wheeler Transform (BWT)   Illumina, fast
Novoalign   Needleman-Wunsch alignment        Illumina, indels, not free
ZOOM        Designed spaced seeds             Illumina, fast, indels, not free
SOAP2       Burrows-Wheeler Transform (BWT)

Mappers from Illumina (ELAND) and SOLiD (mapreads)
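Several of the programs above are built on the Burrows-Wheeler transform. As a rough illustration of the transform itself, the sketch below sorts all rotations of the text and keeps the last column, then inverts it again. This naive version is for intuition only; Bowtie, BWA and SOAP2 use far more efficient index structures, and the function names here are illustrative.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("GATTACA")
print(transformed)
print(inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what makes a compact, searchable index of a whole genome feasible.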


MAQ bowtie


Which one to use?

Fast, well maintained: Bowtie

Perhaps the most accurate: Novoalign


Paired reads

Picking the right alignment

2 mismatches Exact match

Bowtie reports the “best” alignment it comes across, but this isn’t always the right one. To do a better job, we want paired-end reads.


Mapped Alignments

Formats for storing alignments should include:

genomic coordinates

mismatches, insertions, deletions, etc.

quality information


Samtools

Sequence Alignment Map (SAM)

Generic Alignment format

Supports long and short reads

Human readable, flexible and compact

Emerging standard

http://samtools.sourceforge.net/

Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]


SAM Example

@HD VN:1.0
@SQ SN:chr20 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 \
    AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 \
    ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2
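Each alignment line in a SAM file has eleven mandatory tab-separated columns followed by optional tags, and the second column (FLAG) is a bit field. The sketch below splits one alignment line and decodes the most common flag bits. The column names follow the current SAM specification (RNEXT, PNEXT, TLEN); older versions of the specification used different names for the mate columns. The helper names are illustrative and are not part of samtools.

```python
SAM_FIELDS = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")

SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag: int) -> list:
    """List the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def parse_sam_line(line: str) -> dict:
    """Split one tab-delimited alignment line into named fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs 11 mandatory columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = columns[11:]
    return record


# First read of the example, rebuilt with tab separators.
line = "\t".join([
    "read_28833_29006_6945", "99", "chr20", "28833", "20", "10M1D25M",
    "=", "28993", "195", "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG",
    "<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<", "NM:i:1", "RG:Z:L1",
])
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
```

With these bit definitions, the flags 99 and 147 in the example describe the two mates of one properly paired fragment, with the second mate aligned to the reverse strand.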


CIGAR Format

M, match/mismatch

I, insertion

D, deletion

S, softclip

...

Ref:  GCATTCAGATGCAGTACGC
Read:   ccTCAG--GCAGTAgtg

Pos: 5

CIGAR: 2S4M2D6M3S
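A CIGAR string is a run-length list of operations, so the read length and the number of reference bases covered can be computed directly from it. The sketch below is a hedged illustration that uses the CIGAR 10M1D25M from the SAM example; the sets of read-consuming and reference-consuming operations follow the SAM specification, and the function names are illustrative.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")       # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference bases


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_TOKEN.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def read_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


cigar = "10M1D25M"
start = 28833  # 1-based leftmost position from the SAM example
end = start + reference_span(cigar) - 1
print(parse_cigar(cigar), read_length(cigar), start, end)
```

A deletion adds to the reference span but not to the read length, while insertions and soft clips do the opposite; this is why the end coordinate of an alignment cannot be derived from the read length alone.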


BAM

Binary representation of SAM

Compressed using BGZF library

Reduces file storage to ~25%


Samtools

Library and software package (C, Java)

Creating, sorting, indexing SAM & BAM

Visualizing alignments on the command line

SNP calling

Short indel detection


BED format

chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

http://genome.ucsc.edu/FAQ/FAQformat.html

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000
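BED coordinates are 0-based and half-open: chromStart is the first base of the feature counted from 0, and chromEnd is one past the last base. A minimal sketch of what that means for a three-column line such as the one above follows; the function names are illustrative.

```python
def parse_bed3(line: str) -> tuple:
    """Return (chrom, chromStart, chromEnd) from the first three BED columns."""
    fields = line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid BED interval: {line!r}")
    return chrom, start, end


def feature_length(start: int, end: int) -> int:
    """Length of a half-open interval: no +1 is needed."""
    return end - start


def to_one_based_inclusive(start: int, end: int) -> tuple:
    """Convert BED coordinates to 1-based, inclusive coordinates (as in SAM)."""
    return start + 1, end


chrom, start, end = parse_bed3("chr22 1000 5000")
print(chrom, feature_length(start, end), to_one_based_inclusive(start, end))
```

Mixing up the two conventions shifts every feature by one base, so it is worth converting explicitly whenever BED intervals are compared with 1-based positions such as those in SAM or WIG files.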


BED continued

strand - Defines the strand - either '+' or '-'.

thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).

thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).

itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.

blockCount - The number of blocks (exons) in the BED line.

blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.

blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
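Because blockStarts are relative to chromStart, the genomic coordinates of each block have to be computed by adding chromStart back in. The sketch below does this for the 12-column cloneB line above; `bed12_blocks` is an illustrative name, and the trailing commas in the block lists are tolerated.

```python
def bed12_blocks(line: str) -> list:
    """Return genomic (start, end) pairs for the blocks of a 12-column BED line."""
    fields = line.split()
    if len(fields) < 12:
        raise ValueError("expected at least 12 BED columns")
    chrom_start = int(fields[1])
    block_count = int(fields[9])
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockSizes/blockStarts do not match blockCount")
    # blockStarts are offsets from chromStart; intervals stay half-open.
    return [(chrom_start + offset, chrom_start + offset + size)
            for offset, size in zip(starts, sizes)]


line = "chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601"
for block_start, block_end in bed12_blocks(line):
    print("chr22", block_start, block_end)
```

For this line the last block ends exactly at chromEnd (6000), which is a quick consistency check for any BED line that describes a feature in blocks.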


WIG format

Wiggle format (WIG) allows the display of continuous-valued data in a track format


WIG File Example

Variable step:

variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5

is equivalent to:

variableStep chrom=chr2 span=5
300701 12.5

Fixed step:

fixedStep chrom=chr3 start=400601 step=100
11
22
33
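The equivalence between the two variableStep forms can be made concrete by expanding each WIG section into one (chromosome, position, value) triple per covered base. This is a minimal sketch that handles only the declaration parameters used in the examples (chrom, span, start, step); `expand_wig` is an illustrative name.

```python
def expand_wig(lines):
    """Yield (chrom, position, value) for every base covered by WIG data lines."""
    mode = chrom = None
    span, step, position = 1, 1, None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        if line.startswith(("variableStep", "fixedStep")):
            parts = line.split()
            mode = parts[0]
            params = dict(part.split("=", 1) for part in parts[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params.get("step", 1))
            continue
        if mode == "variableStep":
            start_text, value_text = line.split()
            start, value = int(start_text), float(value_text)
        elif mode == "fixedStep":
            start, value = position, float(line)
            position += step
        else:
            raise ValueError("data line found before a declaration line")
        for base in range(start, start + span):
            yield chrom, base, value


per_base = ["variableStep chrom=chr2"] + [f"{p} 12.5" for p in range(300701, 300706)]
with_span = ["variableStep chrom=chr2 span=5", "300701 12.5"]
fixed = ["fixedStep chrom=chr3 start=400601 step=100", "11", "22", "33"]

print(list(expand_wig(per_base)) == list(expand_wig(with_span)))
print(list(expand_wig(fixed)))
```

Note that WIG positions are 1-based, unlike the 0-based start coordinates used in BED.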


File Format Information

http://genome.ucsc.edu/FAQ/FAQformat.html


Galaxy

Enabling genome-wide tools for biologists without programming:

- FastQ conversions

- Mapping

- Samtools

http://galaxy.psu.edu/

Visualization

Integrative Genomics Viewer (IGV, Broad Inst.)

Custom tracks at UCSC Genome Browser


UCSC Genome Browser

Recently introduced new formats for efficient viewing of large data sets:

- BedGraph

- BigWig

Add as custom tracks


Integrative Genomics Viewer

Imports many of the formats mentioned (SAM, BAM, BED, etc.)

Excellent for visualization of RNA-Sequencing or ChIP-sequencing data

Can also download and visualize already prepared data from their server


Examples


Data Repositories

Short Read Archive (fastq)
http://www.ncbi.nlm.nih.gov/sra

Gene Expression Omnibus (bed, wig, fastq)
http://www.ncbi.nlm.nih.gov/geo/

Processed Data on author websites (bed, wig)


SEQAnswers, an active forum for discussions on next-generation sequencing methods and bioinformatics

http://seqanswers.com/

Summary

Data: SRA or GEO

Process data: Galaxy

Visualization: IGV or UCSC GB

