Bioinformatics in next generation sequencing projects
Rickard Sandberg
Assistant Professor
Department of Cell and Molecular Biology
Karolinska Institutet
March 2010
Tuesday, March 9, 2010
Overview of Analyses Procedure
Primary Analyses:
1. Image analysis, basecalling: Raw Image (TB) -> Sequence format (Gb)
2. Mapping (Assembly): Sequence format (Gb) -> Mapped reads or assembled contigs (Gb)
3. Application-specific analyses: ChIP-Seq peaks, SNP, Metagenomic, Expression levels
Visualization
Coverage
File formats
Methods
Available programs
Available data
Sequence Formats
Fasta file:
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
Fastq file:
@SRR001666.1
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
SOLiD, csfasta file:
>1_39_146_F3
T22100200202311030112002022222002021
>1_39_194_F3
T11022322003020303320012223122202221
SOLiD, QV file:
>1_39_146_F3
14 6 21 27 5 18 6 15 22 27 18 17 14 18 26 15 24 19 18 18 8 20 17 12 20 6 14 13 23 6 11 12 7 13 4
>1_39_194_F3
26 27 16 27 23 22 23 25 22 10 5 21 4 17 20 26 26 17 25 27 23 25 14 24 26 4 4 4 4 4 4 4 4 4 14
Phred Quality Score, Q
Each base call has an estimate of the probability of being wrong (error probability, p)
Q = -10 * log10(p)
Phred Quality Score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90 %
20                     1 in 100                              99 %
30                     1 in 1000                             99.9 %
40                     1 in 10000                            99.99 %
50                     1 in 100000                           99.999 %
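The relation Q = -10 * log10(p) and its inverse can be checked with a few lines of Python. This is a minimal sketch; the function names are chosen here for illustration and do not come from any sequencing toolkit.

```python
import math


def phred_from_error_prob(p):
    """Phred quality score for an error probability p: Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)


def error_prob_from_phred(q):
    """Inverse relation: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# Recompute the rows of the table from the score alone.
for q in (10, 20, 30, 40, 50):
    p = error_prob_from_phred(q)
    print(f"Q{q}: 1 in {round(1 / p)}, accuracy {100 * (1 - p):.3f} %")
```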
FastQ encodings
• Sanger FastQ: Phred score from 0-93 using the ASCII characters 33 – 126
• Solexa (pipeline 1.3+): Phred score from 0-62 using the ASCII characters 64 – 126
• Solexa (older pipelines): Solexa score from -5 to 62 using the ASCII characters 59 – 126
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
S - Sanger        Phred+33,  41 values (0, 40)
I - Illumina 1.3  Phred+64,  41 values (0, 40)
X - Solexa        Solexa+64, 68 values (-5, 62)
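Decoding a quality string amounts to subtracting the encoding's ASCII offset from each character. The sketch below assumes the three offsets listed in the legend. The dictionary keys and function names are made up for illustration. The Solexa-to-Phred conversion is the standard relation between the two score definitions and does not appear on the slide.

```python
import math

# ASCII offsets from the legend above; the key names are illustrative only.
OFFSETS = {"sanger": 33, "illumina1.3": 64, "solexa": 64}


def decode_quality(qual, encoding="sanger"):
    """Turn a quality string into a list of integer scores."""
    offset = OFFSETS[encoding]
    return [ord(char) - offset for char in qual]


def illumina13_to_sanger(qual):
    """Re-encode Phred+64 characters as Phred+33 characters."""
    return "".join(chr(ord(char) - 64 + 33) for char in qual)


def solexa_to_phred(q_solexa):
    """Convert a Solexa score to a Phred score.

    Solexa scores are defined on the odds p / (1 - p) rather than on p,
    which gives Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1).
    """
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


print(decode_quality("I9GC", "sanger"))
print(illumina13_to_sanger("h"))
```

The two score scales agree closely at high quality and diverge only for low scores, which is why the distinction matters mostly for poor base calls.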
FastQ example
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
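A FASTQ record is four lines: an @ header, the sequence, a + separator line and the quality string. A minimal reader might look like the sketch below. It assumes unwrapped four-line records, which is what the example shows, and the name read_fastq is an illustrative choice rather than a library function.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual


# The record from the slide, rebuilt as a string.
HEADER = "SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36"
EXAMPLE = (
    "@" + HEADER + "\n"
    "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC\n"
    "+" + HEADER + "\n"
    + "I" * 30 + "9IG9IC\n"
)

for name, seq, qual in read_fastq(io.StringIO(EXAMPLE)):
    print(name.split()[0], len(seq), qual[-6:])
```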
Assembly
Velvet and SOAPdenovo
de novo genomic assemblers specially designed for short-read sequencing technologies
Nature 2009
de Bruijn graph
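As a rough illustration of the de Bruijn graph idea (not the actual Velvet or SOAPdenovo implementation), each read is cut into overlapping k-mers, and every k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name and the toy reads are invented for this sketch.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix; repeated edges
            # are kept, so edge multiplicity reflects how often a k-mer
            # was seen across the reads.
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads from the sequence ACGTAC.
graph = de_bruijn_graph(["ACGTA", "CGTAC"], k=3)
for node in sorted(graph):
    print(node, "->", graph[node])
```

An assembler then looks for paths through this graph; real tools add error correction, graph simplification and paired-read information on top.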
Mapping of reads
Task: Map millions of short sequences (25-100 nt) onto a genome (3,000 Mbp) or transcriptome
Mismatches (sequencing errors and SNPs)
Unique / Repetitive matches
Indels (Normal variation, CNVs)
Large rearrangements (translocations)
BLAST and BLAT were not designed for these tasks
Commonly used programs
Program      Approach                           Comments
Bowtie       Burrows-Wheeler Transform (BWT)    Illumina, SOLiD, fast
MAQ          Spaced seed indexing               Illumina, (SOLiD), SNPs
BWA          Burrows-Wheeler Transform (BWT)    Illumina, fast
Novoalign    Needleman-Wunsch alignment         Illumina, indels, not free
ZOOM         Designed spaced seeds              Illumina, fast, indels, not free
SOAP2        Burrows-Wheeler Transform (BWT)
Mappers from Illumina (ELAND) and SOLiD (mapreads)
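For readers unfamiliar with the Burrows-Wheeler Transform named in the table, the sketch below computes it naively by sorting all rotations of a string. This only illustrates the transform itself. Mappers such as Bowtie and BWA build a compressed index on the transformed genome and do not sort rotations this way. The function names are illustrative.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(char + row for char, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))
print(inverse_bwt(bwt("GATTACA")))
```

The transform is reversible and groups similar contexts together, which is what makes a compact, searchable index of a whole genome possible.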
[Figure: MAQ and bowtie]
Which one to use?
Fast, well maintained: bowtie
Perhaps the most accurate: novoalign
Paired reads
Picking the right alignment
2 mismatches Exact match
Bowtie reports the “best” alignment it comes across, but this isn’t always the right one. To do a better job, we want paired-end reads.
Mapped Alignments
Formats for storing alignments should include:
genomic coordinates
mismatches, insertion, deletions etc.
quality information
Samtools
Sequence Alignment Map (SAM)
Generic Alignment format
Supports long and short reads
Human readable, flexible and compact
Emerging standard
http://samtools.sourceforge.net/
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
SAM Example
@HD VN:1.0
@SQ SN:chr20 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 \
    AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1
read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 \
    ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< MF:i:18 RG:Z:L2
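To show how an alignment line breaks down, the sketch below splits the first read of the example into its eleven mandatory fields plus optional tags, and decodes the FLAG value (99) into its bits. Real SAM files are tab-separated; splitting on any whitespace is an assumption made here because the slide shows spaces. The dictionary keys and function names are illustrative, and the flag meanings follow the SAM specification.

```python
# Bit meanings of the FLAG field, as defined in the SAM specification.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
]


def decode_flag(flag):
    """List the names of the bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


def parse_sam_line(line):
    """Split one alignment line into named fields (whitespace-separated here)."""
    fields = line.split()
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    tags = {}
    for item in fields[11:]:
        tag, kind, value = item.split(":", 2)
        tags[tag] = int(value) if kind == "i" else value
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "mate_rname": fields[6],
        "mate_pos": int(fields[7]),
        "insert_size": int(fields[8]),
        "seq": fields[9],
        "qual": fields[10],
        "tags": tags,
    }


LINE = (
    "read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 "
    "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG "
    "<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< NM:i:1 RG:Z:L1"
)
record = parse_sam_line(LINE)
print(record["rname"], record["pos"], record["cigar"], decode_flag(record["flag"]))
```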
CIGAR Format
M, match/mismatch
I, insertion
D, deletion
S, softclip
...
Ref:   GCATTCAGATGCAGTACGC
Read:    ccTCAG--GCAGTAgtg
Pos: 5
CIGAR: 2S4M2D6M3S
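A CIGAR string is a run-length list of operations, so it can be parsed with one regular expression. The sketch below uses the CIGAR 10M1D25M of the read mapped at position 28833 in the SAM example. Operations other than M, I, D and S (N, H, P, = and X) come from the SAM specification and are not listed on the slide. The function names are illustrative.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_RE.findall(cigar)]
    # Rebuilding the string catches characters the pattern skipped.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError("invalid CIGAR string: " + cigar)
    return ops


def reference_span(ops):
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(length for length, op in ops if op in "MDN=X")


def query_length(ops):
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(length for length, op in ops if op in "MIS=X")


ops = parse_cigar("10M1D25M")
pos = 28833  # 1-based leftmost position from the SAM example
end = pos + reference_span(ops) - 1
print(ops, query_length(ops), reference_span(ops), end)
```

The deletion consumes one reference base but no read base, so a 35-base read covers 36 reference positions.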
BAM
Binary representation of SAM
Compressed using BGZF library
Reduces file storage to ~25%
Samtools
Library and software package (C, Java)
Creating, sorting, indexing SAM & BAM
Visualizing alignments on the command line
SNP calling
Short indel detection
BED format
chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g.
scaffold10671).
chromStart - The starting position of the feature in the chromosome or scaffold. The first
base in a chromosome is numbered 0.
chromEnd - The ending position of the feature in the chromosome or scaffold. The
chromEnd base is not included in the display of the feature.
For example, the first 100 bases of a chromosome are defined as chromStart=0,
chromEnd=100, and span the bases numbered 0-99.
http://genome.ucsc.edu/FAQ/FAQformat.html
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000
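The three required BED columns use zero-based, half-open coordinates: chromStart is included and chromEnd is not. The following sketch reads the chr22 line above and shows the two consequences that most often cause off-by-one errors. The function names are illustrative.

```python
def parse_bed3(line):
    """Read the three required BED columns: chrom, chromStart, chromEnd."""
    chrom, start, end = line.split()[:3]
    return chrom, int(start), int(end)


def feature_length(start, end):
    """With half-open coordinates the length is simply end - start."""
    return end - start


def to_one_based_inclusive(start, end):
    """Convert BED coordinates to the 1-based, inclusive convention."""
    return start + 1, end


chrom, start, end = parse_bed3("chr22 1000 5000")
print(chrom, feature_length(start, end), to_one_based_inclusive(start, end))
```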
BED continued
strand - Defines the strand - either '+' or '-'.
thickStart - The starting position at which the feature is drawn thickly (for example, the start
codon in gene displays).
thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon
in gene displays).
itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is
set to "On", this RGB value will determine the display color of the data contained in this BED
line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this
attribute to avoid overwhelming the color resources of the Genome Browser and your Internet
browser.
blockCount - The number of blocks (exons) in the BED line.
blockSizes - A comma-separated list of the block sizes. The number of items in this list should
correspond to blockCount.
blockStarts - A comma-separated list of block starts. All of the blockStart positions should be
calculated relative to chromStart. The number of items in this list should correspond to
blockCount.
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
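The block fields of the cloneB line can be turned into genomic intervals by adding each blockStart to chromStart. A minimal sketch, assuming a full 12-column line separated by whitespace; the function name is illustrative.

```python
def bed12_blocks(line):
    """Return (chrom, start, end) for every block of a 12-column BED line."""
    fields = line.split()
    chrom, chrom_start = fields[0], int(fields[1])
    block_count = int(fields[9])
    # The lists may end with a trailing comma, so empty items are dropped.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # blockStarts are relative to chromStart; the intervals stay half-open.
    return [
        (chrom, chrom_start + start, chrom_start + start + size)
        for start, size in zip(starts, sizes)
    ]


LINE = "chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601"
for block in bed12_blocks(LINE):
    print(block)
```

The last block ends exactly at chromEnd (6000), which is a useful sanity check when writing such lines by hand.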
WIG format
Wiggle format (WIG) allows the display of continuous-valued data in a track format
WIG File Example
Variable step:
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
is equivalent to:
variableStep chrom=chr2 span=5
300701 12.5
Fixed step:
fixedStep chrom=chr3 start=400601 step=100
11
22
33
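The equivalence of the two variableStep forms can be verified by expanding each into one value per base. The parser below is a simplified sketch that handles only the declaration and data lines shown here. It keeps the 1-based positions used by WIG, and the function name is illustrative.

```python
def parse_wig(lines):
    """Expand variableStep/fixedStep data into {(chrom, position): value}."""
    values = {}
    mode = chrom = next_pos = None
    span = step = 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        parts = line.split()
        if parts[0] in ("variableStep", "fixedStep"):
            mode = parts[0]
            options = dict(item.split("=", 1) for item in parts[1:])
            chrom = options["chrom"]
            span = int(options.get("span", 1))
            if mode == "fixedStep":
                next_pos = int(options["start"])
                step = int(options.get("step", 1))
            continue
        if mode == "variableStep":
            start, value = int(parts[0]), float(parts[1])
        elif mode == "fixedStep":
            start, value = next_pos, float(parts[0])
            next_pos += step
        else:
            raise ValueError("data line before a declaration line")
        # Each data line covers `span` consecutive bases.
        for pos in range(start, start + span):
            values[(chrom, pos)] = value
    return values


long_form = ["variableStep chrom=chr2"] + [f"{p} 12.5" for p in range(300701, 300706)]
short_form = ["variableStep chrom=chr2 span=5", "300701 12.5"]
fixed = ["fixedStep chrom=chr3 start=400601 step=100", "11", "22", "33"]

print(parse_wig(long_form) == parse_wig(short_form))
print(sorted(parse_wig(fixed).items()))
```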
File Format Information
http://genome.ucsc.edu/FAQ/FAQformat.html
Galaxy
Enabling genome-wide tools for biologists without programming:
- FastQ conversions
- Mapping
- Samtools
http://galaxy.psu.edu/
Visualization
Integrative Genomics Viewer, IGV (Broad Inst.)
Custom tracks at UCSC Genome Browser
UCSC Genome Browser
Recently introduced new formats for efficient viewing of large data sets:
- BedGraph
- BigWig
Add as custom tracks
Integrative Genomics Viewer (IGV)
Imports many mentioned formats (SAM, BAM, BED etc)
Excellent for visualization of RNA-Sequencing or ChIP-sequencing data
Can also download and visualize already prepared data from their server
Examples
Data Repositories
Short Read Archive (fastq)
http://www.ncbi.nlm.nih.gov/sra
Gene Expression Omnibus (bed, wig, fastq)
http://www.ncbi.nlm.nih.gov/geo/
Processed Data on author websites (bed, wig)
SEQAnswers, an active forum for discussions on next-generation sequencing methods and bioinformatics
http://seqanswers.com/
Summary
Data: SRA or GEO
Process data: Galaxy
Visualization: IGV or UCSC GB