Bioinformatics in next generation sequencing projects
Rickard Sandberg
Assistant Professor
Department of Cell and Molecular Biology
Karolinska Institutet
May 2013
Thursday, May 16, 13
Standard sequence library generation
Illumina Sequencing Technology
Illumina (Solexa) Sequencing
Illumina paired-end and index-read sequencing
Once sequenced, the problem becomes computational
Computational analysis is the bottleneck
• Rapid improvement in sequencing
• Still a need for customized analysis for most projects
Overview of computational analyses
• Image analysis, base calling
• Primary analyses: mapping (assembly)
• Data type-specific analyses (e.g. peak calling, calculating expression)
• Custom project-specific analyses
[Figure labels: ChIP-Seq peak calling; RNA-Seq expression levels; genome sequence; assembled contig]
Preliminary Analyses
Raw image (TB)
-> Platform-specific analysis using the vendor's programs (Real Time Analysis)
-> Sequences and quality scores, text file (GB)
Sequenced reads
Fasta file:
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
Fastq file:
@HWI-EAS269:1:120:1786:18#0/1
GAACTCTGCCTTTTTCAGTGATGAGGAAAGGAGTTCTCTCTGGTCCCCAG
+HWI-EAS269:1:120:1786:18#0/1
aaab^_U_aa[U[_Z]a`WU_^X`GT^_\TM^^\___\Z\YQVVXUBBBB
(Line 1: read identifier. Line 4: quality scores.)
SOLiD, csfasta file:
>1_39_146_F3
T22100200202311030112002022222002021
>1_39_194_F3
T11022322003020303320012223122202221
SOLiD, QV file:
>1_39_146_F3
14 6 21 27 5 18 6 15 22 27 18 17 14 18 26 15 24 19 18 18 8 20 17 12 20 6 14 13 23 6 11 12 7 13 4
>1_39_194_F3
26 27 16 27 23 22 23 25 22 10 5 21 4 17 20 26 26 17 25 27 23 25 14 24 26 4 4 4 4 4 4 4 4 4 14
Phred Quality Score, Q
Each base call has an estimate of the probability of being wrong (error probability, p)
Q = -10 * log10(p)
Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %
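The relation Q = -10 * log10(p) can be checked with a few lines of Python. This is a minimal sketch; the function names `phred_to_error` and `error_to_phred` are illustrative and not part of any sequencing toolkit.

```python
import math

def phred_to_error(q):
    """Error probability p for a Phred score Q, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred score Q for an error probability p."""
    return -10 * math.log10(p)

# Print the error probability and accuracy for the scores in the table.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error(q)
    print(f"Q{q}: p = {p:g}, accuracy = {100 * (1 - p):.3f} %")
```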
FastQ encodings
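The encodings differ in the ASCII offset that is added to each Phred score before it is stored as a character. Below is a minimal sketch of decoding, assuming the two common offsets (33 and 64). Treating the quality string from the Fastq example as offset 64 is an assumption: its lowercase characters would imply implausibly high scores under offset 33. The function name `decode_quality` is illustrative.

```python
def decode_quality(qual, offset):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in qual]

# Quality string from the Fastq example (raw string, so backslashes are kept as-is).
qual = r"aaab^_U_aa[U[_Z]a`WU_^X`GT^_\TM^^\___\Z\YQVVXUBBBB"

scores = decode_quality(qual, offset=64)  # assumed offset, see text above
print(len(scores), min(scores), max(scores))
```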
Fastq quality control (FastQC)
Video tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
Quality scores for each sequence position
Quality scores for each sequence position: A good run
GC for reads
Percent A,C,G,T at each position
Relative enrichment of kmers
Short Read Assembly
Velvet and SOAPdenovo: de novo genomic assemblers specially designed for short-read sequencing technologies
Nature 2009
Two principal approaches for transcriptome reconstruction
Genome-independent transcriptome reconstruction
Grabherr et al. Nature Biotechnology, July 2011
Default k = 25
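The parameter k refers to k-mers, the overlapping substrings of length k found in each read. Below is a minimal sketch of k-mer extraction and counting, assuming reads are plain Python strings. The function names `kmers` and `count_kmers` are illustrative and are not taken from the cited paper or its software.

```python
from collections import Counter

def kmers(read, k=25):
    """Yield every overlapping substring of length k in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def count_kmers(reads, k=25):
    """Count how often each k-mer occurs across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

# A 50 nt read contains 50 - 25 + 1 k-mers when k = 25.
read = "GAACTCTGCCTTTTTCAGTGATGAGGAAAGGAGTTCTCTCTGGTCCCCAG"
print(len(list(kmers(read))))
```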
Finding novel non-annotated genes or transcript variants
Mapping of millions of short reads
Task: map millions of short sequences (25-100 nt) onto a genome (3 000 Mbp) or transcriptome
Challenges:
• Mismatches (sequencing errors and SNPs)
• Unique / repetitive matches
• Indels (normal variation, CNVs)
• Large rearrangements (translocations)
BLAST and BLAT were not designed for these tasks
Mapping of RNA-Seq reads
Garber et al. 2011 Nat Methods
STAR
Mapping of splice junctions
1. Compile sets of junctions
   Exon n GTAAGT-----------AG Exon n+1
2. Map reads towards genome + junction compilation
   Genome chromosome Fasta files + known and putative splice junctions Fasta file
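The junction compilation step can be sketched as follows: a junction sequence joins the end of one exon to the start of the next, so a read that spans the junction aligns to it contiguously. This is a minimal sketch, assuming 0-based, end-exclusive exon coordinates and a hypothetical `flank` parameter. The toy genome is made up for illustration.

```python
def junction_sequence(genome, exon_a, exon_b, flank):
    """Join the last `flank` bases of exon_a to the first `flank` bases of exon_b.

    Exons are (start, end) tuples, 0-based, end exclusive.
    """
    a_start, a_end = exon_a
    b_start, b_end = exon_b
    left = genome[max(a_start, a_end - flank):a_end]
    right = genome[b_start:min(b_end, b_start + flank)]
    return left + right

# Toy genome: exon n, an intron starting with GT and ending with AG, exon n+1.
genome = "AAACCC" + "GTAAGTTTTTAG" + "GGGTTT"
exon_n, exon_n1 = (0, 6), (18, 24)

junction = junction_sequence(genome, exon_n, exon_n1, flank=4)
print(junction)
```

A spliced read such as CCGG does not occur anywhere in the toy genome, but it does occur in the compiled junction sequence.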
TopHat, first method: identifying the transcriptome
1. Identify candidate exons (A, B, C) via genomic mapping
2. Generate possible pairings of exons
3. Align “unmappable” reads to possible junctions
Longer reads
Very long (100 kb+) intron
By segmenting the long reads and mapping the segments independently, we can look harder for junctions we might have missed with shorter reads
Running time independent of intron size
[Figure labels: >HWI-EAS229_75_30DY0AAXX:7:1:0:949; GATGTTCTCAGTGTCC; GATGTAATCAGTGTCC; AACCCTCTCAGTGTCC]
Mapping to transcriptome
[Figure: a gene on the DNA (genome, W and C strands) with exons, introns, 5' UTR and 3' UTR; transcription gives pre-mRNA; RNA processing (splicing, polyadenylation) gives mRNA with a poly(A) tail (AAAAA)]
Microexons and junction coverage
2 or more splice junctions within the same read
[Figure: gene model (exons, introns, 5' UTR, 3' UTR; DNA W and C strands) comparing in-house mapping with TopHat mapping]
Different read lengths will have different problems!
Example of STAR aligned single-cell RNA-Seq data
Mapping speed          308 M reads / hour
% uniquely mapping     60
% multimapping         25
% unmapped             15
281 719 splice junctions:
• 279 356 with GT/AG
• 2 123 with GC/AG
• 215 with AT/AC
Storing mapped Alignments
Formats for storing alignments should include:
genomic coordinates
mismatches, insertion, deletions etc.
quality information
Samtools
Sequence Alignment Map (SAM)
Generic Alignment format
Supports long and short reads
Human readable, flexible and compact
Emerging standard
http://samtools.sourceforge.net/
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
SAM Example
HWI-EAS269:1:114:1242:1582#0  16  chrY  616000  255  22M731N28M  *  0  0  ATTTCGACCATGATCATCGAACCTTCCCCTGGATCCACTTCCACGATCAC  #9;-7+2@4:2=20-14=:><?<;:BB?:4<BB?ABBBBABCBBBBC=BB  NM:i:0  XS:A:-
16: bit field, where 16 means reverse strand
616000: start position
22M731N28M: alignment structure. Here: 22 aligned bases, then a 731-base intron, then 28 aligned bases
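The fields of the record above can be read programmatically. Below is a minimal sketch, assuming the standard tab-separated SAM column order of eleven mandatory fields followed by optional tags. The dictionary keys and the function name `parse_sam_line` are illustrative.

```python
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]

def parse_sam_line(line):
    """Split one SAM alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = fields[11:]
    # Bit 16 of the flag marks a read aligned to the reverse strand.
    record["reverse_strand"] = bool(record["flag"] & 16)
    return record

# The record from the SAM example, joined with tabs.
line = "\t".join([
    "HWI-EAS269:1:114:1242:1582#0", "16", "chrY", "616000", "255",
    "22M731N28M", "*", "0", "0",
    "ATTTCGACCATGATCATCGAACCTTCCCCTGGATCCACTTCCACGATCAC",
    "#9;-7+2@4:2=20-14=:><?<;:BB?:4<BB?ABBBBABCBBBBC=BB",
    "NM:i:0", "XS:A:-",
])

rec = parse_sam_line(line)
print(rec["rname"], rec["pos"], rec["cigar"], rec["reverse_strand"])
```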
CIGAR Format
M, match/mismatch
I, insertion
D, deletion
S, softclip
...
Ref:   GCATTCAGATGCAGTACGC
Read:    ccTCAG--GCAGTAgtg
Pos: 5
CIGAR: 2S4M2D6M3S
A read aligned end to end without indels: 50M
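A CIGAR string can be decoded into the number of read bases and the number of reference bases it covers. The sketch below uses the CIGAR from the SAM example (22M731N28M). The N operation (skipped region, used for introns) is part of the SAM specification, although it is not in the list above. The function names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")       # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")   # operations that use up reference bases

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)

def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)

cigar = "22M731N28M"
start = 616000  # 1-based start position from the SAM example
end = start + reference_span(cigar) - 1
print(parse_cigar(cigar), query_length(cigar), reference_span(cigar), end)
```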
Samtools for SAM/BAM files
Library and software package (C, Java)
Creating, sorting, indexing SAM & BAM
Visualizing alignments in the command line
SNP calling
Short indel detection
BAM (binary representation of SAM): ~25% file size reduction
Read mapping statistics
e.g. using RSeQC (package)
[Figure: density of reads vs. GC content (%)]
[Figure: nucleotide frequency (A, T, G, C) vs. position of read]
Read mapping statistics: read mapping across genes
[Figure: read number vs. percentile of gene body (5'->3')]
Read mapping statistics: splicing junctions
known 89%
complete_novel 9%
partial_novel 2%
Read mapping statistics: duplicate and unique reads
[Figure: number of reads (log10) vs. frequency; series: sequence-base and mapping-base; secondary axis: reads %]
Read mapping statistics: q values on mapped reads
[Figure: Phred quality score vs. position of read]
Visualization
Integrative Genomics Viewer, IGV (Broad Inst.)
Custom tracks at UCSC Genome Browser
Peak characteristics differ with signal
H3K4me3: Sharp promoter peaks
H3K36me3: Broad transcription elongation signal
Important file formats
Sequences: FastQ
Aligned reads: SAM/BAM
Genome annotations: Bed, Gff
Coverage: Wig (Tdf)
http://genome.ucsc.edu/FAQ/FAQformat.html
BED format
chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature.
For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
http://genome.ucsc.edu/FAQ/FAQformat.html
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000
BED continued
strand - Defines the strand - either '+' or '-'.
thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
blockCount - The number of blocks (exons) in the BED line.
blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
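Because blockStarts are relative to chromStart, the genomic coordinates of each block have to be computed. Below is a minimal sketch that does this for the cloneB example line, assuming whitespace-separated fields and all 12 BED columns present. The function name `bed12_blocks` is illustrative.

```python
def bed12_blocks(line):
    """Return (chrom, start, end) for each block of a 12-column BED line.

    Coordinates are 0-based and end-exclusive, as in BED itself.
    """
    fields = line.split()
    chrom, chrom_start = fields[0], int(fields[1])
    block_count = int(fields[9])
    # A trailing comma is allowed in the lists, so skip empty items.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockSizes/blockStarts do not match blockCount")
    return [(chrom, chrom_start + s, chrom_start + s + size)
            for s, size in zip(starts, sizes)]

line = "chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601"
blocks = bed12_blocks(line)
print(blocks)
```

The last block ends at 2000 + 3601 + 399 = 6000, which agrees with chromEnd.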
WIG format (coverage format)
Wiggle format (WIG) allows the display of continuous-valued data in a track format
Variable step:
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
is equivalent to:
variableStep chrom=chr2 span=5
300701 12.5
Fixed step:
fixedStep chrom=chr3 start=400601 step=100
11
22
33
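In a fixedStep block only the values are listed, and the positions follow from start and step. Below is a minimal sketch of expanding such a block into explicit positions. It assumes well-formed input, ignores the optional span parameter, and the function name `parse_fixed_step` is illustrative.

```python
def parse_fixed_step(lines):
    """Yield (chrom, position, value) for each data line of fixedStep WIG input."""
    chrom = start = step = None
    index = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Declaration line: key=value pairs after the keyword.
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            start = int(params["start"])
            step = int(params.get("step", 1))
            index = 0
        else:
            # Data line: the n-th value belongs to position start + n * step.
            yield chrom, start + index * step, float(line)
            index += 1

wig = [
    "fixedStep chrom=chr3 start=400601 step=100",
    "11",
    "22",
    "33",
]
points = list(parse_fixed_step(wig))
print(points)
```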
Data Repositories
Short Read Archive (fastq) [discontinued!]
http://www.ncbi.nlm.nih.gov/sra
European Nucleotide Archive
Gene Expression Omnibus (bed, wig, fastq)
http://www.ncbi.nlm.nih.gov/geo/
SEQAnswers, an active forum for discussions on next-generation sequencing methods and bioinformatics
http://seqanswers.com/
Genome-independent transcriptome reconstruction: accuracy and coverage
Grabherr et al. Nature Biotechnology, July 2011