Bioinformatics in next generation sequencing projects
Rickard Sandberg
Assistant Professor
Department of Cell and Molecular Biology
Karolinska Institutet
May 2013
Thursday, May 16, 13
Standard sequence library generation
Illumina Sequencing Technology
Illumina (Solexa) Sequencing
Illumina paired-end and index-read sequencing
Once sequenced, the problem becomes computational
Computational analysis is the bottleneck
• Rapid improvement in sequencing
• Still a need for customized analysis in most projects
Overview of computational analyses
Image analysis, base calling
Primary analyses: mapping (assembly)
Data type-specific analyses (e.g. peak calling, calculating expression)
Custom project-specific analyses
[Figure labels: ChIP-Seq peak calling; RNA-Seq expression levels; genome sequence; assembled contig]
Preliminary Analyses
Raw images (TB)
Platform-specific analysis using the vendor's programs
Sequences and quality scores: text file (GB)
Real Time Analysis
Sequenced reads

Fasta file:
>EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC

Fastq file:
@HWI-EAS269:1:120:1786:18#0/1
GAACTCTGCCTTTTTCAGTGATGAGGAAAGGAGTTCTCTCTGGTCCCCAG
+HWI-EAS269:1:120:1786:18#0/1
aaab^_U_aa[U[_Z]a`WU_^X`GT^_\TM^^\___\Z\YQVVXUBBBB
(Line 1: read identifier. Line 4: quality scores.)

SOLiD, csfasta file:
>1_39_146_F3
T22100200202311030112002022222002021
>1_39_194_F3
T11022322003020303320012223122202221

SOLiD, QV file:
>1_39_146_F3
14 6 21 27 5 18 6 15 22 27 18 17 14 18 26 15 24 19 18 18 8 20 17 12 20 6 14 13 23 6 11 12 7 13 4
>1_39_194_F3
26 27 16 27 23 22 23 25 22 10 5 21 4 17 20 26 26 17 25 27 23 25 14 24 26 4 4 4 4 4 4 4 4 4 14
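As a minimal sketch of how the four-line Fastq layout above can be read, the following Python generator yields one (identifier, sequence, quality) tuple per record. The function name `read_fastq` is hypothetical, and the sketch assumes each record occupies exactly four lines (no wrapped sequences).

```python
from io import StringIO


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record Fastq stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed Fastq record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual


# The Fastq record from the slide (backslashes are escaped for Python).
example = StringIO(
    "@HWI-EAS269:1:120:1786:18#0/1\n"
    "GAACTCTGCCTTTTTCAGTGATGAGGAAAGGAGTTCTCTCTGGTCCCCAG\n"
    "+HWI-EAS269:1:120:1786:18#0/1\n"
    "aaab^_U_aa[U[_Z]a`WU_^X`GT^_\\TM^^\\___\\Z\\YQVVXUBBBB\n"
)
records = list(read_fastq(example))
```

Each base has exactly one quality character, which is why the sketch rejects records whose sequence and quality lines differ in length.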
Phred Quality Score, Q
Each base call has an estimate of the probability of being wrong (error probability, p)
Q = -10 * log10(p)
| Phred Quality Score | Probability of incorrect base call | Base call accuracy |
|---|---|---|
| 10 | 1 in 10 | 90 % |
| 20 | 1 in 100 | 99 % |
| 30 | 1 in 1000 | 99.9 % |
| 40 | 1 in 10000 | 99.99 % |
| 50 | 1 in 100000 | 99.999 % |
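The relation Q = -10 * log10(p) can be checked numerically. The short Python sketch below converts in both directions and prints the error probability and accuracy for the scores listed above; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability p for a Phred score Q, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: p = {p:g}, accuracy = {100 * (1 - p):g} %")
```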
FastQ encodings
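Fastq files store each quality score as one ASCII character, and the encodings differ in the offset added to the score: Sanger and Illumina 1.8+ files use an offset of 33, while Illumina 1.3 to 1.7 files use 64. The Python sketch below decodes a quality string for a given offset. The `guess_offset` helper is a rough heuristic added here for illustration, not a standard tool: characters with ASCII codes below 59 can only occur with offset 33, and anything else is treated as offset 64 even though it may be ambiguous.

```python
def decode_qualities(quality_string, offset):
    """Convert ASCII-encoded qualities to integer scores for the given offset."""
    return [ord(ch) - offset for ch in quality_string]


def guess_offset(quality_string):
    """Heuristic: codes below 59 imply offset 33; otherwise assume 64 (may be ambiguous)."""
    return 33 if min(ord(ch) for ch in quality_string) < 59 else 64


old_illumina = "aaab^_U_aaBBBB"   # characters as in the Fastq example slide
sanger_style = "#9;-7+2@4:2=20"   # characters as in the SAM example slide
old_scores = decode_qualities(old_illumina, guess_offset(old_illumina))
sanger_scores = decode_qualities(sanger_style, guess_offset(sanger_style))
```

In practice the encoding is better taken from the sequencing platform and pipeline version than guessed from a single read.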
Fastq quality control (FastQC)
Video tutorial: http://www.youtube.com/watch?v=bz93ReOv87Y
Quality scores for each sequence position
Quality scores for each sequence position: a good run
GC for reads
Percent A,C,G,T at each position
Relative enrichment of kmers
Short Read Assembly
Velvet and SOAPdenovo: de novo genomic assemblers specially designed for short read sequencing technologies
Nature 2009
Two principal approaches for transcriptome reconstruction
Genome-independent transcriptome reconstruction
Grabherr et al. Nature Biotechnology, July 2011
Default k = 25
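The parameter k is the word length into which reads are split before assembly. As a rough illustration only (not the published method), the Python sketch below splits reads into k-mers and builds a simple de Bruijn-style graph in which each (k-1)-mer points to the (k-1)-mers that follow it. The toy reads and the small k are made up so the graph stays readable; k=25 is kept as the default argument.

```python
from collections import defaultdict


def kmers(read, k=25):
    """All overlapping substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k=25):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical overlapping reads; k=4 only to keep the example small.
toy_reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = de_bruijn_graph(toy_reads, k=4)
```

A read of length L contributes L - k + 1 k-mers, so a 50 nt read yields 26 k-mers at k = 25.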
Finding novel non-annotated genes or transcript variants
Mapping of millions of short reads
Task: Map millions of short sequences (25-100 nt) onto a genome (3 000 Mbp) or transcriptome
Mismatches (sequencing errors and SNPs)
Unique / Repetitive matches
Indels (Normal variation, CNVs)
Large rearrangements (translocations)
BLAST and BLAT were not designed for these tasks
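To make the task concrete, here is a toy seed-and-verify mapper in Python: it indexes every k-mer of the reference, looks up non-overlapping seeds from the read, and checks each candidate position while tolerating a few mismatches. This is a teaching sketch under simplifying assumptions (substitutions only, no indels, one strand, a made-up reference); real short-read mappers use far more compact indexes.

```python
from collections import defaultdict


def build_index(genome, k):
    """Map every k-mer in the genome to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def map_read(read, genome, index, k, max_mismatches=1):
    """Return sorted 0-based positions where the read fits with few substitutions."""
    hits = set()
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping seeds
        for seed_pos in index.get(read[offset:offset + k], []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(genome):
                continue
            window = genome[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


genome = "ACGTACGTTAGCCGATAGGCTAACGT"  # hypothetical reference
index = build_index(genome, k=4)
unique_hit = map_read("TAGCCGAT", genome, index, k=4)        # exact, unique
with_mismatch = map_read("TAGCCGTT", genome, index, k=4)     # one substitution
repeat_hits = map_read("ACGT", genome, index, k=4, max_mismatches=0)  # repetitive
```

The last call shows the unique versus repetitive distinction from the slide: the same read matches several positions equally well.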
Mapping of RNA-Seq reads
Garber et al. 2011 Nat Methods
STAR
Mapping of splice junctions
1. Compile sets of junctions
Exon n GTAAGT-----------AG Exon n+1
Genome chromosome Fasta files + known and putative splice junctions Fasta file
2. Map reads towards genome + junction compilation
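The first step, compiling junction sequences, can be sketched in Python as follows: for every pair of an upstream and a downstream exon, the last bases of the first exon are joined to the first bases of the second, giving a sequence that a spliced read can align to directly. The coordinates, the toy genome and the `flank` parameter are assumptions for illustration; in practice the flank is typically chosen relative to the read length.

```python
def junction_sequences(genome, exons, flank):
    """Build (name, sequence) pairs for all upstream/downstream exon pairings.

    exons: (start, end) tuples, 0-based half-open, sorted by start.
    flank: number of bases taken from each side of the junction.
    """
    junctions = []
    for i, (s1, e1) in enumerate(exons):
        for j in range(i + 1, len(exons)):
            s2, e2 = exons[j]
            left = genome[max(s1, e1 - flank):e1]    # end of upstream exon
            right = genome[s2:min(e2, s2 + flank)]   # start of downstream exon
            junctions.append((f"exon{i + 1}_exon{j + 1}", left + right))
    return junctions


# Hypothetical gene: three exons separated by GT...AG introns.
genome = "AAACCC" + "GTAAGTTTTTAG" + "GGGTTT" + "GTAAGTCCCCAG" + "CATCAT"
exons = [(0, 6), (18, 24), (36, 42)]
junctions = junction_sequences(genome, exons, flank=4)
```

The resulting sequences would be written to a Fasta file and mapped against together with the genome, as in step 2.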
TopHat, first method: identifying the transcriptome
Identify candidate exons via genomic mapping
Generate possible pairings of exons
Align "unmappable" reads to possible junctions
[Diagram: exons A, B, C and their possible pairings]
Longer reads
By segmenting the long reads and mapping the segments independently, we can look harder for junctions we might have missed with shorter reads
Very long (100 kb+) intron
Running time independent of intron size
[Figure: read >HWI-EAS229_75_30DY0AAXX:7:1:0:949 with the sequences GATGTTCTCAGTGTCC, GATGTAATCAGTGTCC and AACCCTCTCAGTGTCC]
Mapping to transcriptome
[Figure: gene with exons, introns, 5'UTR and 3'UTR on genomic DNA (W and C strands); transcription gives pre-mRNA; RNA processing (splicing, polyadenylation) gives mRNA with a poly(A) tail]
Microexons and junction coverage
[Figure: gene with exons, introns, 5'UTR and 3'UTR on genomic DNA (W and C strands)]
2 or more splice junctions within the same read
Panels: in-house mapping; TopHat mapping
Different read lengths will have different problems!
Example of STAR aligned single-cell RNA-Seq data
Mapping speed: 308 M reads / hour
% uniquely mapping: 60
% multimapping: 25
% unmapped: 15
281 719 splice junctions:
279 356 with GT/AG
2 123 with GC/AG
215 with AT/AC
Storing mapped Alignments
Formats for storing alignments should include:
genomic coordinates
mismatches, insertions, deletions, etc.
quality information
Samtools
Sequence Alignment Map (SAM)
Generic Alignment format
Supports long and short reads
Human readable, flexible and compact
Emerging standard
http://samtools.sourceforge.net/
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
SAM Example
HWI-EAS269:1:114:1242:1582#0 16 chrY 616000 255 22M731N28M * 0 0 ATTTCGACCATGATCATCGAACCTTCCCCTGGATCCACTTCCACGATCAC #9;-7+2@4:2=20-14=:><?<;:BB?:4<BB?ABBBBABCBBBBC=BB NM:i:0 XS:A:-
16: bit field, where 16 means reverse strand
616000: start position
22M731N28M: alignment structure. Here: 22 aligned bases, then a 731-base intron, then 28 aligned bases
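A SAM record is a single tab-separated line with eleven mandatory fields followed by optional tags. The Python sketch below splits the example record into named fields and tests bit 16 of the flag for the reverse strand. The field names follow the SAM specification; `parse_sam_line` itself is a hypothetical helper, not part of SAMtools.

```python
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of mandatory fields plus tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = {}
    for tag in cols[11:]:
        name, type_, value = tag.split(":", 2)  # e.g. NM:i:0
        record["tags"][name] = int(value) if type_ == "i" else value
    record["reverse_strand"] = bool(record["flag"] & 16)
    return record


# The record from the slide, joined with tabs.
example_line = "\t".join([
    "HWI-EAS269:1:114:1242:1582#0", "16", "chrY", "616000", "255",
    "22M731N28M", "*", "0", "0",
    "ATTTCGACCATGATCATCGAACCTTCCCCTGGATCCACTTCCACGATCAC",
    "#9;-7+2@4:2=20-14=:><?<;:BB?:4<BB?ABBBBABCBBBBC=BB",
    "NM:i:0", "XS:A:-",
])
rec = parse_sam_line(example_line)
```

Note that the start position in SAM is 1-based, unlike the 0-based BED coordinates described later.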
CIGAR Format
M, match/mismatch
I, insertion
D, deletion
S, softclip
...
Ref:    GCATTCAGATGCAGTACGC
Read:     ccTCAG--GCAGTAgtg
Pos: 5
CIGAR: 2S4M2D6M3S
50M (50 aligned bases, no gaps)
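A CIGAR string is a sequence of length/operation pairs, so it can be parsed with a regular expression and summed to find how many reference bases and how many read bases an alignment covers. The Python sketch below does this for the operations in the SAM specification; the function names and the third example string are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_READ = set("MIS=X")       # operations that advance along the read


def parse_cigar(cigar):
    """Return a list of (length, operation) tuples, rejecting malformed strings."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered, including deletions and introns."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def read_length(cigar):
    """Number of read bases described, including soft-clipped bases."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


spliced = "22M731N28M"          # from the SAM example slide
ungapped = "50M"
mixed = "5S20M1I10M2D15M"       # hypothetical, mixes several operations
```

For the spliced example, the 50-base read spans 781 reference bases because the 731-base intron (N) consumes reference but not read.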
Samtools for SAM/BAM files
Library and software package (C, Java)
Creating, sorting, indexing SAM & BAM
Visualizing alignments on the command line
SNP calling
Short indel detection
BAM (binary representation of SAM): ~25% file size reduction
Read mapping statistics
e.g. using RSeQC (package)
[Figure: density of reads vs. GC content (%)]
[Figure: nucleotide frequency (A, T, G, C) vs. position of read]
Read mapping statistics: read mapping across genes
[Figure: read number vs. percentile of gene body (5'->3')]
Read mapping statistics
Splicing junctions: known 89%, complete_novel 9%, partial_novel 2%
Read mapping statistics: duplicate and unique reads
[Figure: number of reads (log10) vs. frequency, for sequence-based and mapping-based duplicates; secondary axis: reads %]
Read mapping statistics: q values on mapped reads
[Figure: Phred quality score vs. position of read]
Visualization
Integrative Genomics Viewer (Broad Inst.)
Custom tracks at UCSC Genome Browser
Peak characteristics differ with signal
H3K4me3: Sharp promoter peaks
H3K36me3: Broad transcription elongation signal
Important file formats
Sequences: FastQ
Aligned reads: SAM/BAM
Genome annotations: Bed, Gff
Coverage: Wig (Tdf)
http://genome.ucsc.edu/FAQ/FAQformat.html
BED format
chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature.
For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
http://genome.ucsc.edu/FAQ/FAQformat.html
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000
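Because BED intervals are 0-based and exclude the chromEnd base, the feature length is simply chromEnd minus chromStart, and converting to 1-based inclusive coordinates (as used in SAM) only shifts the start. The Python sketch below illustrates this with the example line; the helper names are made up for this example.

```python
def parse_bed3(line):
    """Return (chrom, chromStart, chromEnd) from the first three BED columns."""
    chrom, start, end = line.split()[:3]
    return chrom, int(start), int(end)


def bed_to_one_based(start, end):
    """Convert 0-based half-open [start, end) to 1-based inclusive (first, last)."""
    return start + 1, end


chrom, start, end = parse_bed3("chr22 1000 5000")
length = end - start                       # number of bases covered
first, last = bed_to_one_based(start, end)
```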
BED continued
strand - Defines the strand - either '+' or '-'.
thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays).
thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays).
itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
blockCount - The number of blocks (exons) in the BED line.
blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601
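Since blockStarts are relative to chromStart, each block's genomic interval is obtained by adding chromStart to the block start and then adding the block size. A small Python sketch for the 12-column example line follows; `bed12_blocks` is a hypothetical helper, and columns 4 and 5 (name and score in the UCSC format description) are not used here.

```python
def bed12_blocks(line):
    """Return [(chrom, start, end), ...] for the blocks of a 12-column BED line.

    Coordinates are 0-based half-open, like chromStart/chromEnd.
    """
    fields = line.split()
    chrom, chrom_start = fields[0], int(fields[1])
    block_count = int(fields[9])
    # Lists may carry a trailing comma, so empty items are skipped.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    return [(chrom, chrom_start + s, chrom_start + s + size)
            for s, size in zip(starts, sizes)]


line = "chr22 2000 6000 cloneB 900 - 2000 6000 0 2 433,399, 0,3601"
blocks = bed12_blocks(line)
```

The last block ends exactly at chromEnd (6000), which is a useful sanity check when writing such lines by hand.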
WIG format (coverage format)
Wiggle format (WIG) allows the display of continuous-valued data in a track format

Variable step:
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
is equivalent to:
variableStep chrom=chr2 span=5
300701 12.5

Fixed step:
fixedStep chrom=chr3 start=400601 step=100
11
22
33
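The equivalence of the two variableStep forms, and the implicit positions in fixedStep, can be made explicit by expanding WIG lines into intervals. The Python sketch below is a simplified reader that handles only the declarations shown in this section; the function name is hypothetical. WIG positions are 1-based, and the returned end coordinate is inclusive.

```python
def expand_wig(lines):
    """Yield (chrom, start, end_inclusive, value) for each WIG data line."""
    mode = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "#")):
            continue
        if line.startswith(("variableStep", "fixedStep")):
            parts = line.split()
            mode = parts[0]
            params = dict(p.split("=", 1) for p in parts[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))  # span defaults to one base
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params["step"])
            continue
        if mode == "variableStep":
            pos, value = line.split()   # each data line carries its position
            start = int(pos)
        elif mode == "fixedStep":
            start, value = position, line  # position is implied by start/step
            position += step
        else:
            raise ValueError("data line before any step declaration")
        yield chrom, start, start + span - 1, float(value)


per_base = list(expand_wig([
    "variableStep chrom=chr2",
    "300701 12.5", "300702 12.5", "300703 12.5", "300704 12.5", "300705 12.5",
]))
with_span = list(expand_wig(["variableStep chrom=chr2 span=5", "300701 12.5"]))
fixed = list(expand_wig([
    "fixedStep chrom=chr3 start=400601 step=100", "11", "22", "33",
]))
```

Both variableStep forms cover bases 300701 to 300705 with the value 12.5; the span version just states it in one line.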
Data Repositories
Short Read Archive (fastq) [discontinued!]
http://www.ncbi.nlm.nih.gov/sra
European Nucleotide Archive
Gene Expression Omnibus (bed, wig, fastq)
http://www.ncbi.nlm.nih.gov/geo/
SEQAnswers, an active forum for discussions on next-generation sequencing methods and bioinformatics
http://seqanswers.com/
Genome-independent transcriptome reconstruction: accuracy and coverage
Grabherr et al. Nature Biotechnology, July 2011