Bioinformatic analysis of RNA-Seq data
Christophe Klopp / www.sigenae.org, bioinfo.genotoul.fr
Overview
– Transcriptome and transcription variability
– Sequencing techniques
– Usual questions
– Data quality control
– Read spliced alignment
– Expression quantification
– Novel gene and transcript identification
Sigenae
– 7 engineers working in farm animal genomics
– 30 running projects
– > 400 publications (citing the team or having a team member among the authors)
Bioinfo Genotoul
– 12 engineers
– > 4,000 CPUs, 1 PB disk space
– 10 training sessions
– > 20 running projects
Common software developments
http://ng6.toulouse.inra.fr/
http://bioinfo.genotoul.fr/jvenn/example.html
http://ngspipelines.toulouse.inra.fr:9019/
http://rnaspace.org/
Definitions
RNA-Seq: RNA-Seq (RNA sequencing), also called Whole Transcriptome Shotgun Sequencing (WTSS), is a technology that uses the capabilities of next-generation sequencing to reveal a snapshot of RNA presence and quantity from a genome at a given moment in time.
http://en.wikipedia.org/wiki/RNA-Seq
RNA-Seq aims
Find the structures and functions of expressed genes and transcripts (possible splice forms),
Measure the expression levels usually to find differentially expressed transcripts (explaining the phenotype),
Find polymorphisms in the transcripts: SSR (simple sequence repeat), SNP (single nucleotide polymorphism), INDEL (insertion / deletion).
RNA-Seq limitations
No sequencer is able, today, to produce large quantities of reliable sequences corresponding to full-length transcripts:
– HiSeq produces short reads
– MiSeq, PGM and Proton produce lower read numbers (quantities)
– PacBio reads have a high error rate and low throughput
Hands-on
In small groups, define transcription and the different products produced by transcription.
Group the products depending on their features.
List the different forms of variability found in transcription products and discuss their impact.
GENCODE view
Lnc-RNA counts in GENCODE
Transcription variability
Number of transcripts
– possible variation factor between transcripts: 10^6 or more,
– expression variation between samples (biological replicates, technical replicates).
Many types of transcripts: mRNA, ncRNA, ...
Isoforms (with non-canonical splice sites)
Intron retention
– Splicing is not always completed
– Is it a new isoform or a transcription error?
Transcript decay (degradation)
Allele-specific expression
Gene fusion (found in cancer cells)
http://www.nature.com/emboj/journal/v25/n5/fig_tab/7601023a_F2.html
Sequencers
Technological variability
Types of reads
– Long (> 200 bp … 40 kb)
– Short (16 bp … 200 bp)
Number of reads: millions … billions
Strand-specific or not
Paired or not
Different biases
Illumina library preparation
Illumina Sequencing protocol
[Figure: layout of a read pair (raw read 1, raw read 2), and the case of small or empty inserts; labels: adapter, polyA.]
The output: FASTQ file
Fastq file format
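As an illustration of the format, here is a minimal Python sketch of a FASTQ reader. It assumes the common four-line record layout (`@name`, sequence, `+`, quality string) and a Phred+33 quality encoding; the function and class names are hypothetical, and older Illumina files using a +64 offset would need `offset=64`.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream (assumed layout)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or blank line): stop reading
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes a Phred score shifted by `offset`.
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Two made-up reads, only meant to show the record structure.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!5?I\n"
for record in read_fastq(io.StringIO(example)):
    print(record.name, record.sequence, record.qualities)
```

For a gzip-compressed file, the same function could be fed with `gzip.open(path, "rt")`.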
Usual questions
– How long should my reads be?
– Single-end or paired-end?
– Is one pooled sample enough?
– How many replicates?
– Technical and/or biological replicates?
– How many reads for each sample?
– How many conditions for a full transcriptome?
ENCODE answers in 2011
RNA-Seq is not a mature technology.
Experiments should be performed with two or more biological replicates, unless there is a compelling reason why this is impractical or wasteful.
A typical R² (Pearson) correlation of gene expression (RPKM) between two biological replicates, for RNAs that are detected in both samples using RPKM or read counts, should be between 0.92 and 0.98. Experiments with biological correlations that fall below 0.9 should either be repeated or explained.
Between 30M and 100M reads per sample depending on the study.
NB. Guidelines for the information to publish with the data.
http://encodeproject.org/ENCODE/dataStandards.html
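The replicate check described above can be sketched in Python. This is a minimal illustration, assuming the usual RPKM definition (reads per kilobase of transcript per million mapped reads) and a plain Pearson correlation; the function names and the expression values are hypothetical, not ENCODE data.

```python
import math
from typing import Sequence


def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of the same length (>= 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def replicates_agree(r_squared: float, threshold: float = 0.9) -> bool:
    """Apply the 0.9 cut-off quoted on the slide to an R² value."""
    return r_squared >= threshold


# Made-up RPKM values for five genes detected in both replicates.
replicate_1 = [12.0, 250.0, 3.5, 48.0, 900.0]
replicate_2 = [10.5, 262.0, 4.1, 51.0, 870.0]
r_squared = pearson_r(replicate_1, replicate_2) ** 2
print(f"R2 = {r_squared:.4f}, acceptable: {replicates_agree(r_squared)}")
```

In practice the correlation is often computed on log-transformed values; that choice is left open here.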
Statistician answers
Fewer reads, more samples
http://www.sciencedirect.com/science/article/pii/S0378111914013869
Analysis workflow
Data quality control
Spliced mapping
Gene and transcript discovery
Quantification
Verifying RNA-Seq raw data
FastQC:
http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
● Import of data from BAM, SAM or FastQ files
● quick overview
● Summary graphs and tables to quickly assess your data
● Export of results to an HTML report
● Offline operation to allow automated generation of reports
● Color code to check quickly the quality
Quality control
– Technical characteristics conformity
– Contamination search
– Classical RNA-Seq biases
Example: hexamer random priming
Bias impact on alignment
[Figure: per-position plot of read start sites (orange, "starts" axis) and coverage (blue, "depth" axis).]
Transcript length bias
– the differential expression of longer transcripts is more likely to be identified than that of shorter transcripts
Hands-on
Run FastQC (fastqc) on one of the fastq files found on your USB stick
In groups explain the different graphics produced by fastqc
Take home messages on quality analysis
Elements to be checked:
– Random priming effect
– K-mer (polyA, polyT)
Alignment on reference for the second quality check and filtering.
A good run has:
– the expected number of reads (2 x 500 million / flowcell),
– the expected read length (100 bp),
– a random nucleotide composition and the expected GC%,
– a high alignment rate: very few unmapped reads, pairs mapped on opposite strands (shown in the next part).
Analysis workflow
Data quality control
Spliced mapping
Gene and transcript discovery
Quantification
Where to find a reference genome?
Fasta file
Retrieving the genome file:
– The Genome Reference Consortium
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
– Warning: NCBI chromosome naming with « | » is not well supported by mapping software
– Prefer Ensembl (EMBL-EBI):
http://www.ensembl.org/info/data/ftp/index.html
The chromosome names should be the same in the gtf file and fasta file.
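A quick way to check this naming consistency is sketched below in Python. It assumes that the sequence name is the first whitespace-delimited token of each FASTA header and the first tab-delimited column of each GTF line; the function names and the toy inputs are hypothetical.

```python
from typing import Iterable, Set


def fasta_sequence_names(lines: Iterable[str]) -> Set[str]:
    """Collect the first token of every '>' header line."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def gtf_sequence_names(lines: Iterable[str]) -> Set[str]:
    """Collect the seqname column (first field) of every annotation line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def names_missing_from_fasta(fasta_lines: Iterable[str], gtf_lines: Iterable[str]) -> Set[str]:
    """Sequence names used in the GTF that have no matching FASTA entry."""
    return gtf_sequence_names(gtf_lines) - fasta_sequence_names(fasta_lines)


# Toy example: the FASTA names the chromosome "22", the GTF calls it "chr22".
fasta = ">22 dna:chromosome\nACGTACGT\n".splitlines()
gtf = 'chr22\tensembl\texon\t1\t8\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'.splitlines()
print(names_missing_from_fasta(fasta, gtf))
```

An empty result means every chromosome referenced by the annotation exists in the genome file.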
Reference transcriptome file
What is a GTF file ?
– Tab delimited text file
– derived from GFF (General Feature Format, for description of genes and other features)
– Gene Transfer Format : http://genome.ucsc.edu/FAQ/FAQformat.html#format4
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
The [attribute] list must begin with:
gene_id value : unique identifier for the genomic source of the sequence.
transcript_id value : unique identifier for the predicted transcript.
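The field layout above can be illustrated with a small Python parser. This is a sketch that assumes the nine tab-separated columns listed on the slide and `key "value";` attribute pairs; the names are hypothetical and the example line is invented. Attribute values that themselves contain a semicolon are not handled.

```python
from typing import Dict, NamedTuple


class GtfFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_gtf_line(line: str) -> GtfFeature:
    """Split one GTF line into its fixed columns and its attribute dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("a GTF line needs at least 9 tab-separated fields")
    attributes: Dict[str, str] = {}
    for chunk in fields[8].split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition(" ")
        attributes[key] = value.strip().strip('"')
    return GtfFeature(fields[0], fields[1], fields[2], int(fields[3]), int(fields[4]),
                      fields[5], fields[6], fields[7], attributes)


# Invented exon line, only meant to show the structure.
line = '22\tprotein_coding\texon\t32000\t32156\t.\t-\t.\tgene_id "GENE1"; transcript_id "TX1"; exon_number "2";'
feature = parse_gtf_line(line)
print(feature.seqname, feature.start, feature.end, feature.attributes["gene_id"])
```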
Splice sites
– Canonical splice sites: GT (donor) and AG (acceptor), which account for more than 99% of splicing
– Non-canonical site: GC-AG splice site pairs, AT-AC pairs
– Trans-splicing : splicing that joins two exons that are not within the same RNA transcript
http://en.wikipedia.org/wiki/RNA_splicing
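The donor/acceptor pairs listed above can be checked on a sequence with a few lines of Python. This sketch assumes a forward-strand intron given as 0-based, end-exclusive coordinates on a genome string; the function name and the toy sequences are hypothetical.

```python
CANONICAL_PAIRS = {("GT", "AG")}
NON_CANONICAL_PAIRS = {("GC", "AG"), ("AT", "AC")}


def classify_intron(genome: str, intron_start: int, intron_end: int) -> str:
    """Classify an intron by its first two and last two bases.

    Coordinates are 0-based and end-exclusive, forward strand only.
    """
    if intron_start < 0 or intron_end > len(genome) or intron_end - intron_start < 4:
        raise ValueError("intron coordinates out of range or intron too short")
    intron = genome[intron_start:intron_end].upper()
    pair = (intron[:2], intron[-2:])
    if pair in CANONICAL_PAIRS:
        return "canonical"
    if pair in NON_CANONICAL_PAIRS:
        return "non-canonical"
    return "unknown"


# Toy genome: 6 exonic bases, an 18-base intron, 6 exonic bases.
genome = "AAACAG" + "GTAAGTTTTTCCTTTCAG" + "GGATCC"
print(classify_intron(genome, 6, 24))
```

Introns of genes on the reverse strand would need the reverse complement to be taken first.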
Spliced alignment
– The recognition of exon/intron junctions can be inferred from the reads that overlap the splicing sites. The resulting spliced reads can produce very short alignments: part of the read will not map contiguously to the reference.
→ therefore this approach requires a dedicated algorithm
– Generation:
● Sim4
● SEQanswers: http://seqanswers.com/wiki/Software/list
– Idea:
● Database of potential splice junction sequences (known)
● Canonical / non-canonical splice site search (seed then mapping)
Exon first vs seed extend
TopHat
http://tophat.cbcb.umd.edu/
– Aligns RNA-Seq reads to a reference genome with Bowtie
– Splice junction mapper for reads, usable without prior knowledge of the junctions
– Identifies splice junctions between exons
http://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools#Spliced_aligners
TopHat initial algorithm: first step
– TopHat finds junctions by mapping reads to the reference:
● all reads are mapped to the reference genome using Bowtie
● reads not mapped to the genome are set aside as IUM (initially unmapped)
● low complexity reads are discarded
● for each read: allow up to 20 alignments
Trapnell C et al. Bioinformatics 2009;25:1105-1111
Exon assembly process
– TopHat then assembles mapped reads
– Define islands: aggregate mapped reads into islands of candidate exons
● Generate potential donor/acceptor splice sites using neighbouring exons
– Extend islands to cover possible splice junctions
● +/- 45 bp from the reference on either side of the island
Trapnell C et al. Bioinformatics 2009;25:1105-1111
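The island idea can be illustrated with a simplified Python sketch: overlapping read alignments are merged into intervals, and each interval is then widened by 45 bp on both sides. This is not TopHat's implementation; the function name, the half-open coordinate convention and the example alignments are assumptions made for the illustration.

```python
from typing import Iterable, List, Optional, Tuple

Interval = Tuple[int, int]  # 0-based start, exclusive end


def build_islands(alignments: Iterable[Interval], flank: int = 45,
                  reference_length: Optional[int] = None) -> List[Interval]:
    """Merge overlapping or touching alignments, then extend each island by `flank`."""
    islands: List[Interval] = []
    for start, end in sorted(alignments):
        if islands and start <= islands[-1][1]:
            # The read overlaps (or touches) the current island: grow it.
            islands[-1] = (islands[-1][0], max(islands[-1][1], end))
        else:
            islands.append((start, end))
    extended: List[Interval] = []
    for start, end in islands:
        new_start = max(0, start - flank)
        new_end = end + flank
        if reference_length is not None:
            new_end = min(reference_length, new_end)
        extended.append((new_start, new_end))
    return extended


# Three invented 76 bp alignments: two overlap, one lies further downstream.
print(build_islands([(100, 176), (150, 226), (1000, 1076)]))
```

Extended islands that end up overlapping each other are left as they are in this sketch.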
Splice junction reference
To map reads to splice junctions:
– Enumerate all canonical donor and acceptor sites in islands
● long (>= 75 bp) reads: "GT-AG", "GC-AG" and "AT-AC" introns
● shorter reads: only "GT-AG" introns
– Find all pairings which produce GT-AG introns between islands
● 70 bp < intron size < 20,000 bp
Trapnell C et al. Bioinformatics 2009;25:1105-1111
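The pairing step can be sketched in Python as follows. The sketch only considers the forward strand and uses the size bounds quoted above as defaults; it is a simplified illustration rather than TopHat's code, and the function name, coordinate convention and toy genome are assumptions.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # 0-based start, exclusive end


def candidate_introns(genome: str, upstream_island: Interval, downstream_island: Interval,
                      min_intron: int = 70, max_intron: int = 20000,
                      motif_pairs: Sequence[Tuple[str, str]] = (("GT", "AG"),)) -> List[Interval]:
    """Pair donor sites in one island with acceptor sites in the next one.

    Returns introns as (start, end) with `end` exclusive, keeping only those
    whose size is strictly between `min_intron` and `max_intron`.
    """
    introns: List[Interval] = []
    for donor_motif, acceptor_motif in motif_pairs:
        donors = [p for p in range(*upstream_island)
                  if genome[p:p + 2] == donor_motif]
        acceptors = [q for q in range(*downstream_island)
                     if genome[q:q + 2] == acceptor_motif]
        for donor in donors:
            for acceptor in acceptors:
                size = acceptor + 2 - donor  # intron spans donor .. acceptor + 2
                if min_intron < size < max_intron:
                    introns.append((donor, acceptor + 2))
    return sorted(introns)


# Toy genome: GT at position 5, AG at position 87, filler C's elsewhere.
genome = "CCCCC" + "GT" + "C" * 80 + "AG" + "CCCCC"
print(candidate_introns(genome, (0, 10), (80, 94)))
```

Passing `motif_pairs=(("GT", "AG"), ("GC", "AG"), ("AT", "AC"))` would mimic the behaviour described for long reads.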
IUM alignment
• Each possible intron is checked against the IUM reads
→ seed and extend alignment
Trapnell C et al. Bioinformatics 2009;25:1105-1111
TopHat Inputs
Inputs:
– bowtie2 index of the genome
ftp://ftp.cbcb.umd.edu/pub/data/bowtie_indexes/
http://bowtie-bio.sourceforge.net/index.shtml
– FASTA file (.fa) of the reference in the index directory (otherwise it will be built by bowtie)
– FASTQ file of the reads
Command lines:
bowtie2-build <reference.fasta> <index_base>
tophat [options] <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...,readsN_2]
TopHat Options
Special note on the website
Please note: TopHat has a number of parameters and options, and their default values are tuned for processing mammalian RNA-Seq reads.
If you would like to use TopHat for another class of organism, we recommend setting some of the parameters with more strict, conservative values than their defaults.
Usually, setting the maximum intron size to 4 or 5 Kb is sufficient to discover most junctions while keeping the number of false positives low.
More TopHat options
Your own junctions:
-G/--GTF <GTF2.2file>
-j/--raw-juncs <.juncs file>
--no-novel-juncs (ignored without -G/-j)
Your own insertions/deletions:
--insertions/--deletions <.juncs file>
--no-novel-indels
Library types
TopHat Outputs
Outputs:
● accepted_hits.bam: list of read alignments in compressed SAM format (BAM)
● junctions.bed: track of junctions; score: number of alignments spanning the junction
● insertions.bed and deletions.bed: tracks of insertions and deletions
● logs directory files
● unmapped.bam: unmapped or multi-mapped (over the threshold) reads
● prep_reads.info: number of reads and read length for input and output
Sequence alignment and map
– SAM (Sequence Alignment/Map) format:
● Captures all of the critical information about NGS data in a single indexed and compressed file
● Eases data sharing across tools
● Generic alignment format
● SAMtools: provides various utilities for manipulating alignments in the SAM format: sorting, merging, indexing...
http://samtools.sourceforge.net/
http://picard.sourceforge.net/explain-flags.html
– Extended CIGAR strings
– Example: intron of 81 bases (the 81N operation in the CIGAR string 58M81N18M)
ERR022486.8388510 81 22 32099 255 58M81N18M = 27484 -4772 CCTTGGTCTTGCCGAAGTAGATCTCATTGAGAGTGGAGCGGATCTTGTTCTCCATTTCCTCCACCAGGCGTCCGAT :9=<==;<<><=><?>>?<?==>>?>><?>>??<AA?@AFADDD;GDGAG@GGCBE@GG?GG>GGGG?GGGGGGGG NM:i:0 XS:A:- NH:i:1
Spliced cigar line
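To make the example concrete, the following Python sketch parses a CIGAR string and reports the reference positions skipped by each `N` operation, together with a few bits of the SAM flag. The function names are hypothetical; positions are 1-based and inclusive, as in SAM.

```python
import re
from typing import Dict, List, Tuple

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string such as '58M81N18M' into (length, operation) pairs."""
    operations = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in operations) != cigar:
        raise ValueError(f"unparsable CIGAR string: {cigar!r}")
    return operations


def introns_from_alignment(pos: int, cigar: str) -> List[Tuple[int, int]]:
    """Return (first_base, last_base) of every skipped region (N operation)."""
    introns = []
    reference_position = pos
    for length, op in parse_cigar(cigar):
        if op == "N":
            introns.append((reference_position, reference_position + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            reference_position += length
    return introns


def decode_flag(flag: int) -> Dict[str, bool]:
    """Decode a few of the SAM flag bits."""
    return {
        "paired": bool(flag & 0x1),
        "reverse_strand": bool(flag & 0x10),
        "first_in_pair": bool(flag & 0x40),
        "second_in_pair": bool(flag & 0x80),
    }


# Values taken from the SAM line of the slide: POS 32099, CIGAR 58M81N18M, FLAG 81.
print(introns_from_alignment(32099, "58M81N18M"))
print(decode_flag(81))
```

With these values the skipped region starts right after the first 58 aligned bases and is 81 bases long.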
BAM & BED
– BAM (Binary Alignment/Map) format:
● Compressed binary representation of SAM
● Greatly reduces storage space requirements to about 27% of original SAM
● Bamtools: reading, writing, and manipulating BAM files
– BED (Browser Extensible Data) format:
● tab-delimited text file that defines a feature track
http://genome.ucsc.edu/FAQ/FAQformat.html#format1
● The first three required BED fields are: <chromosome> <start> <end>
● 9 additional optional BED fields
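As an illustration of the 12-column BED layout, the Python sketch below parses one line and derives the intron of a junction. It assumes a junction described by two blocks (the flanking exon parts), which is the layout expected here for a junctions.bed track; the names are hypothetical and the example line is invented. BED starts are 0-based and ends are exclusive.

```python
from typing import List, NamedTuple, Tuple


class BedJunction(NamedTuple):
    chrom: str
    start: int  # 0-based
    end: int    # exclusive
    name: str
    score: int
    strand: str
    block_sizes: List[int]
    block_starts: List[int]  # relative to `start`


def _int_list(field: str) -> List[int]:
    """Parse a comma-separated list, tolerating a trailing comma."""
    return [int(value) for value in field.split(",") if value]


def parse_bed12(line: str) -> BedJunction:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated BED fields")
    return BedJunction(fields[0], int(fields[1]), int(fields[2]), fields[3],
                       int(fields[4]), fields[5],
                       _int_list(fields[10]), _int_list(fields[11]))


def intron_interval(junction: BedJunction) -> Tuple[int, int]:
    """Gap between the two blocks, as 0-based start and exclusive end."""
    if len(junction.block_sizes) != 2 or len(junction.block_starts) != 2:
        raise ValueError("expected exactly two blocks per junction")
    return (junction.start + junction.block_sizes[0],
            junction.start + junction.block_starts[1])


# Invented junction supported by 12 alignments.
line = "22\t1000\t1200\tJUNC00000001\t12\t+\t1000\t1200\t255,0,0\t2\t50,60\t0,140"
junction = parse_bed12(line)
print(junction.score, intron_interval(junction))
```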
Bed example
[Figure: annotated BED line; labelled fields: chrom, start, end, name, score, strand, drawing, RGB, blocks info.]
Mapper comparisons
Hands-on: spliced alignment
Index the genome file Danio_rerio.Zv9.62.dna.chromosome.22.fa with bowtie2
Align both pairs of read files to the genome using tophat2
ERR022486_chr22_read1.fastq.gz ERR022486_chr22_read2.fastq.gz
ERR022488_chr22_read1.fastq.gz ERR022488_chr22_read2.fastq.gz
Parameters:
– Max intron size: 5 kb
– Number of threads: 4
– Use the sample name (ERR022486, ERR022488) as output directory name
Index the accepted_hits.bam file
Count the number of alignments with samtools flagstat for ERR022486
Hands-on: commands
bowtie2-build Danio_rerio.Zv9.62.dna.chromosome.22.fa Danio_rerio.Zv9.62_chr22
tophat -p 4 --output-dir=ERR022486 -I 5000 Danio_rerio.Zv9.62_chr22 ERR022486_chr22_read1.fastq.gz ERR022486_chr22_read2.fastq.gz
samtools index ERR022486/accepted_hits.bam
samtools flagstat ERR022486/accepted_hits.bam
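Since the exercise covers two samples, the same commands can be generated for each of them. The Python sketch below only builds and prints the command lines; it assumes the file naming used in the exercise (`<sample>_chr22_read1.fastq.gz`) and the TopHat options `-p`, `-o` and `-I`, and the helper names are hypothetical.

```python
import shlex
from typing import List


def tophat_command(index_base: str, sample: str, threads: int = 4,
                   max_intron: int = 5000) -> List[str]:
    """Build the paired-end TopHat command for one sample."""
    return [
        "tophat",
        "-p", str(threads),     # number of threads
        "-o", sample,           # output directory named after the sample
        "-I", str(max_intron),  # maximum intron length
        index_base,
        f"{sample}_chr22_read1.fastq.gz",
        f"{sample}_chr22_read2.fastq.gz",
    ]


def samtools_commands(sample: str) -> List[List[str]]:
    """Index the alignment file, then count alignments."""
    bam = f"{sample}/accepted_hits.bam"
    return [["samtools", "index", bam], ["samtools", "flagstat", bam]]


for sample in ("ERR022486", "ERR022488"):
    commands = [tophat_command("Danio_rerio.Zv9.62_chr22", sample)]
    commands += samtools_commands(sample)
    for command in commands:
        print(" ".join(shlex.quote(part) for part in command))
```

Each list could then be passed to `subprocess.run(command, check=True)` on a machine where the tools are installed.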
Visualizing alignments on IGV
http://www.broadinstitute.org/igv/home
Visualizing alignments on IGV
– High-performance visualization tool
– Interactive exploration of large datasets
– Supports a wide variety of data types
– Documentation available
– Developed at the Broad Institute of MIT and Harvard
Visualizing alignments on IGV
Visualizing alignments on IGV
Import a reference genome
Visualizing alignments on IGV
Import your BAM files
Visualizing alignments on IGV
Example of BAM and BED file visualisation
Hands-on: IGV
Create the genome dr22 in IGV using Danio_rerio.Zv9.62.dna.chromosome.22.fa
Load the gtf file : Danio_rerio_chr22.Zv9.62.gtf
Load the bam file : ERR022486/accepted_hits.bam
Analysis workflow
Data quality control
Spliced mapping
Quantification
Gene and transcript discovery
What do we want to build?
The gene / transcript description file (and corresponding fasta)
The count file
If you have the model file
The model is described in the GTF file (Gene Transfer Format)
Two approaches
– Gene level
– Transcript level
Tools for each approach
● htseq-count
● Cufflinks or featureCounts
HTSeq-count
– Process the output from short read aligners in various formats
– Count how many reads map to each feature (in RNA-Seq, the features are typically genes)
● count reads by gene
● or consider each exon as a feature to check for alternative splicing
– Inputs:
● file with aligned sequencing reads: BAM (or SAM) file
● list of genomic features: GTF file
http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
HTSeq-count parameters
– Command line:
● htseq-count [options] <sam_file> <gtf_file>
● samtools view accepted_hits.bam | htseq-count --stranded=no -m intersection-nonempty - file.gtf -q > output.htseq-count.txt &
– Some options:
-m <mode> : union (default), intersection-strict or intersection-nonempty
--stranded=<yes, no, or reverse> (default yes)
-t <feature type> : feature type to count, 3rd column in GTF file (default exon)
-q : quiet
-h : help
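The three -m modes differ only in how they combine the features seen along a read. The sketch below follows the overlap-resolution rules described in the HTSeq documentation; it is an illustration, not HTSeq's own code. The function `assign_read` and its `position_sets` argument (one set of feature names per aligned position of the read) are hypothetical names introduced here.

```python
def assign_read(position_sets, mode="union"):
    """Decide which feature a read is counted for.

    position_sets: list of sets, one per aligned position of the read,
    each holding the names of the features overlapping that position
    (hypothetical input layout chosen for this sketch).
    Returns a feature name, "ambiguous" or "no_feature".
    """
    if mode == "union":
        # Every feature touched anywhere along the read is a candidate.
        candidates = set().union(*position_sets)
    elif mode == "intersection-strict":
        # A feature must cover every position of the read.
        candidates = set.intersection(*position_sets) if position_sets else set()
    elif mode == "intersection-nonempty":
        # Same as strict, but positions covered by no feature are ignored.
        nonempty = [s for s in position_sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


# A read that starts in gene A and runs past its end.
overhanging = [{"A"}, {"A"}, set()]
# A read fully inside gene A, partly overlapped by gene B.
shared = [{"A"}, {"A", "B"}]

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign_read(overhanging, mode), assign_read(shared, mode))
```

With intersection-nonempty, as used in the command above, both example reads are counted for gene A.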
HTSeq-count output
– Output: a table with counts for each feature and a summary of reads not counted for any feature:
● no_feature: reads which couldn't be assigned to any feature
● ambiguous: reads which could have been assigned to more than one feature and hence were not counted for any of these
● not_aligned: reads in the SAM file without alignment
● alignment_not_unique: reads with more than one reported alignment. These reads are recognized from the NH optional SAM field tag. (If the aligner does not set this field, multiply aligned reads will be counted multiple times.)
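Because the summary lines sit in the same two-column table as the feature counts, they usually have to be separated before any statistical analysis. A minimal Python sketch, assuming tab-separated "name, count" lines: `split_htseq_output` is a hypothetical helper, and the handling of a leading `__` prefix and of the `too_low_aQual` line is an assumption about HTSeq versions other than the one shown on the slide.

```python
# Summary categories: the four listed above, plus too_low_aQual
# (assumed here; it is not listed on the slide).
SUMMARY_NAMES = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def split_htseq_output(lines):
    """Split htseq-count output lines into (feature counts, summary)."""
    counts, summary = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, value = line.split("\t")
        # Assumption: some versions write the summary names with a
        # leading double underscore, so strip it before comparing.
        key = name.lstrip("_")
        if key in SUMMARY_NAMES:
            summary[key] = int(value)
        else:
            counts[name] = int(value)
    return counts, summary


example = ["geneA\t10\n", "geneB\t0\n", "__no_feature\t5\n", "ambiguous\t2\n"]
counts, summary = split_htseq_output(example)
print(counts)
print(summary)
```

The gene identifiers and values in `example` are made up for illustration.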
Quantification with Cufflinks
http://cufflinks.cbcb.umd.edu/
– assembles transcripts
– estimates their abundances, based on how many reads support each one
– tests for differential expression in RNA-Seq samples
Cufflinks read attribution
– Violet fragment: from which transcript?
● Use of the fragment length distribution
Cufflinks expression measurement
– Fragment attribution
– Isoform abundance estimation:
● RPKM for single reads
● FPKM for paired-end reads
Trapnell C et al. Nature Biotechnology 2010;28:511-515
RPKM / FPKM
– Transcript length bias
– RPKM: Reads Per Kilobase of exon per Million mapped reads
● A 1 kb transcript with 1000 alignments in a sample of 10 million reads (of which 8 million can be mapped) will have:
RPKM = 1000 / (1 kb × 8 million mapped reads) = 125
– The transcript length depends on isoform inference
– FPKM: Fragments Per Kilobase of exon per Million mapped fragments, for paired-end sequencing
● A pair of reads constitutes one fragment
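The worked example can be checked with a few lines of Python. This is a minimal sketch of the RPKM definition given above; the function name `rpkm` and its parameter names are introduced here for illustration. The same formula gives FPKM when fragments are counted instead of reads.

```python
def rpkm(count, length_bp, mapped_reads):
    """Reads (or fragments) per kilobase of exon per million mapped reads.

    count        : reads (RPKM) or fragments (FPKM) assigned to the transcript
    length_bp    : transcript (exonic) length in base pairs
    mapped_reads : total number of mapped reads (or fragments) in the sample
    """
    if length_bp <= 0 or mapped_reads <= 0:
        raise ValueError("length and mapped read count must be positive")
    kilobases = length_bp / 1_000
    millions = mapped_reads / 1_000_000
    return count / kilobases / millions


# Slide example: 1 kb transcript, 1000 alignments, 8 million mapped reads.
print(rpkm(1000, 1_000, 8_000_000))
```

Note that the denominator uses the 8 million mapped reads, not the 10 million sequenced reads.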
Cufflinks inputs and options
– Command line:
● cufflinks [options]* <aligned_reads.(sam/bam)>
– Some options:
-h/--help
-o/--output-dir
-p/--num-threads
-G/--GTF <reference_annotation.(gtf/gff)> : estimate isoform expression only, without assembling novel transcripts
Merging individual count files
– Each quantification is produced from a BAM file corresponding to one sample
– The quantification column has to be extracted
– The columns are then joined (paste)
– A header is added to the count file
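The merging steps can also be sketched in Python. Unlike a plain column paste, this version joins on the feature name and refuses to merge samples whose feature lists differ. `merge_counts` and the `feature` header label are hypothetical names chosen for this sketch, and the input is assumed to be tab-separated "feature, count" lines, as htseq-count writes them.

```python
def merge_counts(samples):
    """Merge per-sample count lines into one table with a header.

    samples: dict mapping a sample name to an iterable of
             "feature<TAB>count" lines.
    Returns the merged table as a list of tab-separated rows.
    """
    names = list(samples)
    if not names:
        raise ValueError("no samples given")

    # Extract the quantification column of each sample, keyed by feature.
    tables = {}
    for name in names:
        table = {}
        for line in samples[name]:
            line = line.strip()
            if line:
                feature, count = line.split("\t")
                table[feature] = count
        tables[name] = table

    # All samples must have been quantified on the same set of features.
    first = tables[names[0]]
    for name in names[1:]:
        if tables[name].keys() != first.keys():
            raise ValueError(f"feature list of {name} differs from {names[0]}")

    # Header first, then one row per feature, one column per sample.
    rows = ["\t".join(["feature"] + names)]
    for feature in first:
        rows.append("\t".join([feature] + [tables[n][feature] for n in names]))
    return rows


merged = merge_counts({
    "ERR022486": ["g1\t3", "g2\t0"],
    "ERR022488": ["g1\t5", "g2\t1"],
})
print("\n".join(merged))
```

The feature names and counts in the usage example are made up; only the two sample names come from the hands-on data.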
Statistical analysis in R (DESeq2 / edgeR)
Hands-on : quantification
1/ Quantify the genes of chromosome 22 using htseq-count and the Ensembl GTF file for both samples.
2/ Merge both files to produce the count tables. Add a header to the count table.
3/ Create a dot plot of the count table.
Hands-on : hints
samtools view ERR022486/accepted_hits.bam | htseq-count --stranded=no -m intersection-nonempty - /work/.../Danio_rerio_chr22.Zv9.62.gtf -q > ERR022486/accepted_hits.bam.htseq-count_nonempty_nostranded &
Do the same for ERR022488.
paste ERR022486/accepted_hits.bam.htseq-count_nonempty_nostranded ERR022488/accepted_hits.bam.htseq-count_nonempty_nostranded | cut -f1,2,4 > All.htseq-count
Analysis workflow
Data quality control
Spliced mapping
Gene and transcript discovery
Quantification
Transcript reconstruction
The different ways:
• Finding the gene locations
• Finding the exons
• Finding the junctions:
– Junctions between the two reads of a pair
– Junctions within a read
Defining the model building strategy
• Number of built models
• Intronic reads
The elements of the model
Gene location
Exon location
Junctions:
– Junction between the two reads of a pair
– Junction within a read
Model building strategies
Cufflinks
http://cufflinks.cbcb.umd.edu/
– assembles transcripts
– estimates their abundances, based on how many reads support each one
– tests for differential expression in RNA-Seq samples
Cufflinks transcript assembly
– Transcript assembly:
● Fragments are divided into non-overlapping loci
● Each locus is assembled independently
– Cufflinks assembler
● Finds the minimum number of transcripts that explain the reads
● Finds a minimum path cover (Dilworth's theorem):
– the largest number of mutually incompatible fragments = the minimum number of transcripts needed
– each path = a set of mutually compatible fragments overlapping each other
Trapnell C et al. Nature Biotechnology 2010;28:511-515
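The minimum path cover idea can be illustrated on a toy compatibility graph. The sketch below is not the Cufflinks implementation: it uses the textbook reduction of minimum chain cover to maximum bipartite matching on the transitive closure of a directed acyclic graph. Nodes stand for fragments and an edge (u, v) means "fragment u is compatible with, and can be followed by, fragment v"; the function names and the example graph are assumptions made for illustration.

```python
def transitive_closure(n, edges):
    """Reachability matrix of a DAG with nodes 0..n-1."""
    reach = [[False] * n for _ in range(n)]
    for u, v in edges:
        reach[u][v] = True
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def min_chain_cover(n, edges):
    """Minimum number of chains (paths) needed to cover all n fragments.

    By Dilworth's theorem this equals the largest set of mutually
    incompatible fragments. Computed as n minus a maximum bipartite
    matching on the transitive closure.
    """
    reach = transitive_closure(n, edges)
    match_right = [-1] * n  # match_right[v] = node currently linked to v

    def try_augment(u, seen):
        # Standard augmenting-path search for bipartite matching.
        for v in range(n):
            if reach[u][v] and not seen[v]:
                seen[v] = True
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    matching = sum(try_augment(u, [False] * n) for u in range(n))
    return n - matching


# Toy locus: fragment 0 is compatible with 1 and 2, both are compatible
# with 3, but 1 and 2 are incompatible with each other (two isoforms).
print(min_chain_cover(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))
```

In the toy locus, fragments 1 and 2 cannot come from the same transcript, so at least two transcripts are needed to explain all four fragments.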
Cufflinks transcript assembly
– Transcript assembly:
● Identification of incompatible fragments: they come from distinct isoforms
● Compatible fragments are connected: graph construction
Trapnell C et al. Nature Biotechnology 2010;28:511-515
Cufflinks inputs and options
– Command line:
● cufflinks [options]* <aligned_reads.(sam/bam)>
– Some options:
-h/--help
-o/--output-dir
-p/--num-threads
-G/--GTF <reference_annotation.(gtf/gff)> : estimate isoform expression only, without assembling novel transcripts
-g/--GTF-guide <reference_annotation.(gtf/gff)> : guide RABT (Reference Annotation Based Transcript) assembly
Cufflinks RABT assembly option
– Some options:
-g/--GTF-guide <reference_annotation.(gtf/gff)> : guide RABT assembly
Roberts A et al. Bioinformatics 2011;27:2325-2329
Cufflinks outputs
● transcripts.gtf: contains the assembled isoforms (coordinates and abundances)
● genes.fpkm_tracking: contains the gene FPKM values
● isoforms.fpkm_tracking: contains the isoform FPKM values
Cufflinks GTF description
– transcripts.gtf (coordinates and abundances): contains the assembled isoforms and can be visualized with a genome viewer
● GTF format + attributes (ids, FPKM, confidence interval bounds, depth of read coverage, whether all introns and exons are covered)
GTF format columns: Chr, Source, Feature, Start, End, Score, Strand, Frame, Attributes
Score: the most abundant isoform = 1000; a minor isoform is scored by the ratio minor FPKM / major FPKM
Attributes: include whether or not all introns and exons were fully covered by reads (with -g)
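To extract values such as the FPKM from transcripts.gtf, each line can be split into the nine GTF columns and the attribute column parsed as `key "value";` pairs. The sketch below assumes that standard layout; `parse_gtf_line` is a hypothetical helper, and the example line is made up for illustration rather than taken from a real Cufflinks run.

```python
import re

GTF_COLUMNS = ("seqname", "source", "feature", "start", "end",
               "score", "strand", "frame")
# Attributes are written as: key "value"; key "value"; ...
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Parse one GTF line into a dict, with attributes as a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attributes"] = dict(ATTRIBUTE_RE.findall(fields[8]))
    return record


# Illustrative line only: the identifiers and values are invented.
example = ('22\tCufflinks\ttranscript\t1000\t2000\t1000\t+\t.\t'
           'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "101.267";')
record = parse_gtf_line(example)
print(record["attributes"]["transcript_id"], record["attributes"]["FPKM"])
```

Attribute values are kept as strings; convert them with `float()` where a number is needed.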
Cufflinks GTF description
– transcripts.gtf (coordinates and abundances): contains the assembled isoforms and can be visualized with a genome viewer
● IGV visualization
Gene discovery pipeline
Alignment (TopHat)
BAM merge (samtools)
Discovery of novel features (Cufflinks)
Quantification at the gene level (htseq-count)
Quantification file merging (shell script)
Quantification strategy
– First set your gene and transcript model, i.e. build a reference GTF file
– Then use option -G to quantify the same set of elements on all your samples with sigcufflinks
– Then sort your raw_transcript.tsv files
– Cut the second or third column of each sorted file
– Paste all the columns into the count file
Hands-on : cufflinks
Merge all bam files using samtools merge.
Run cufflinks to discover new genes and transcripts using the merged bam file
Hands-on : commands
Merge all bam files :
samtools merge ALL.bam ERR022486/accepted_hits.bam ERR022488/accepted_hits.bam
Cufflinks command:
cufflinks --output-dir=CUFFLINKS -g Danio_rerio_chr22.Zv9.62.gtf ALL.bam
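The two steps can be chained from a small Python script. This is a sketch under assumptions: it only uses the samtools and cufflinks options shown on these slides, it assumes both programs are on the PATH, and `build_commands` and `run_pipeline` are hypothetical helper names. By default it prints the commands instead of running them.

```python
import subprocess


def build_commands(bam_files, annotation_gtf,
                   merged_bam="ALL.bam", output_dir="CUFFLINKS"):
    """Return the merge and discovery commands as argument lists."""
    merge_cmd = ["samtools", "merge", merged_bam, *bam_files]
    # -g: use the annotation as a guide (RABT assembly), as on the slide.
    cufflinks_cmd = ["cufflinks", f"--output-dir={output_dir}",
                     "-g", annotation_gtf, merged_bam]
    return [merge_cmd, cufflinks_cmd]


def run_pipeline(commands, dry_run=True):
    """Print the commands, or run them in order and stop on the first failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


commands = build_commands(
    ["ERR022486/accepted_hits.bam", "ERR022488/accepted_hits.bam"],
    "Danio_rerio_chr22.Zv9.62.gtf",
)
run_pipeline(commands)  # dry run: only prints the two command lines
```

Passing `dry_run=False` executes the commands; `check=True` makes the script stop if samtools fails, so Cufflinks never runs on a missing or partial ALL.bam.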
Conclusions
RNA-Seq analyses are now performed routinely.
There are still open questions about the best aligner or gene finder, but some tools are now well established as good solutions.
The number of replicates is particularly important when the expression difference between conditions is small.
Pay attention to the correspondence between your library type and the program parameters you use.
Questions ?