methods for analysis of rna-seq data.rivals/shd-2010/speech/richard... · estimating count...
TRANSCRIPT
![Page 1: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/1.jpg)
Methods for Analysis of RNA-Seq data.
Hugues Richard
![Page 2: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/2.jpg)
RNA-Seq protocol (no strand information)
M. Schulz©
![Page 3: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/3.jpg)
Biological Experiment
“bag of transcript positions”
TTGGGAAAG
Extraction of Poly A mRNA, ds cDNA synthesis, amplification
Total RNANucleus
Cytosol
GTGAAAAAA
ATGAAGAAAAA
ATGGAAAAA
TTGTATAAAAA
TAGGATAAAAA
TAGAAGAAAAA
Variability of the counts (sampling)influenced by:
_ region length_ copy number
![Page 4: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/4.jpg)
Aggregate counts on exons/genes
Normalize by number of possible hits (RPKM)
Mapping the reads
Sultan*, Schulz*, Richard* et al. Science 2008
Mortazavi*, Williams* et al. Nat Methods 2008
Wang*, Sandberg* et al. Nature 2008
Cloonan*, Forest*, Kolle* et al. Nat Methods 2008
![Page 5: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/5.jpg)
Outline
Estimating depth of sequencing. Do we see all the transcriptional units ?
Infering new events: Detection of new transcriptional units. Detection of Alternative Splicing Events.
RGASP competition: Transcriptome assembly with Oases
(Schulz/Zerbino) Reads remapping with RazerS (Weese)
![Page 6: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/6.jpg)
Transcriptome vs genome assembly
• RNA-Seq reads are distributed according to transcript expression levels.
# of non observed genes ?
![Page 7: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/7.jpg)
What is current coverage ?Total number of transcripts :
Count of a transcriptional unit i:
Unspecified counts:
If we estimate g, then:
Expected number of new discoveries with more lanes
(Fisher & Corbett 43)
![Page 8: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/8.jpg)
Estimating count frequency law
Penalized Non Parametric Maximum Likelihood method (Wang & Lindsay 05)
Freq
uenc
y
Counts in transcriptional unit
2 lanes
![Page 9: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/9.jpg)
Dynamic range
Merged
Micro arrays
Number of mapped reads
Num
ber o
f tra
nscr
iptio
nal u
nits
~ 4 Mio reads/replicate
![Page 10: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/10.jpg)
Outline
Estimating depth of sequencing. do we see all the genes ?
Infering new events: Detection of new transcriptional units. Alternative Splicing events detection.
RGASP competition: Transcriptome assembly with Oases
(Schulz/Zerbino) Mapping with RazerS (Weese)
![Page 11: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/11.jpg)
Alternative Exon Events (AEEs)
Ladd and Cooper 2002
![Page 12: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/12.jpg)
Splice junctions
Align unmatched reads to a set of artificially created junctions
Use a splice junction aligner (Tophat, QPalma, GEMM)
![Page 13: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/13.jpg)
Detecting AEE from Exons Expression
Validations on 2 datasets (HEK and B cells)
Richard*, Schulz*, Sultan* et al. NAR 2010
![Page 14: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/14.jpg)
Within cell AS : CASI
61 events tested35 validated
![Page 15: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/15.jpg)
1000 simulations of alternative exon eventsof one gene with 6 exons and 300 reads in total
Robustness (simulation)
![Page 16: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/16.jpg)
Bootstrap:
_ Randomly alter exon boundaries
_ Monitor changes in prediction
Robustness (bootstrap)
(500 repetitions)
![Page 17: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/17.jpg)
Alternative Polyadenylation in HIP2+
+Sandberg et al. Science, 2008
![Page 18: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/18.jpg)
Detecting AEE from Exons Expression
![Page 19: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/19.jpg)
Isoform quantification: POEM
We observe only the counts falling
within each subexon
Count for each isoform j
binning within every exon at random :
![Page 20: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/20.jpg)
Isoform quantification: POEM
Comparison to:_ qPCR (47 events)_ Estimate from junction counts
(267 events)grey - POEMblack - qPCR
![Page 21: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/21.jpg)
PCC: 0.65 0.81
Validation
Junction reads qRT-PCR
![Page 22: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/22.jpg)
_ produced 4 replicates Exon Array hybridizations for each cell line_ based on ENSEMBL 25% more exona are detected by RNA-Seq
Comparison to Exon arrays
![Page 23: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/23.jpg)
Outline
Coverage for count based problems. do we see all the genes ?
Infering new events: Detection of new transcriptional units. Alternative Splicing events detection.
RGASP competition: Transcriptome assembly with Oases
(Schulz/Zerbino) Mapping with RazerS (Weese)
![Page 24: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/24.jpg)
Transcripts assembly
AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGAGG TTAGATC ATGAGGC GAGACAG GAGGCTC GTCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAATAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACATAGTCGA GCTTTAG TCCGATG GCTTTAG TCGATTGC GATCCGA GAGGCTT AGAGACATAGTCGA TTAGATC GATGAGG TTTAGAG GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT GTCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATA ATGAGGC AGAGACA GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA
Imagine two transcripts:TAGTCGAG GCTT TAGAGACAGTAGTCGAG TCCGA TAGAGACAG
Assemble reads into contigs
Oases:_De Bruijn graph_Velvet framework
Schulz, Zerbino et al (in preparation)
![Page 25: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/25.jpg)
De Novo transcripts assembly
Advantages: No reference or bad quality genome Cancer transcriptomics (genes fusion) Micro exons
Assembly specific challenges: sequencing errors are hard to rescue differentiation of (post-)transcriptional modifications like alt. splicing, alt.
polyadenylation, alt. first exon or trans-splicing judgement of assembly quality without reference
paralogous domain genes
![Page 26: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/26.jpg)
Workflow for RGASP
![Page 27: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/27.jpg)
From contigs to transcripts
weighted contig graph reconstruction hard (Lacroix et al. WABI 2008)
iterative maximum likelihood reconstruction method (heuristic) (Lee 2003)
43
14
38
21
41
17
40
7
![Page 28: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/28.jpg)
From contigs to transcripts
weighted contig graph reconstruction hard (Lacroix et al. WABI 2008)
iterative maximum likelihood reconstruction method (heuristic) (Lee 2003)
43
14
38
21
41
17
40
7
![Page 29: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/29.jpg)
From contigs to transcripts
weighted contig graph reconstruction hard (Lacroix et al. WABI 2008)
incorporation of paired-end reads (transitive reduction)
43
14
38
21
41
17
40
7
52
25
![Page 30: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/30.jpg)
From contigs to transcripts
weighted contig graph reconstruction hard (Lacroix et al. WABI 2008)
incorporation of paired-end reads (transitive reduction)
43
14
38
21
41
17
40
7
52
25
![Page 31: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/31.jpg)
Map Reads onto Transcripts
map all reads onto assembled transcripts
allow for mismatches, indels
record all (multiple) matches
- RazerS - Fast Read Mapping with Sensitivity Control (Weese et al. 2009)
![Page 32: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/32.jpg)
Compute Coverage for Contig Tuples for every read match find tuples of covered contigs
count for every tuple the number of covering reads reweight multiple and paired-end matches
A
B
C
D
E
F
G
A B D E G
A C D F G
![Page 33: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/33.jpg)
Perspectives
Estimating the depth of sequencing Estimates for the total number of transcripts
Infering new events Automatic correction of experimental biases
Stay tuned for RGASP results
![Page 34: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/34.jpg)
Acknowledgements
MPIMG (Berlin)
Marcel Schulz
Marc Sultan
Alon MagenTatjana BorodinaAleksey SoldatovDmitri ParkhomchukStefan HaasMartin VingronHans LehrachMarie-Laure Yaspo
RGASP pipeline:
Marcel SchulzDaniel Zerbino (UCSC, former EBI)David Weese (FU Berlin)
Ewan Birney (EBI)Knut Reinert (FU Berlin)Martin Vingron (MPIMG)
Now at Cordeliers Come and visit !
![Page 35: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/35.jpg)
![Page 36: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/36.jpg)
Results on Drosophila Paired-end reads
read length=36k=21fragment length=200
#S2-DRSC Paired-reads
# loci # transcripts median length
mapped to genome
Overlapping ENSEMBL transcripts
#identified exons
43,836,085* 32,905 53,299 214 89% 61% 72,892
*high quality reads
human, fly, and worm predictions submitted to RGASP competition
![Page 37: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/37.jpg)
exon based index is a robust z-score estimated with medianand median absolut deviation
Statistical tests and assumptions
![Page 38: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/38.jpg)
CASI complements splice junctions
Internal3 prime5 prime
![Page 39: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/39.jpg)
Examplesunannotated gene with 7 exons (locus 136)
transcript connecting 2 genes (locus 2589)
assembled transcript
![Page 40: Methods for Analysis of RNA-Seq data.rivals/SHD-2010/SPEECH/Richard... · Estimating count frequency law ... reconstruction hard (Lacroix et al. WABI 2008) ... fragment length=200](https://reader033.vdocuments.net/reader033/viewer/2022052608/5ae873557f8b9a9e5d90937a/html5/thumbnails/40.jpg)
Exons Arrays vs RNA-Seq