Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data
Alex Zelikovsky Department of Computer Science
Georgia State University
Joint work with Serghei Mangul, Sahar Al Seesi, Adrian Caciula, Dumitru Brinza, Ion Mandoiu
Advances in Next Generation Sequencing
http://www.economist.com/node/16349358
Roche/454 FLX Titanium: 400-600 million reads/run, 400bp avg. length
Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length
SOLiD 4/5500: 1.4-2.4 billion PE reads/run, 35-50bp read length
Ion Proton Sequencer
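RNA-Seq
Figure: RNA-Seq workflow for a gene with exons A B C D E. Make cDNA & shatter into fragments, sequence fragment ends, map reads. The mapped reads feed three analyses: gene expression, transcriptome reconstruction, and isoform expression. Example isoforms: A B C; A C; D E.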
Transcriptome Assembly
• Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.
Transcriptome Assembly Types
• Genome-independent reconstruction (de novo)
  – de Bruijn k-mer graph
• Genome-guided reconstruction (ab initio)
  – Spliced read mapping
  – Exon identification
  – Splice graph
• Annotation-guided reconstruction
  – Use existing annotation (known transcripts)
  – Focus on discovering novel transcripts
Previous approaches
• Genome-independent reconstruction
  – Trinity (2011), Velvet (2008), TransABySS (2008)
• Genome-guided reconstruction
  – Scripture (2010): reports "all" transcripts
  – Cufflinks (2010), IsoLasso (2011), SLIDE (2012), CLIIQ (2012), TRIP (2012), Traph (2013): minimize the set of transcripts explaining the reads
• Annotation-guided reconstruction
  – RABT (2011), DRUT (2011)
Gene representation
• Pseudo-exons - regions of a gene between consecutive transcriptional or splicing events
• Gene - set of non-overlapping pseudo-exons
Figure: three transcripts (Tr1, Tr2, Tr3) built from exons e1 to e6 are cut at every exon boundary into pseudo-exons pse1 to pse7; Spse_k and Epse_k mark the start and end of pseudo-exon k.
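The pseudo-exon definition above can be illustrated with a minimal Python sketch. It assumes each transcript is given as a list of half-open exon intervals `(start, end)` on the genome; the function name `pseudo_exons` and the coordinates are hypothetical and are not taken from the slides.

```python
def pseudo_exons(transcripts):
    """Split the exonic part of a gene at every exon boundary.

    transcripts: dict mapping a transcript name to a list of
    half-open exon intervals (start, end). Hypothetical input format.
    Returns the sorted list of non-overlapping pseudo-exons.
    """
    all_exons = [iv for exons in transcripts.values() for iv in exons]
    # Every exon start or end is a transcriptional or splicing event.
    boundaries = sorted({b for start, end in all_exons for b in (start, end)})
    result = []
    for left, right in zip(boundaries, boundaries[1:]):
        # Keep a segment only if some exon fully contains it (skip pure introns).
        if any(start <= left and right <= end for start, end in all_exons):
            result.append((left, right))
    return result


# Illustrative coordinates only.
example = {
    "Tr1": [(0, 10), (20, 30)],
    "Tr2": [(0, 15), (20, 30)],
    "Tr3": [(5, 10), (20, 25)],
}
print(pseudo_exons(example))
```

Each transcript can then be written as a sequence of pseudo-exons, which is the representation a splice graph is built on.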
Splice Graph
Figure: genome with pseudo-exons 1 to 9 lying between the TSS and the TES.
• Map the RNA-Seq reads to the genome
• Construct the splice graph G(V,E)
  – V: exons
  – E: splicing events
• Candidate transcripts
  – depth-first search (DFS)
• Select candidate transcripts
  – IsoEM
  – greedy algorithm
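The graph construction and DFS enumeration steps above can be sketched as follows. This is a simplified illustration, not the MaLTA implementation: it assumes spliced read alignments have already been reduced to ordered lists of pseudo-exon indices, and the names `build_splice_graph` and `enumerate_candidates` are hypothetical.

```python
from collections import defaultdict


def build_splice_graph(spliced_paths):
    """Add an edge u -> v for every pair of consecutive pseudo-exons in a read path."""
    graph = defaultdict(set)
    for path in spliced_paths:
        for u, v in zip(path, path[1:]):
            graph[u].add(v)
    return graph


def enumerate_candidates(graph, sources, sinks):
    """List every source-to-sink path by depth-first search.

    Assumes edges always go from a lower to a higher genomic position,
    so the graph is acyclic and the recursion terminates.
    """
    sinks = set(sinks)
    candidates = []

    def dfs(node, path):
        if node in sinks:
            candidates.append(tuple(path))
        for nxt in sorted(graph.get(node, ())):
            path.append(nxt)
            dfs(nxt, path)
            path.pop()

    for source in sorted(sources):
        dfs(source, [source])
    return candidates


# Illustrative read paths over pseudo-exons 1..4.
graph = build_splice_graph([[1, 2, 4], [1, 3, 4], [2, 3]])
print(enumerate_candidates(graph, sources=[1], sinks=[4]))
# prints [(1, 2, 3, 4), (1, 2, 4), (1, 3, 4)]
```

The number of paths can grow exponentially with the number of splicing events, which is why a selection step follows candidate enumeration.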
MaLTA: Maximum Likelihood Transcriptome Assembly
How to select?
• Select the smallest set of candidate transcripts covering all transcript variants
• Transcript: a set of transcript variants
Sharmistha Pal, Ravi Gupta, Hyunsoo Kim, et al., Alternative transcription exceeds alternative splicing in generating the transcriptome diversity of cerebellar development, Genome Res. 2011 21: 1260-1272
Transcript variants: alternative first exon, alternative last exon, exon skipping, intron retention, alternative 5' splice junction, alternative 3' splice junction, splice junction
IsoEM: Isoform Expression Level Estimation
• Expectation-Maximization algorithm
• Unified probabilistic model incorporating
  – Single and/or paired reads
  – Fragment length distribution
  – Strand information
  – Base quality scores
  – Repeat and hexamer bias correction
Read-isoform compatibility graph with edge weights $w_{r,i} = \sum_a O_a Q_a F_a$, summed over the alignments $a$ of read $r$ to isoform $i$
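Fragment length distribution
Figure: the same reads aligned to isoforms A B C and A C imply different fragment lengths i and j, with probabilities $F_a(i)$ and $F_a(j)$ read off the fragment length distribution.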
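The EM idea behind the read-isoform compatibility graph can be sketched in a few lines of Python. This is a simplified illustration, not the IsoEM code: the per-alignment factors are assumed to correspond to the orientation, base quality, and fragment length terms listed above, the estimate returned is the fraction of reads per isoform without length normalization, and the names `edge_weight` and `em_isoform_fractions` are hypothetical.

```python
def edge_weight(alignments):
    """Weight of a read-isoform edge: sum over alignments of O_a * Q_a * F_a.

    alignments: list of (o, q, f) tuples, one per alignment (assumed format).
    """
    return sum(o * q * f for o, q, f in alignments)


def em_isoform_fractions(read_weights, n_isoforms, n_iter=200, tol=1e-12):
    """Estimate the fraction of reads originating from each isoform.

    read_weights: one dict per read, mapping isoform index -> edge weight.
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among compatible isoforms.
        for weights in read_weights:
            total = sum(w * theta[i] for i, w in weights.items())
            if total == 0.0:
                continue  # read is incompatible with all isoforms
            for i, w in weights.items():
                counts[i] += w * theta[i] / total
        assigned = sum(counts)
        if assigned == 0.0:
            break  # nothing to estimate from
        # M-step: renormalize expected counts.
        new_theta = [c / assigned for c in counts]
        change = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if change < tol:
            break
    return theta


# Six reads unique to isoform 0, two unique to isoform 1, two shared.
reads = [{0: 1.0}] * 6 + [{1: 1.0}] * 2 + [{0: 1.0, 1: 1.0}] * 2
print(em_isoform_fractions(reads, 2))
```

In this toy input the shared reads are split in proportion to the current estimates, so the fractions settle near 0.75 and 0.25.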
Greedy algorithm
1. Sort transcripts by inferred IsoEM expression levels in decreasing order
2. Traverse transcripts
  – Select a transcript if it contains a novel transcript variant
  – Continue traversing until all transcript variants are covered
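The two steps above can be sketched in Python as follows. The sketch assumes each candidate transcript is described by the set of transcript variants it contains and by an expression estimate; the function name `greedy_select` and the toy data are hypothetical.

```python
def greedy_select(candidates, expression):
    """Pick transcripts in decreasing expression order until all variants are covered.

    candidates: dict transcript name -> set of transcript variants it contains.
    expression: dict transcript name -> estimated expression level.
    """
    uncovered = set().union(*candidates.values())
    selected = []
    # Step 1: sort by expression level, highest first.
    for name in sorted(candidates, key=lambda t: expression[t], reverse=True):
        if not uncovered:
            break  # all transcript variants are covered
        # Step 2: keep the transcript only if it adds a novel variant.
        novel = candidates[name] & uncovered
        if novel:
            selected.append(name)
            uncovered -= novel
    return selected


candidates = {"T1": {"a", "b"}, "T2": {"b"}, "T3": {"c"}, "T4": {"a", "c"}}
expression = {"T1": 10.0, "T2": 8.0, "T3": 5.0, "T4": 1.0}
print(greedy_select(candidates, expression))  # prints ['T1', 'T3']
```

Ordering by expression makes this a heuristic: it favors well-supported transcripts and is not guaranteed to return the smallest possible covering set.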
Greedy algorithm
Figure (animation): transcripts sorted by expression levels, shown against the transcript variants they cover.
STOP. All transcript variants are covered.
MaLTA results on GOG-350 dataset
• 4.5M single Ion reads with average read length 121 bp, aligned using TopHat2
• Number of assembled transcripts
  – MaLTA: 15385
  – Cufflinks: 17378
• Number of transcripts matching annotations
  – MaLTA: 4555 (26%)
  – Cufflinks: 2031 (13%)
Expression Estimation on Ion Torrent reads
Figure: bar chart of R2 for IsoEM/Cufflinks estimates vs qPCR. Bars: IsoEM HBR, Cufflinks HBR, IsoEM UHR, Cufflinks UHR. Vertical axis from 0 to 0.8.
• Squared correlation
  – IsoEM / Cufflinks FPKMs vs qPCR values for 800 genes
  – 2 MAQC samples: Human Brain (HBR) and Universal (UHR)
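For reference, a squared Pearson correlation between expression estimates and qPCR values can be computed as in the sketch below. The slides do not say whether values were log-transformed or otherwise filtered first, so the sketch works on the values as given; the function name `r_squared` and the numbers are illustrative only.

```python
def r_squared(estimates, reference):
    """Squared Pearson correlation between two equal-length lists of numbers."""
    n = len(estimates)
    mean_x = sum(estimates) / n
    mean_y = sum(reference) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(estimates, reference))
    sxx = sum((x - mean_x) ** 2 for x in estimates)
    syy = sum((y - mean_y) ** 2 for y in reference)
    return (sxy * sxy) / (sxx * syy)


# Made-up FPKM and qPCR values, not data from the talk.
fpkm = [1.0, 2.0, 3.0, 4.0]
qpcr = [2.0, 4.0, 6.0, 8.0]
print(r_squared(fpkm, qpcr))
```

A value near 1 means the estimates track the qPCR measurements linearly; the function is undefined when either list is constant.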
Conclusions
• Novel method for transcriptome assembly
• Validated on Ion Torrent RNA-Seq data
• Compared with Cufflinks:
  – similar number of assembled transcripts
  – 2x more previously annotated transcripts
• Transcript quantification is useful for transcript assembly. Better quantification?