Transcriptome Assembly and Quantification from Ion Torrent RNA-Seq Data

Alex Zelikovsky

Department of Computer Science, Georgia State University

Joint work with Serghei Mangul, Sahar Al Seesi, Adrian Caciula, Dumitru Brinza, Ion Mandoiu

Advances in Next Generation Sequencing

http://www.economist.com/node/16349358

Roche/454 FLX Titanium: 400-600 million reads/run, 400bp avg. length

Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length

SOLiD 4/5500: 1.4-2.4 billion PE reads/run, 35-50bp read length

Ion Proton Sequencer

RNA-Seq

[Figure: RNA-Seq workflow on a gene with exons A B C D E and transcripts A B C, A C, and D E]

Make cDNA & shatter into fragments

Sequence fragment ends

Map reads

Downstream analyses: Gene Expression, Transcriptome Reconstruction, Isoform Expression

Transcriptome Assembly

• Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.

Transcriptome Assembly Types

• Genome-independent reconstruction (de novo)
  – de Bruijn k-mer graph

• Genome-guided reconstruction (ab initio)
  – Spliced read mapping
  – Exon identification
  – Splice graph

• Annotation-guided reconstruction
  – Use existing annotation (known transcripts)
  – Focus on discovering novel transcripts
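The de Bruijn k-mer graph used by de novo assemblers can be illustrated with a minimal sketch. The function name `de_bruijn_graph` and the toy reads are assumptions made for illustration; real assemblers add error correction, coverage tracking, and graph simplification that are not shown here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers.

    Hypothetical helper for illustration only, not the data structure of any
    particular assembler.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-prefix to its (k-1)-suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Toy reads (made up); overlapping reads share nodes and edges in the graph.
toy_reads = ["ACGTC", "CGTCA"]
toy_graph = de_bruijn_graph(toy_reads, k=3)
```

Transcripts then correspond to paths through this graph, which is why no reference genome is needed.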

Previous approaches

• Genome-independent reconstruction
  – Trinity (2011), Velvet (2008), TransABySS (2008)

• Genome-guided reconstruction
  – Scripture (2010): reports “all” transcripts
  – Cufflinks (2010), IsoLasso (2011), SLIDE (2012), CLIIQ (2012), TRIP (2012), Traph (2013): minimize the set of transcripts explaining the reads

• Annotation-guided reconstruction
  – RABT (2011), DRUT (2011)

Gene representation

• Pseudo-exons - regions of a gene between consecutive transcriptional or splicing events

• Gene - set of non-overlapping pseudo-exons

[Figure: a gene with exons e1 to e6 split into pseudo-exons pse1 to pse7, each delimited by a start (Spse) and an end (Epse) coordinate; transcripts Tr1, Tr2, and Tr3 are drawn as combinations of pseudo-exons]
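The pseudo-exon definition can be made concrete with a short sketch. It assumes each transcript is given as a list of half-open `(start, end)` exon intervals; the function name `pseudo_exons` and the coordinates are hypothetical and chosen only for illustration.

```python
def pseudo_exons(transcripts):
    """Split a gene into pseudo-exons.

    Every exon start or end is treated as a transcriptional or splicing
    event. A pseudo-exon is a region between two consecutive events that is
    covered by at least one exon. Intervals are half-open (start, end).
    """
    exons = [exon for transcript in transcripts for exon in transcript]
    boundaries = sorted({pos for exon in exons for pos in exon})
    segments = []
    for start, end in zip(boundaries, boundaries[1:]):
        # Keep the segment only if some exon contains it (skip pure introns).
        if any(s <= start and end <= e for s, e in exons):
            segments.append((start, end))
    return segments


# Toy gene with three transcripts (made-up coordinates).
toy_transcripts = [
    [(0, 100), (200, 300)],
    [(0, 100), (250, 300)],
    [(0, 150), (200, 300)],
]
toy_pseudo_exons = pseudo_exons(toy_transcripts)
```

By construction the resulting pseudo-exons do not overlap, so each transcript can be written as a subset of them.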

Splice Graph

[Figure: genome with pseudo-exons 1 to 9 between the transcription start site (TSS) and the transcription end site (TES)]

• Map the RNA-Seq reads to genome

• Construct Splice Graph - G(V,E)
  – V: exons
  – E: splicing events

• Candidate transcripts
  – depth-first-search (DFS)

• Select candidate transcripts
  – IsoEM
  – greedy algorithm
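Candidate enumeration by depth-first search can be sketched as follows. This assumes the splice graph is acyclic and stored as an adjacency dictionary, with TSS nodes as sources and TES nodes as sinks; the function name `enumerate_transcripts` and the toy graph are hypothetical.

```python
def enumerate_transcripts(edges, sources, sinks):
    """List all source-to-sink paths of an acyclic splice graph via DFS.

    edges:   dict mapping an exon id to the exon ids reachable by one
             splicing event
    sources: exons that can start a transcript (TSS)
    sinks:   exons that can end a transcript (TES)
    """
    paths = []

    def dfs(node, path):
        if node in sinks:
            paths.append(tuple(path))
        # Keep extending: a sink may still have outgoing splicing events.
        for successor in edges.get(node, ()):
            dfs(successor, path + [successor])

    for source in sources:
        dfs(source, [source])
    return paths


# Toy splice graph: exon 2 can be skipped (made-up example).
toy_edges = {1: [2, 3], 2: [3], 3: [4]}
toy_candidates = enumerate_transcripts(toy_edges, sources=[1], sinks={4})
```

The number of paths can grow exponentially with the number of alternative events, which is one reason a selection step follows the enumeration.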

MaLTA: Maximum Likelihood Transcriptome Assembly

How to select?

• Select the smallest set of candidate transcripts covering all transcript variants

Transcript: set of transcript variants

Sharmistha Pal, Ravi Gupta, Hyunsoo Kim, et al., Alternative transcription exceeds alternative splicing in generating the transcriptome diversity of cerebellar development, Genome Res. 2011 21: 1260-1272

Transcript variants: alternative first exon, alternative last exon, exon skipping, intron retention, alternative 5' splice junction, alternative 3' splice junction

IsoEM: Isoform Expression Level Estimation

• Expectation-Maximization algorithm

• Unified probabilistic model incorporating
  – Single and/or paired reads
  – Fragment length distribution
  – Strand information
  – Base quality scores
  – Repeat and hexamer bias correction

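Read-isoform compatibility graph, with edge weights

$w_{r,i} = \sum_{a} O_a Q_a F_a$

where the sum runs over the alignments $a$ of read $r$ to isoform $i$, and $O_a$, $Q_a$, $F_a$ account for the strand orientation, base quality scores, and fragment length of alignment $a$.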

[Figure: fragment length distribution; reads aligned to isoforms A B C and A C imply different fragment lengths i and j, scored by F_a(i) and F_a(j)]
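The expectation-maximization idea can be illustrated with a simplified sketch. It takes read-isoform compatibility weights as given and omits isoform length normalization and the bias corrections listed above, so it should be read as a toy version rather than the IsoEM implementation; the function name and the toy weights are assumptions.

```python
def em_isoform_frequencies(weights, n_isoforms, iterations=100):
    """Estimate isoform frequencies from read-isoform compatibility weights.

    weights: one dict per read, mapping isoform index -> weight w[r][i]
    Simplified toy EM: no isoform length normalization, no bias correction.
    """
    freqs = [1.0 / n_isoforms] * n_isoforms
    for _ in range(iterations):
        counts = [0.0] * n_isoforms
        # E-step: split each read among compatible isoforms in proportion
        # to (current frequency) x (compatibility weight).
        for read_weights in weights:
            total = sum(freqs[i] * w for i, w in read_weights.items())
            if total == 0:
                continue  # read is incompatible with every isoform
            for i, w in read_weights.items():
                counts[i] += freqs[i] * w / total
        assigned = sum(counts)
        if assigned == 0:
            break
        # M-step: new frequencies are the normalized expected read counts.
        freqs = [c / assigned for c in counts]
    return freqs


# Toy data (made up): three reads unique to isoform 0, one ambiguous read.
toy_weights = [{0: 1.0}, {0: 1.0}, {0: 1.0}, {0: 1.0, 1: 1.0}]
toy_freqs = em_isoform_frequencies(toy_weights, n_isoforms=2)
```

In the toy data the ambiguous read is gradually attributed to isoform 0, because the unique reads make that isoform the more plausible source.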

Greedy algorithm

1. Sort transcripts by inferred IsoEM expression levels in decreasing order

2. Traverse transcripts
  – Select a transcript if it contains a novel transcript variant
  – Continue traversing until all transcript variants are covered
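The two steps above can be sketched in a few lines. The sketch assumes expression levels are already estimated and that each candidate transcript comes with the set of transcript variants it contains; the names `greedy_select`, `toy_expression`, and `toy_variants`, and all values, are made up for illustration.

```python
def greedy_select(expression, variants):
    """Greedy transcript selection.

    expression: dict transcript id -> estimated expression level
    variants:   dict transcript id -> set of transcript variants it contains
    Returns the selected transcript ids in order of selection.
    """
    uncovered = set().union(*variants.values())
    selected = []
    # Step 1: visit transcripts from highest to lowest expression.
    for transcript in sorted(expression, key=expression.get, reverse=True):
        if not uncovered:
            break  # STOP: all transcript variants are covered
        # Step 2: keep the transcript only if it adds a novel variant.
        if variants[transcript] & uncovered:
            selected.append(transcript)
            uncovered -= variants[transcript]
    return selected


# Toy input (made-up expression levels and variant labels).
toy_expression = {"T1": 10.0, "T2": 7.5, "T3": 3.0, "T4": 1.0}
toy_variants = {
    "T1": {"v1", "v2"},
    "T2": {"v2"},
    "T3": {"v2", "v3"},
    "T4": {"v1", "v3"},
}
toy_selection = greedy_select(toy_expression, toy_variants)
```

Here T2 is skipped because its only variant is already covered by T1, and the traversal stops before T4 once T3 covers the last variant.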

Greedy algorithm: illustration

[Animation: transcripts sorted by expression levels are traversed in order and checked against the list of transcript variants]

STOP. All transcript variants are covered.

MaLTA results on GOG-350 dataset

• 4.5M single Ion reads with average read length 121 bp, aligned using TopHat2

• Number of assembled transcripts
  – MaLTA: 15385
  – Cufflinks: 17378

• Number of transcripts matching annotations
  – MaLTA: 4555 (26%)
  – Cufflinks: 2031 (13%)

Expression Estimation on Ion Torrent reads

[Figure: bar chart of R² for IsoEM/Cufflinks estimates vs qPCR, with bars for IsoEM HBR, Cufflinks HBR, IsoEM UHR, and Cufflinks UHR; y-axis from 0 to 0.8]

• Squared correlation
  – IsoEM / Cufflinks FPKMs vs qPCR values for 800 genes
  – 2 MAQC samples: Human Brain Reference (HBR) and Universal Human Reference (UHR)
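For reference, the squared correlation between two sets of estimates can be computed as below. This is a generic sketch of the squared Pearson correlation; whether the values were transformed (for example log-scaled) before comparison is not stated here, and the numbers in the example are made up, not results.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation (R^2) between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance * covariance / (variance_x * variance_y)


# Made-up values for four genes, for illustration only.
toy_fpkm = [1.0, 2.0, 3.0, 4.0]
toy_qpcr = [2.1, 3.9, 6.2, 7.8]
toy_r2 = squared_correlation(toy_fpkm, toy_qpcr)
```

A value close to 1 means the estimates track the qPCR measurements almost linearly.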

Conclusions

• Novel method for transcriptome assembly

• Validated on Ion Torrent RNA-Seq data

• Compared with Cufflinks:
  – similar number of assembled transcripts
  – 2x more previously annotated transcripts

• Transcript quantification is useful for transcript assembly. Better quantification?
