Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint work with Marius Nicolae, Serghei Mangul, and Alex Zelikovsky

Post on 19-Dec-2015


Page 1

Estimation of alternative splicing isoform frequencies from RNA-Seq data

Ion Mandoiu
Computer Science and Engineering Department
University of Connecticut

Joint work with Marius Nicolae, Serghei Mangul, and Alex Zelikovsky

Page 2

Outline

• Introduction
• EM Algorithm
• Experimental results
• Conclusions and future work

Page 3

Alternative Splicing

[Griffith and Marra 07]

Page 4

RNA-Seq

[Figure: RNA-Seq workflow on exons A B C D E]

• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads

Mapped reads are used for: Gene Expression (GE), Isoform Discovery (ID), Isoform Expression (IE)

[Figure: exon combinations A B C, A C, and D E]

Page 5

Gene Expression Challenges

• Read ambiguity (multireads)

• What is the gene length?

[Figure: exons A B C D E]

Page 6

Previous approaches to GE

• Ignore multireads
• [Mortazavi et al. 08]
  – Fractionally allocate multireads based on unique read estimates
• [Pasaniuc et al. 10]
  – EM algorithm for solving ambiguities
• Gene length: sum of lengths of exons that appear in at least one isoform
  – Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
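The gene-length definition above (total length of exons that appear in at least one isoform) amounts to merging overlapping exon intervals. A minimal Python sketch, assuming exons are given as half-open (start, end) coordinates; the function name and the toy coordinates are hypothetical and only illustrate the idea:

```python
def union_exon_length(isoforms):
    """Total length of the union of exon intervals over all isoforms.

    `isoforms` is a list of isoforms, each a list of half-open
    (start, end) exon intervals (an assumed input format).
    """
    intervals = sorted(exon for iso in isoforms for exon in iso)
    total = 0
    cur_start, cur_end = None, None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            # Disjoint from the current merged block: close it out.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Toy gene: isoform 1 uses exons A, B, C; isoform 2 skips B.
A, B, C = (0, 100), (200, 350), (500, 600)
gene_length = union_exon_length([[A, B, C], [A, C]])
short_isoform_length = sum(end - start for start, end in [A, C])
```

In this toy gene the union length is larger than the length of the A C isoform, so dividing read counts by the union length lowers the estimate whenever the shorter isoform dominates.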

Page 7

Read Ambiguity in IE

[Figure: exon structures A B C D E and A C]

Page 8

Previous approaches to IE

• [Jiang & Wong 09]
  – Poisson model + importance sampling, single reads
• [Richard et al. 10]
  – EM Algorithm based on Poisson model, single reads in exons
• [Li et al. 10]
  – EM Algorithm, single reads
• [Feng et al. 10]
  – Convex quadratic program, pairs used only for ID
• [Trapnell et al. 10]
  – Extends Jiang’s model to paired reads
  – Fragment length distribution

Page 9

Our contribution

• EM Algorithm for IE
  – Single and/or paired reads
  – Fragment length distribution
  – Strand information
  – Base quality scores

Page 10

Read-Isoform Compatibility

$w_{r,i} = \sum_{a} O_a Q_a F_a$
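The compatibility weight can be read as a sum, over the alignments a of read r to isoform i, of a product of three factors. Treating O_a, Q_a and F_a as the strand, base-quality and fragment-length factors listed under "Our contribution" is an interpretation, and the sketch below makes further assumptions that are not in the slides: a normal fragment length model, Phred-scaled qualities, errors spread evenly over the three other bases, and a hypothetical dictionary format for alignments.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used here as the fragment length model."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def quality_factor(phred_scores, mismatches):
    """Q_a: probability of the observed bases given Phred qualities.

    Assumes error probability 10^(-q/10) per base and that an error
    yields each of the 3 other bases with equal probability.
    """
    prob = 1.0
    for q, is_mismatch in zip(phred_scores, mismatches):
        err = 10.0 ** (-q / 10.0)
        prob *= err / 3.0 if is_mismatch else 1.0 - err
    return prob


def compatibility_weight(alignments, mean=250.0, sd=25.0):
    """w_{r,i} = sum over alignments a of O_a * Q_a * F_a."""
    total = 0.0
    for a in alignments:
        o = 1.0 if a["strand_ok"] else 0.0             # O_a
        q = quality_factor(a["phred"], a["mismatch"])  # Q_a
        f = normal_pdf(a["fragment_len"], mean, sd)    # F_a
        total += o * q * f
    return total


# Two hypothetical alignments of one read to one isoform; the second
# is on the wrong strand and therefore contributes nothing.
example = [
    {"strand_ok": True, "phred": [30, 30], "mismatch": [False, False], "fragment_len": 250},
    {"strand_ok": False, "phred": [30, 30], "mismatch": [False, False], "fragment_len": 240},
]
w = compatibility_weight(example)
```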

Page 11

Fragment length distribution

• Paired reads

• Single reads

[Figure: reads aligned to isoforms A B C and A C, with fragment length distribution plots]

Page 12

IsoEM algorithm

E-step

M-step
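The E-step and M-step formulas on this slide did not survive extraction. The following is a generic EM sketch for this kind of read-to-isoform mixture model, not the IsoEM source code. It assumes precomputed read-isoform compatibility weights, and the length normalization (isoform length minus mean fragment length plus one) is an assumption; all names are hypothetical.

```python
def em_isoform_frequencies(weights, lengths, mean_frag=250.0, iters=200):
    """Generic EM for isoform frequencies (a sketch, not the IsoEM source).

    weights[r] maps isoform index -> compatibility weight of read r
    lengths[i] is the length of isoform i
    """
    n_iso = len(lengths)
    # Assumed normalization: number of start positions for a fragment
    # of mean length on each isoform.
    eff_len = [max(l - mean_frag + 1.0, 1.0) for l in lengths]
    freq = [1.0 / n_iso] * n_iso
    for _ in range(iters):
        # E-step: expected number of reads from each isoform.
        counts = [0.0] * n_iso
        for w in weights:
            denom = sum(w_ri * freq[i] for i, w_ri in w.items())
            if denom == 0.0:
                continue  # read compatible with no expressed isoform
            for i, w_ri in w.items():
                counts[i] += w_ri * freq[i] / denom
        # M-step: length-normalized counts, rescaled to sum to 1.
        per_base = [c / e for c, e in zip(counts, eff_len)]
        total = sum(per_base)
        if total == 0.0:
            break
        freq = [p / total for p in per_base]
    return freq


# Toy data: 30 reads unique to isoform 0, 10 unique to isoform 1,
# and 20 reads equally compatible with both.
toy_weights = [{0: 1.0}] * 30 + [{1: 1.0}] * 10 + [{0: 1.0, 1: 1.0}] * 20
toy_freq = em_isoform_frequencies(toy_weights, lengths=[1000, 1000])
```

With equal isoform lengths, the ambiguous reads in the toy data end up split in proportion to the unique-read evidence, so the estimate approaches frequencies of 0.75 and 0.25.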

Page 13

Simulation setup

• Human genome UCSC known isoforms
• GNFAtlas2 gene expression levels
  – Uniform/geometric expression of gene isoforms
• Normally distributed fragment lengths
  – Mean 250, std. dev. 25
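Two ingredients of this setup can be sketched in a few lines of Python: splitting a gene's expression among its isoforms, and drawing fragment lengths from a normal distribution with mean 250 and standard deviation 25. The geometric ratio of 0.5 is an assumed value (the slide does not give one), and the function names are hypothetical.

```python
import random


def isoform_shares(k, model="uniform", ratio=0.5):
    """Split one gene's expression among its k isoforms.

    `ratio` for the geometric model is an assumed parameter; the slide
    does not state the value used.
    """
    if model == "uniform":
        raw = [1.0] * k
    elif model == "geometric":
        raw = [ratio ** j for j in range(k)]
    else:
        raise ValueError(f"unknown model: {model}")
    total = sum(raw)
    return [x / total for x in raw]


def sample_fragment_length(rng, mean=250.0, sd=25.0):
    """Draw a fragment length from a normal distribution, rounded to whole bases."""
    return max(1, round(rng.gauss(mean, sd)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
shares = isoform_shares(3, model="geometric")
fragment_lengths = [sample_fragment_length(rng) for _ in range(1000)]
```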

[Figure: number of genes (log scale, 1 to 100,000) vs. number of isoforms (0 to 55)]

[Figure: number of isoforms (0 to 25,000) vs. isoform length (log scale, 10 to 100,000)]

Page 14

Accuracy measures

• Error Fraction (EF_t)
  – Percentage of isoforms (or genes) with relative error larger than given threshold t
• Median Percent Error (MPE)
  – Threshold t for which EF is 50%
• r²
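These three measures are straightforward to compute from paired lists of estimated and true frequencies. A minimal Python sketch follows. It assumes that items with a true value of zero are skipped when computing relative error and that r² is the squared Pearson correlation; neither detail is stated on the slide, and the function names are hypothetical.

```python
import math
import statistics


def relative_errors(estimated, true):
    """Relative error |est - true| / true for each item with true > 0."""
    return [abs(e - t) / t for e, t in zip(estimated, true) if t > 0]


def error_fraction(estimated, true, t):
    """EF_t: percentage of items whose relative error exceeds threshold t."""
    errs = relative_errors(estimated, true)
    return 100.0 * sum(err > t for err in errs) / len(errs)


def median_percent_error(estimated, true):
    """MPE: the threshold at which EF is 50%, i.e. the median relative error in percent."""
    return 100.0 * statistics.median(relative_errors(estimated, true))


def r_squared(estimated, true):
    """Square of the Pearson correlation between estimates and truth."""
    mean_e = sum(estimated) / len(estimated)
    mean_t = sum(true) / len(true)
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimated, true))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov * cov / (var_e * var_t)


# Hypothetical toy values, only to show the calls.
true_freq = [10.0, 20.0, 30.0, 40.0]
est_freq = [11.0, 20.0, 24.0, 60.0]
ef15 = error_fraction(est_freq, true_freq, 0.15)
mpe = median_percent_error(est_freq, true_freq)
r2 = r_squared(est_freq, true_freq)
```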

Page 15

Error Fraction Curves - Isoforms

• 30M single reads of length 25

[Figure: % of isoforms over threshold (0 to 100) vs. relative error threshold (0 to 1) for Uniq, Rescue, UniqLN, Cufflinks, RSEM, and IsoEM]

Page 16

Error Fraction Curves - Genes

• 30M single reads of length 25

[Figure: % of genes over threshold (0 to 100) vs. relative error threshold (0 to 1) for Uniq, Rescue, GeneEM, Cufflinks, RSEM, and IsoEM]

Page 17

MPE and EF15 by Gene Frequency

• 30M single reads of length 25

Page 18

Read Length Effect

• Fixed sequencing throughput (750Mb)

[Figure: Median Percent Error (0 to 25) vs. read length (25 to 95) for paired reads and single reads]

[Figure: r² (0.962 to 0.978) vs. read length (25 to 95) for paired reads and single reads]
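Under a fixed throughput, longer reads mean fewer reads. The arithmetic can be sketched as below; at read length 25, 750 Mb corresponds to the 30M single reads used on the earlier slides. Counting both mates of a pair toward the base budget is an assumption, and the function name is hypothetical.

```python
THROUGHPUT_BASES = 750_000_000  # 750 Mb, from the slide


def reads_for_throughput(read_length, paired=False, throughput=THROUGHPUT_BASES):
    """Number of reads (or read pairs) that fit in a fixed base budget.

    Assumes both mates of a pair count toward the budget.
    """
    bases_per_fragment = read_length * (2 if paired else 1)
    return throughput // bases_per_fragment


# Read lengths 25, 35, ..., 95, matching the range on the plot axes.
single_counts = {n: reads_for_throughput(n) for n in range(25, 96, 10)}
paired_counts = {n: reads_for_throughput(n, paired=True) for n in range(25, 96, 10)}
```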

Page 19

Effect of Pairs & Strand Information

• 1-60M 75bp reads

[Figure: r² (0.925 to 0.985) vs. number of reads (0 to 60,000,000) for RandomStrand-Pairs, CodingStrand-Pairs, RandomStrand-Single, and CodingStrand-Single]

Page 20

Validation on Human RNA-Seq Data

• ≈8 million 27bp reads from two cell lines [Sultan et al. 10]
• 47 AEEs measured by qPCR [Richard et al. 10]

[Figure: estimated AE fraction vs. qPCR AE fraction (0% to 100% on both axes), one panel per method]

• POEM: R² = 0.5434
• Cufflinks: R² = 0.4721
• IsoEM: R² = 0.6106

Page 21

Validation on Drosophila RNA-Seq Data

• [McManus et al. 10]

26M 42M 31M 78M Paired-end reads (37bp)

Page 22

Allele Specific Expression in Parental Pool

[Figure: D.Mel. in Parental Pool vs. D.Mel. (log scale), R² = 0.8922]

[Figure: D.Sec. in Parental Pool vs. D.Sec. (log scale), R² = 0.9333]

Page 23

Comparison to Pyrosequencing

[Figure: Log2(M/S) IsoEM vs. Log2(M/S) pyroseq (both axes -2 to 5); series: Hybrid, Linear (Hybrid), Parental Pool; R² = 0.8265 and R² = 0.8966]

Page 24

Runtime scalability

[Figure: CPU seconds (0 to 160) vs. number of fragments (0 to 30 million) for RandomStrand-Pairs, CodingStrand-Pairs, RandomStrand-Single, and CodingStrand-Single]

• Scalability experiments conducted on a Dell PowerEdge R900
  – Four 6-core E7450 Xeon processors at 2.4 GHz, 128 GB of internal memory

Page 25

Conclusions & Future Work

• Presented EM algorithm for estimating isoform/gene expression levels
  – Integrates fragment length distribution, base qualities, pair and strand info
  – Java implementation available at http://dna.engr.uconn.edu/software/IsoEM/
• Ongoing work
  – Correction for library preparation and sequencing biases
    – E.g., random hexamer priming bias [Hansen et al. 10]
  – Comparison of RNA-Seq with DGE
  – Isoform discovery
  – Reconstruction & frequency estimation for virus quasispecies

Page 26

Acknowledgments NSF awards 0546457 & 0916948 to IM and 0916401 to AZ