nm11

1
IsoEM vs. DGE-EM DGE-EM Observed data: tag classes= tagsand their m atchingsitesin the isoform s foreverytag, m atchingsites are w eighted based on tag qualityscores( Q ) Hidden variables: n(i,s) = num beroftags com ingfrom site sin isoform i Param eters: f i = frequencyofisoform i , for all i IsoEM Observed data: read classes= readsand their m atchingpositionsin the isoform s foreveryread, m atchingpositions are w eighted based on read quality scores( Q ), read orientation ( O )and fragm entlength probability distribution ( F) Hidden variables: n(i) = num berofreadscom ing from isoform i Param eters: f i = frequencyofisoform i , forall i Motivation M assivelyparallel sequencing isquicklyreplacing m icroarraysasthe technologyofchoice forperform inggene expression profiling Tw o m ain protocolsproposed in the literature: RNA-Seq: generatesshort(single orpaired)sequencing tagsfrom the endsof random lygenerated cDNA fragm ents. Digital Gene Expression (DGE) [1], orhigh-throughputsequencing based Serial AnalysisofGene Expression (SAGE-Seq)[2]: generatessingle cDNA tagsusing an assay including asm ain stepstranscriptcapture and cDNA synthesisusing oligo(dT)beads, cDNA cleavage w ith an anchoring enzym e, and release of cDNA tagsusing a tagging enzym e w hose recognition site isligated upstream of the recognition site ofthe anchoring enzym e W e presentresultsofa sim ulation studycom paring the accuracy achieved bythe tw o protocolsbased on state-of-the-artinference algorithm susing the expectationm axim ization (EM )fram ework In Silico Experim ental Setup UCSC hum an genom e hg19 5 tissuesfrom the GNFAtlas2 table Rna-Seq: 1-30 m illion reads lengths14-100 bp norm al fragm entlength distribution N(250, 25) Algorithm : IsoEM [3] DGE: 1-30 m illion tags lengths14-26 bp cutting enzym esfrom the Restriction Enzym e Database: DpnII[1]w ith recognition site GATC NlaIII[2]w ith recognition site CATG CviJIw ith degenerate recognition site RGCY (R=G orA, Y=C orT) Algorithm :DGE-EM DGE-EM algorithm E-step M -step RNA-Seq 14 18 21 26 36 50 75 100 0 5 10 15 20 25 1000000 5000000 10000000 15000000 30000000 Read Length M PE #Reads 0-5 5-10 10-15 15-20 20-25 M edian PercentError(M PE) Isoform MPE % Isof. Cut Gene M PE % Genes Cut Protocol Avg. S.D. Avg. S.D. RNA-Seq 20.6 0.14 N/A 8.3 0.17 N/A DGE Full digest p=1 GATC 30.1 0.09 95% 4.7 0.27 95% CATG 25.8 0.07 97% 2.9 0.08 98% RGCY 24.1 0.08 98% 2.5 0.05 99.9% Partial digest p=0.5 GATC 24.4 0.22 95% 3.7 0.17 95% CATG 22.8 0.19 97% 2.5 0.05 98% RGCY 23.9 0.11 98% 2.3 0.06 99.9% DGE enzym e CATG p=.5 1000000 5000000 10000000 15000000 30000000 0 2 4 6 8 10 12 14 14 18 21 24 26 #Tags M PE TagLength 0-2 2-4 4-6 6-8 8-10 10-12 12-14 DGE enzym e RGCY p=.5 1000000 5000000 10000000 15000000 30000000 0 2 4 6 8 10 12 14 14 18 21 24 26 #Tags M PE TagLength 0-2 2-4 4-6 6-8 8-10 10-12 12-14 DGE enzym e GATC p=.5 1000000 5000000 10000000 15000000 30000000 0 2 4 6 8 10 12 14 16 14 18 21 24 26 #Tags M PE TagLength 0-2 2-4 4-6 6-8 8-10 10-12 12-14 14-16 Tag-Isoform Com patibility DGE C leave w ith anchoring enzym e (A E) A ttach prim er for tagging enzym e (TE) C leave w ith tagging enzym e AAAAA AAAAA CATG AE TCCRAC AAAAA CATG AE TE CATG M ap tags A B C D E G ene Expression (G E) Isoform Expression (IE) RNA-Seq A B C D E M ake cD N A & shatter into fragm ents Sequence fragm ent ends M ap reads G ene Expression (G E) Isoform Expression (IE) IsoEM algorithm E-step M -step Conclusions Forisoform expression inference, RNA-Seq yieldsslightly betteraccuracy com pared to DGE, w hereasforgene expression DGE hassignificantly better accuracy w hen enough isoform sare cutbythe restriction enzym e used. DGE accuracy im provesw ith the percentage ofisoform scutand isbetterfor partial digestion com pared to com plete digestion. Using enzym esw ith degenerate recognition sites, such asCviJI, yieldsbetter accuracy than using enzym esfrom published studies ACKNOW LEDGEM ENTS Thisw ork w assupported in partbyNSF aw ardsIIS-0546457 and IIS-0916948 REFERENCES [ 1] Y. Asm ann,E. Klee, E. A. Thom pson, E. Perez, S. M iddha,A. Oberg,T. Therneau,D. Sm ith, G. Poland, E. W ieben, and J.-P. Kocher, “3’ tag digitalgene expression profiling ofhum an brain and universalreference RNA using Illum ina Genom e Analyzer,” BMC Genomics, vol. 10, no. 1, p. 531, 2009. [2] Z. J. W u, C. A. M eyer, S. Choudhury,M . Shipitsin,R. M aruyam a,M . Bessarabova, T. Nikolskaya,S. Sukum ar,A. Schw artzm an, J. S. Liu,K. Polyak, and X. S. Liu,“Gene expression profiling ofhum an breasttissue sam plesusing SAGE-Seq,” Genome Research, vol. 20, no. 12, pp. 1730–1739,2010. [3] M . Nicolae, S. M angul,I. I. M andoiu,and A. Zelikovsky,“Estim ation ofalternative splicing isoform frequencies from RNA-Seq data,” in Proc. WABI, ser. Lecture Notes in Computer Science, V. M oulton and M . Singh,Eds., vol. 6293. Springer,2010,pp. 202–214. Read-Isoform Com patibility a a a a i r F Q O w , i r w , Empirical Comparison of Protocols for Sequencing- Based Gene and Isoform Expression Profiling Marius Nicolae and Ion Măndoiu (University of Connecticut)

Upload: lynnea

Post on 25-Feb-2016

64 views

Category:

Documents


4 download

DESCRIPTION

Empirical Comparison of Protocols for Sequencing-Based Gene and Isoform Expression Profiling Marius Nicolae and Ion M ă ndoiu (University of Connecticut). - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: NM11

IsoEM vs. DGE-EM

DGE-EMObserved data: • tag classes = tags and their

matching sites in the isoforms• for every tag, matching sites

are weighted based on tag quality scores (Q)

Hidden variables: • n(i,s) = number of tags

coming from site s in isoform iParameters: • fi = frequency of isoform i, for

all i

IsoEMObserved data: • read classes = reads and their

matching positions in the isoforms• for every read, matching positions

are weighted based on read quality scores (Q), read orientation (O) and fragment length probability distribution (F)

Hidden variables: • n(i) = number of reads coming

from isoform iParameters: • fi = frequency of isoform i, for all i

Motivation• Massively parallel sequencing is quickly replacing microarrays as the

technology of choice for performing gene expression profiling • Two main protocols proposed in the literature:

– RNA-Seq: generates short (single or paired) sequencing tags from the ends of randomly generated cDNA fragments.

– Digital Gene Expression (DGE) [1], or high-throughput sequencing based Serial Analysis of Gene Expression (SAGE-Seq) [2]: generates single cDNA tags using an assay including as main steps transcript capture and cDNA synthesis using oligo(dT) beads, cDNA cleavage with an anchoring enzyme, and release of cDNA tags using a tagging enzyme whose recognition site is ligated upstream of the recognition site of the anchoring enzyme

• We present results of a simulation study comparing the accuracy achieved by the two protocols based on state-of-the-art inference algorithms using the expectationmaximization (EM) framework

In Silico Experimental Setup• UCSC human genome hg19• 5 tissues from the GNFAtlas2 table

Rna-Seq:• 1-30 million reads• lengths 14-100 bp• normal fragment length

distribution N(250, 25)• Algorithm: IsoEM [3]

DGE: • 1-30 million tags • lengths 14-26 bp• cutting enzymes from the Restriction

Enzyme Database:• DpnII [1] with recognition site GATC• NlaIII [2] with recognition site CATG• CviJI with degenerate recognition site

RGCY (R=G or A, Y=C or T)• Algorithm: DGE-EM

DGE-EM algorithm

E-step

M-step

RNA-Seq

1418

2126

36

50

75

100

0

5

10

15

20

25

10000005000000

1000000015000000

30000000

Read Length

MPE

#Reads

0-5 5-10 10-15 15-20 20-25

Median Percent Error (MPE)

Isoform MPE %Isof. Cut

Gene MPE %Genes CutProtocol Avg. S.D. Avg. S.D.

RNA-Seq 20.6 0.14 N/A 8.3 0.17 N/A

DGE

Full digestp=1

GATC 30.1 0.09 95% 4.7 0.27 95%

CATG 25.8 0.07 97% 2.9 0.08 98%

RGCY 24.1 0.08 98% 2.5 0.05 99.9%

Partial digestp=0.5

GATC 24.4 0.22 95% 3.7 0.17 95%

CATG 22.8 0.19 97% 2.5 0.05 98%

RGCY 23.9 0.11 98% 2.3 0.06 99.9%

DGE enzyme CATG p=.5

1000000

5000000

10000000

15000000

30000000

0

2

4

6

8

10

12

14

14

18

21

24

26

#Tags

MPE

Tag Length

0-2 2-4 4-6 6-8 8-10 10-12 12-14

DGE enzyme RGCY p=.5

1000000

5000000

10000000

15000000

30000000

0

2

4

6

8

10

12

14

14

18

21

24

26

#Tags

MPE

Tag Length

0-2 2-4 4-6 6-8 8-10 10-12 12-14

DGE enzyme GATC p=.5

1000000

5000000

10000000

15000000

30000000

0

2

4

6

8

10

12

14

16

1418

21

24

26

#Tags

MPE

Tag Length

0-2 2-4 4-6 6-8 8-10 10-12 12-14 14-16

Tag-Isoform Compatibility

DGECleave with anchoring enzyme (AE)

Attach primer for tagging enzyme (TE)

Cleave with tagging enzyme

AAAAA

AAAAACATG

AE

TCCRAC AAAAACATG

AETE

CATG

Map tags

A B C D E

Gene Expression (GE) Isoform Expression (IE)

RNA-Seq

A B C D E

Make cDNA & shatter into fragments

Sequence fragment ends

Map reads

Gene Expression (GE) Isoform Expression (IE)

IsoEM algorithm

E-step

M-step

Conclusions• For isoform expression inference, RNA-Seq yields slightly better accuracy

compared to DGE, whereas for gene expression DGE has significantly better accuracy when enough isoforms are cut by the restriction enzyme used.

• DGE accuracy improves with the percentage of isoforms cut and is better for partial digestion compared to complete digestion.

• Using enzymes with degenerate recognition sites, such as CviJI, yields better accuracy than using enzymes from published studies

ACKNOWLEDGEMENTSThis work was supported in part by NSF awards IIS-0546457 and IIS-0916948

REFERENCES[1] Y. Asmann, E. Klee, E. A. Thompson, E. Perez, S. Middha, A. Oberg, T. Therneau, D. Smith, G. Poland, E. Wieben,

and J.-P. Kocher, “3’ tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer,” BMC Genomics, vol. 10, no. 1, p. 531, 2009.

[2] Z. J. Wu, C. A. Meyer, S. Choudhury, M. Shipitsin, R. Maruyama, M. Bessarabova, T. Nikolskaya, S. Sukumar, A. Schwartzman, J. S. Liu, K. Polyak, and X. S. Liu, “Gene expression profiling of human breast tissue samples using SAGE-Seq,” Genome Research, vol. 20, no. 12, pp. 1730–1739,2010.

[3] M. Nicolae, S. Mangul, I. I. Mandoiu, and A. Zelikovsky, “Estimation of alternative splicing isoform frequencies from RNA-Seq data,” in Proc. WABI, ser. Lecture Notes in Computer Science, V. Moulton and M. Singh, Eds., vol. 6293. Springer, 2010, pp. 202–214.

Read-Isoform Compatibility

a

aaair FQOw ,

irw ,

Empirical Comparison of Protocols for Sequencing-Based Gene and Isoform Expression Profiling

Marius Nicolae and Ion Măndoiu (University of Connecticut)