RSEM: accurate transcript quantification from RNA-Seq data
with or without a reference genome
Li and Dewey BMC Bioinformatics 2011, 12:323
Kim Dong-in
Bo Li and Colin N Dewey
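RNA-Seq
- produces millions of short reads, each sequenced from one end or both ends (single-end or paired-end) of a cDNA fragment derived from RNA
- main use: transcript quantification
- key challenge: reads that map to multiple genes or isoforms
- simple approaches rely on read counts and length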
Abstract + Background
Transcript quantification
- map reads to a genome or a transcript set
- estimate gene and isoform abundances
Major complication
- many reads do not map uniquely to a single gene or isoform
RSEM (RNA-Seq by Expectation Maximization)
- works from transcript sequences, not a reference genome, so it can be combined with a de novo transcriptome assembler
- extends the original methodology to model paired-end reads, variable-length reads, fragment length distributions, and quality scores
- reports 95% credibility intervals (CI) and posterior mean estimates (PME) in addition to maximum likelihood (ML) estimates
- estimates the abundance of each gene and isoform
RSEM (RNA-Seq by Expectation Maximization)
- in the experiments, short single-end (SE) reads gave better gene-level quantification accuracy than paired-end (PE) reads for the same sequencing throughput
- quality scores did not significantly improve quantification accuracy: with Illumina error profiles, the read sequences alone were enough
Simple approach: count the reads that map uniquely to each gene and correct for length
Problems
- mappability not taken into account: biased estimates
- alternatively spliced genes: incorrect estimates
- no isoform abundances
Methods developed to address these problems
- "rescuing" reads that map to multiple genes
- modeling at the isoform level
- EM (expectation-maximization algorithm)
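The idea of using EM to share out reads that map to several isoforms can be illustrated with a minimal sketch. This is not RSEM's implementation, which also models fragment lengths, start positions, and sequencing errors. The function name `em_abundance` and its inputs (`read_hits`, `eff_lengths`) are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def em_abundance(read_hits, eff_lengths, n_iter=200):
    """Estimate the fraction of reads that come from each transcript.

    read_hits:   list of lists; read_hits[r] holds the transcript ids
                 that read r aligns to (hypothetical input format)
    eff_lengths: dict mapping transcript id -> length used for normalization
    """
    transcripts = list(eff_lengths)
    # Start from a uniform guess for theta (fraction of reads per transcript).
    theta = {t: 1.0 / len(transcripts) for t in transcripts}

    for _ in range(n_iter):
        expected = defaultdict(float)
        # E-step: split each read across its candidate transcripts in
        # proportion to theta[t] / length[t].
        for hits in read_hits:
            weights = {t: theta[t] / eff_lengths[t] for t in hits}
            total = sum(weights.values())
            if total == 0.0:
                continue  # no candidate has any support; skip the read
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: theta becomes each transcript's share of the expected counts.
        n_assigned = sum(expected.values())
        if n_assigned == 0.0:
            break
        theta = {t: expected[t] / n_assigned for t in transcripts}
    return theta


# 30 reads unique to A, 10 unique to B, 40 shared between A and B.
reads = [["A"]] * 30 + [["B"]] * 10 + [["A", "B"]] * 40
theta = em_abundance(reads, {"A": 1000.0, "B": 1000.0})
print({t: round(v, 3) for t, v in theta.items()})  # {'A': 0.75, 'B': 0.25}
```

With equal lengths, the shared reads end up divided 3:1, the same ratio as the uniquely mapped reads, which is the behavior a single "rescue" pass only approximates.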
Among tools with similar statistical methods, only RSEM and IsoEM handle reads that map ambiguously between both isoforms and genes
RSEM (RNA-Seq by Expectation Maximization)
- models RSPDs (read start position distributions)
- computes posterior mean estimates (PME) and 95% credibility intervals (CI)
- designed to work without a whole genome sequence
IsoEM
- reports maximum likelihood (ML) estimates
Abstract + Background - Related work
RSEM (RNA-Seq by Expectation Maximization)
1. Generate a set of reference transcript sequences
2. Align reads to the reference, then estimate abundances and credibility intervals
Scripts
- rsem-prepare-reference
- rsem-calculate-expression
Implementation
RSEM is designed to work with transcript sequences, not a whole genome
1. Aligning reads to a eukaryotic genome is complicated: splicing and polyadenylation make alignment challenging at the genome level
2. Transcript-level alignments are easier and faster
Implementation - Reference sequence preparation
rsem-prepare-reference
Sources of reference transcripts
- genome database
- de novo transcriptome assembler
- EST database
- UCSC and Ensembl genome browser databases
Produces a set of preprocessed transcript sequences
- appends poly(A) tail sequences to the reference transcripts (disabled with --no-polyA)
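What appending poly(A) tails to a reference amounts to can be sketched in a few lines. This is an illustration only, not the code inside rsem-prepare-reference; the function name `append_poly_a` and the dictionary input are hypothetical, and the default tail length of 125 is an assumption chosen to match the `mouse_125` reference name used in these slides.

```python
def append_poly_a(transcripts, tail_length=125):
    """Return a copy of the transcripts with a run of 'A' bases appended.

    transcripts: dict mapping transcript name -> nucleotide sequence
    tail_length: number of 'A' bases to add (assumed default, see lead-in)
    """
    tail = "A" * tail_length
    return {name: seq + tail for name, seq in transcripts.items()}


reference = {"tx1": "ACGTTGCA", "tx2": "GGCATC"}
with_tails = append_poly_a(reference, tail_length=5)
print(with_tails["tx1"])  # ACGTTGCAAAAAA
```

Reads that run into the tail of a transcript can then align to the reference instead of being lost.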
rsem-prepare-reference
rsem-prepare-reference --gtf mm9.gtf \
                       --transcript-to-gene-map knownIsoforms.txt \
                       --bowtie-path /sw/bowtie \
                       /mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
                       /ref/mouse_125
or, in place of the FASTA file list: /mm9
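rsem-calculate-expression
1. Map (align) reads to the reference
2. Calculate relative abundances
Mapping tools: Bowtie (default), or alignments supplied in SAM format
Mapping conditions
- all valid alignments are kept, not a single best alignment
- at most two mismatches in the first 25 bases
- reads with more than 200 alignments are discarded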
Implementation - Read mapping, abundance estimation
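rsem-calculate-expression
Input reads
- FASTA: position-dependent error model
- FASTQ: paired-end or single-end, with quality scores
Abundances are estimated with EM (expectation-maximization algorithm)
Options
- --strand-specific: for strand-specific data (by default reads may come from the sense or antisense direction)
- --fragment-length-* options: describe the fragment length distribution for single-end (SE) data; with paired-end (PE) data the distribution is learned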
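rsem-calculate-expression
- --estimate-rspd: estimates the read start position distribution, for data with highly 5' or 3' biased position distributions
- --calc-ci: in addition to the maximum likelihood estimates, computes 95% credibility intervals, which capture uncertainty, and posterior mean estimates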
rsem-calculate-expression - output
Estimated counts
- isoform-level and gene-level: can be used by edgeR, DESeq
Estimated fraction of transcripts
- TPM (transcripts per million)
- independent of the mean expressed transcript length
- TPM is preferred over RPKM and FPKM
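The difference between TPM and FPKM can be made concrete with a short sketch. The formulas below are the commonly used definitions, written here under the assumption that each transcript has an estimated count and a length; the function names `tpm` and `fpkm` and the toy numbers are hypothetical and do not come from RSEM's code.

```python
def tpm(counts, lengths):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {t: counts[t] / lengths[t] for t in counts}
    total = sum(rates.values())
    return {t: 1e6 * r / total for t, r in rates.items()}


def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million fragments."""
    n_fragments = sum(counts.values())
    return {t: 1e9 * counts[t] / (lengths[t] * n_fragments) for t in counts}


# Two transcripts present in equal molar amounts; B is three times longer,
# so it yields three times as many fragments.
counts = {"A": 100.0, "B": 300.0}
lengths = {"A": 1000.0, "B": 3000.0}

print(tpm(counts, lengths))
print(fpkm(counts, lengths))
```

TPM values always sum to one million within a sample, while the FPKM total depends on the lengths of whichever transcripts happen to be expressed, which is why TPM is easier to compare across samples.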
rsem-calculate-expression - output
- --out-bam: writes the alignments to a BAM file for viewing in a genome browser
- rsem-bam2wig: converts BAM to wig, giving the expected number of reads overlapping each genomic position
- uses the GTF-formatted annotation
Implementation - Visualization
rsem-plot-model
- turns rsem-calculate-expression output into a PDF report
- plots the learned fragment and read length distributions and the sequencing error parameters
IsoEM - reads aligned to transcript sequences (Bowtie)
Cufflinks (quantification mode) - reads aligned to the genome sequence (TopHat)
rQuant - reads aligned to the genome sequence (TopHat)
RSEM (v0.6) - reads aligned to transcript sequences (Bowtie)
Results and Discussion - Comparison to related tools
20 million simulated RNA-Seq fragments (non-strand-specific, mouse transcriptome)
- paired-end reads
- single-end reads: obtained by throwing out the second read of each pair
Reference transcript sets
- RefSeq: 20,852 genes and 1.2 isoforms per gene on average
- Ensembl: 22,329 genes and 3.4 isoforms per gene on average
Accuracy measures for the tested methods
- median percent error (MPE)
- error fraction (EF), with a 10% threshold
- false positive (FP) statistics
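The three accuracy measures can be sketched as follows. The exact definitions are assumptions made for this example: percent error is taken relative to the true abundance and computed only where the true abundance is positive, EF is the fraction of transcripts whose percent error exceeds the 10% threshold, and FP is the fraction of transcripts with true abundance below a cutoff (1.0 here, an assumed value) that are estimated at or above it. All function names and the toy values are hypothetical.

```python
from statistics import median


def percent_errors(true, est):
    """Percent error per transcript, skipping transcripts with zero truth."""
    return [100.0 * abs(est[t] - true[t]) / true[t] for t in true if true[t] > 0]


def median_percent_error(true, est):
    return median(percent_errors(true, est))


def error_fraction(true, est, threshold=10.0):
    """Fraction of transcripts whose percent error exceeds the threshold."""
    errs = percent_errors(true, est)
    return sum(e > threshold for e in errs) / len(errs)


def false_positive_rate(true, est, cutoff=1.0):
    """Fraction of low-abundance transcripts estimated at or above the cutoff."""
    low = [t for t in true if true[t] < cutoff]
    if not low:
        return 0.0
    return sum(est[t] >= cutoff for t in low) / len(low)


true_tpm = {"A": 100.0, "B": 50.0, "C": 10.0, "D": 0.5}
est_tpm = {"A": 105.0, "B": 40.0, "C": 10.0, "D": 2.0}

print(median_percent_error(true_tpm, est_tpm))  # 12.5
print(error_fraction(true_tpm, est_tpm))        # 0.5
print(false_positive_rate(true_tpm, est_tpm))   # 1.0
```

MPE summarizes the typical error, EF shows how many transcripts miss by more than the threshold, and FP isolates the case of reporting expression where there is almost none.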
RSEM and IsoEM outperform Cufflinks and rQuant.
1. Cufflinks and rQuant do not fully handle reads that map to multiple genes
- Cufflinks: a "rescue"-like strategy, equivalent to one iteration of the EM algorithm
- rQuant: it is not clear how the method handles gene multireads
RSEM and IsoEM outperform Cufflinks and rQuant.
2. Performance gap between the two groups
- Cufflinks and rQuant align reads to the genome; RSEM and IsoEM align reads to a transcript set
- Cufflinks does not properly handle short transcripts: abnormally high abundance estimates for transcripts shorter than the mean fragment length (280 bases)
RSEM handles poly(A) tails; IsoEM does not
MPE: median percent error
EF: error fraction
FP: false positive
Microarray Quality Control (MAQC) samples
- HBR: human brain reference
- UHR: universal human reference
qRT-PCR: values for about 1,000 genes (5%) out of a total of 19,005; 716 genes after filtering
- this gene set is biased towards single-isoform genes
Single-end
- at the gene level, the number of reads matters more than read length
- optimal read length is around 25 bases in mouse and maize
Paired-end
- better for isoform estimates within alternatively spliced genes
Results and Discussion - Paired vs. single end reads
mouse RefSeq
- Empirical: learned from training data
- Profile: base-dependent
These results apply only to the task of quantification ("We stress…")
Results and Discussion - Availability and requirements
Project name: RSEM
Project home page: http://deweylab.biostat.wisc.edu/rsem
Operating systems: Any POSIX-compatible platform (e.g., Linux, Mac OS X, Cygwin)
Programming languages: C++, Perl
Other requirements: Pthreads; Bowtie, R
Conclusions
RSEM (RNA-Seq by Expectation Maximization)
- performs quantification at the gene and isoform level
- does not require a reference genome
- supports quantification with de novo transcriptome assemblies
- visualization outputs
- credibility interval (CI) estimates
- user-friendly: two commands
- works from reference transcript files
- single-end reads for gene-level quantification
- paired-end reads for within-gene isoform quantification, for mouse and human