Systematic evaluation
of spliced alignment programs
for RNA-seq data
Engström et al. (Nature Methods 2013)
Presented by Monica Dragan
bioinformatics.ca
Mapping the reads to a reference genome or a transcriptome database
Deep sequencing (with NGS)
Why RNA sequencing?
Functional studies
Gene prediction is difficult
Mapping strategies depend on read length
Read length < 50 bp: short (unspliced) aligners
BWA, Bowtie
Read length > 50 bp: spliced alignment programs (in mRNA sequences the introns have been removed)
GSNAP, MapSplice, STAR, PALMapper, TopHat, ReadsMap, PASS, SMALT
Outline
Challenges in RNA sequence alignment
The aim of this paper
Existing spliced-alignment software
Conclusions
Challenges in RNA-seq alignment
Large number of reads (~100M): computationally expensive
- Compression with the Burrows-Wheeler Transform
RNA splicing / alternative splicing
- mRNA transcripts do not include introns, so the alignment program must handle gapped (spliced) alignment with very large gaps
- Alternative splicing: a single gene may code for multiple proteins
Paired read separation issue
Pseudogenes
- Pseudogenes often have sequences highly similar to functional, intron-containing genes, so RNA reads can be mapped to them incorrectly
- The human genome contains over 14,000 pseudogenes [Pei et al. Genome Biol 2012]
Duplications
- May correspond to biased PCR amplification of particular fragments
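The Burrows-Wheeler Transform mentioned under the first challenge can be illustrated with a small sketch. This is a naive version that sorts all rotations, meant only to show what the transform does; real aligners build it through suffix arrays and add an index on top. The function names and the `$` sentinel are choices made for this example.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform by sorting all rotations (naive, for illustration)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = bwt("GATTACA")
print(transformed)               # ACTGA$TA
print(inverse_bwt(transformed))  # GATTACA
```

The transform is reversible and tends to group identical characters together, which is what makes the compressed index practical for a whole genome.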
The aim of this paper
Assess the performance of 26 RNA-seq alignment protocols based on 11 programs on real and simulated human and mouse transcriptomes
Alignment protocols were evaluated on Illumina 76-nucleotide paired-end RNA-seq data from the human leukemia cell line K562 (1.3 × 10^9 reads), mouse brain (1.1 × 10^8 reads) and two simulated data sets
Outline
Challenges in RNA sequence alignment
The aim of this paper
Existing spliced-alignment software
- TopHat
- MapSplice
- STAR
- GSNAP
Conclusions
TopHat
Trapnell, Pachter, and Salzberg (2009)
Unspliced alignment
- reads that map to more than 10 locations
- reads that have more than a few mismatches
Assemble islands of sequences
Such an approach will identify only known or predicted combinations of exons
Spliced alignment
Known junction signals: GT-AG, GC-AG, and AT-AC
If an alignment extends into an intron region, realign the reads to the adjacent exons instead
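As an illustration of the junction signals listed above, the following sketch checks whether a candidate intron starts and ends with one of the three known dinucleotide pairs. It is not TopHat code: the function names, the 0-based half-open coordinates and the forward-strand-only check are assumptions made for this example.

```python
# Donor/acceptor dinucleotide pairs named on the slide.
KNOWN_SIGNALS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def junction_signal(genome: str, intron_start: int, intron_end: int) -> tuple:
    """Return the (donor, acceptor) dinucleotides of genome[intron_start:intron_end]."""
    if intron_end - intron_start < 4:
        raise ValueError("intron too short to carry both signals")
    donor = genome[intron_start:intron_start + 2].upper()
    acceptor = genome[intron_end - 2:intron_end].upper()
    return donor, acceptor


def has_known_signal(genome: str, intron_start: int, intron_end: int) -> bool:
    """True if the candidate intron is flanked by GT-AG, GC-AG or AT-AC (forward strand only)."""
    return junction_signal(genome, intron_start, intron_end) in KNOWN_SIGNALS


# Toy sequence: exon (6 bp), intron GTAAGTTTTTCAG (13 bp), exon (6 bp).
toy_genome = "AAACCC" + "GTAAGTTTTTCAG" + "GGGTTT"
print(has_known_signal(toy_genome, 6, 19))  # True
print(has_known_signal(toy_genome, 7, 19))  # False
```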
MapSplice
Wang et al. (2010)
Similar to TopHat
Reads = tags
A tag T has an exonic alignment if it can be aligned in its entirety to a consecutive sequence of nucleotides in the genome G.
T has a spliced alignment if its alignment to G requires one or more gaps.
Step 1: exonic alignment
Step 2: spliced alignment
The spliced alignment of t_{j+1} to the genomic interval between the anchors t_j and t_{j+2}:
consider all possible positions of the splice site and map according to the Hamming distance
Step 3: merge candidate segment alignments
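Step 2 can be sketched as a brute-force search: the segment between two anchors is split at every possible position, the left part is compared with the genome right after the upstream anchor, the right part with the genome right before the downstream anchor, and the split with the smallest Hamming distance wins. This is a simplified illustration, not MapSplice's implementation; the function names and coordinate conventions are assumptions, and splice signals and mismatch limits are ignored.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def best_splice_position(genome: str, segment: str, left_end: int, right_start: int) -> tuple:
    """Try every split of `segment` across the gap between two anchors.

    left_end    : genome index just after the upstream anchor
    right_start : genome index where the downstream anchor begins
    Returns (k, distance): the first k bases align after the upstream anchor,
    the remaining bases align immediately before the downstream anchor.
    """
    m = len(segment)
    if right_start - left_end < m:
        raise ValueError("interval between anchors is shorter than the segment")
    best = None
    for k in range(m + 1):
        upstream = genome[left_end:left_end + k]
        downstream = genome[right_start - (m - k):right_start]
        distance = hamming(segment[:k], upstream) + hamming(segment[k:], downstream)
        if best is None or distance < best[1]:
            best = (k, distance)
    return best


# Toy example: anchor, 4 exonic bases, 12 bp intron, 4 exonic bases, anchor.
toy_genome = "ACGTACGT" + "TTGA" + "GT" + "A" * 9 + "G" + "CCAT" + "GGCCGGCC"
print(best_splice_position(toy_genome, "TTGACCAT", 8, 28))  # (4, 0)
```

With k = 4 the splice site falls at genome positions 12 and 24, which are the boundaries of the toy intron.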
STAR
Dobin et al. (2012)
Maximal Mappable Prefix (MMP) at read location i = the longest substring of the read starting at position i that exactly matches one or more substrings of the reference genome
poor genomic alignment
Detect: (a) splice junctions, (b) mismatches, (c) tails
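The Maximal Mappable Prefix definition can be made concrete with a naive sketch. STAR itself searches an uncompressed suffix array; the plain string search below is a stand-in chosen for readability, and all function names are made up for this example. Applying the search repeatedly to the unmapped remainder of the read splits a junction-spanning read into its exonic pieces.

```python
def find_all(text: str, pattern: str) -> list:
    """All start positions of `pattern` in `text` (overlapping matches included)."""
    hits = []
    pos = text.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits


def maximal_mappable_prefix(read: str, genome: str, start: int = 0) -> tuple:
    """Longest substring of `read` beginning at `start` with an exact genome match.

    Returns (length, genome_positions); length is 0 if even one base has no match.
    """
    length, positions = 0, []
    while start + length < len(read):
        hits = find_all(genome, read[start:start + length + 1])
        if not hits:
            break
        length += 1
        positions = hits
    return length, positions


def sequential_mmp(read: str, genome: str) -> list:
    """Split a read into consecutive MMPs as (read_start, length, genome_positions)."""
    segments, i = [], 0
    while i < len(read):
        length, positions = maximal_mappable_prefix(read, genome, i)
        step = max(length, 1)  # skip a single base that matches nowhere
        segments.append((i, step, positions))
        i += step
    return segments


# Toy genome: exon, intron (GT...AG), exon. The read spans the junction.
toy_genome = "AAACGTACGG" + "GTCCCCCCAG" + "TTGCATGCAA"
print(sequential_mmp("TACGGTTGCA", toy_genome))
# [(0, 5, [5]), (5, 5, [20])]
```

The first MMP ends at genome position 10 and the second starts at 20, so the gap between them corresponds to the toy intron.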
GSNAP
Wu and Nacu (2010)
Efficient detection of indels and splice pairs:
For large genomes, it is more efficient to preprocess the genome rather than the reads to create genomic index files, which provide genomic positions for a given prefix/suffix.
Works with candidate regions in the reference genome (keeps track of the read locations of the 12-residue sequences that support each candidate region)
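The idea of indexing the genome and collecting candidate regions can be sketched with a plain dictionary of k-mer positions. This is only an illustration under simplifying assumptions, not GSNAP's data structures: the function names are invented here, the index is an in-memory hash table, and the demo uses k = 4 on a toy sequence instead of 12 on a real genome.

```python
from collections import defaultdict


def build_kmer_index(genome: str, k: int = 12) -> dict:
    """Map every k-mer of the genome to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def candidate_regions(read: str, index: dict, k: int = 12) -> dict:
    """Group k-mer hits by the genomic start they imply for the read.

    Returns {implied_read_start_in_genome: [read offsets of supporting k-mers]}.
    """
    support = defaultdict(list)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            support[pos - offset].append(offset)
    return dict(support)


toy_genome = "ACGTTGCAAGGCTTAACCGGTATCG"
toy_read = toy_genome[6:18]  # an error-free 12 bp read taken from position 6
index = build_kmer_index(toy_genome, k=4)
regions = candidate_regions(toy_read, index, k=4)
best_start = max(regions, key=lambda start: len(regions[start]))
print(best_start, regions[best_start])
```

Keeping the read offsets, rather than only a hit count, shows which parts of the read support a region; a run of missing offsets hints at a mismatch, indel or splice site in that part of the read.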
For a more powerful use of the algorithms:
- use available gene annotations, which allow it to avoid erroneously mapping reads to pseudogenes
- use the information about the mate of each paired-end read
Conclusions
Mismatches and basewise accuracy
MapSplice, PASS and TopHat display a low tolerance for mismatches. Consequently, a large proportion of reads with low base-call quality scores were not mapped by these methods
GSNAP, GSTRUCT, MapSplice, PASS, SMALT and STAR allow mismatches and can also output an incomplete alignment when they are unable to map an entire sequence
Reads from mouse were mapped (against the mouse reference assembly) at a greater rate and with fewer mismatches than those from K562 (the cancer cell line K562 has accumulated many mutations relative to the human reference assembly).
Indel frequency and accuracy
GSTRUCT produced the most uniform distribution of indels (coefficient of variation (CV) = 0.32)
TopHat produced the most variable distribution (CV = 1.5 and 1.1 splice junctions)
Size distribution of indels for the human K562 data set
Precision and recall, stratified by indel size
GEM and PALMapper output included more indels than any other method
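For reference, the coefficient of variation quoted above is the standard deviation divided by the mean. The sketch below uses invented toy counts, not values from the paper, and assumes the population standard deviation; the slide does not say which variant was used or over what the indel distribution was computed.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts) -> float:
    """CV = standard deviation / mean (population standard deviation assumed)."""
    mu = mean(counts)
    if mu == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return pstdev(counts) / mu


# Toy counts only: indels spread evenly versus concentrated in one bin.
print(coefficient_of_variation([10, 10, 10, 10]))  # 0.0
print(coefficient_of_variation([0, 0, 0, 40]))     # about 1.73
```

A CV near 0 means the indels are spread evenly; larger values mean they pile up in a few places.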
GEM and PALMapper report many false indels (low precision)
GSNAP and GSTRUCT exhibit high sensitivity for deletions, independent of size (high recall)
The TopHat2 protocol is the most sensitive method for long insertions (high recall)
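Precision and recall stratified by indel size can be computed as in the following sketch, which needs a set of true indels (for example from simulated reads). The tuple layout `(chrom, pos, kind, size)`, the exact-match criterion and the function name are assumptions for this example and may differ from the paper's evaluation.

```python
def precision_recall_by_size(called, truth) -> dict:
    """Per indel size, return (precision, recall); None where a ratio is undefined.

    Indels are tuples (chrom, pos, kind, size) and count as correct only on exact match.
    """
    called, truth = set(called), set(truth)
    result = {}
    for size in sorted({indel[3] for indel in called | truth}):
        called_s = {indel for indel in called if indel[3] == size}
        truth_s = {indel for indel in truth if indel[3] == size}
        true_positives = len(called_s & truth_s)
        precision = true_positives / len(called_s) if called_s else None
        recall = true_positives / len(truth_s) if truth_s else None
        result[size] = (precision, recall)
    return result


# Toy data, invented for illustration.
truth = [("chr1", 100, "del", 1), ("chr1", 200, "del", 1), ("chr1", 300, "ins", 3)]
called = [("chr1", 100, "del", 1), ("chr1", 150, "del", 1),
          ("chr1", 300, "ins", 3), ("chr1", 400, "ins", 3)]
print(precision_recall_by_size(called, truth))
# {1: (0.5, 0.5), 3: (0.5, 1.0)}
```

A method that reports many false indels shows up here as low precision, while a method that misses true indels shows up as low recall.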
Spliced alignment
High junction discovery accuracy for ReadsMap, GSNAP, GSTRUCT, MapSplice and TopHat
The number of false junction calls was greatly reduced when junctions were filtered by the number of supporting alignments (plot c)
Protocols using annotation recovered nearly all of the known junctions in expressed transcripts (plot d)
For novel-junction discovery, GSTRUCT outperformed other methods
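Filtering junctions by supporting alignment counts can be sketched as follows. The input layout (one list of `(chrom, donor, acceptor)` tuples per alignment), the function names and the threshold of 2 are hypothetical choices for this example; the slide does not give the cutoff used in the paper.

```python
from collections import Counter


def count_junction_support(alignments) -> Counter:
    """Count how many alignments support each junction.

    `alignments` is an iterable of lists of (chrom, donor, acceptor) tuples,
    one list per read alignment (empty for unspliced alignments).
    """
    support = Counter()
    for junctions in alignments:
        support.update(set(junctions))  # each alignment counts once per junction
    return support


def filter_junctions(support: Counter, min_support: int = 2) -> set:
    """Keep junctions supported by at least `min_support` alignments."""
    return {junction for junction, n in support.items() if n >= min_support}


# Toy alignments, invented for illustration.
toy_alignments = [
    [("chr1", 100, 500)],
    [("chr1", 100, 500), ("chr1", 600, 900)],
    [("chr1", 100, 520)],  # supported by a single alignment
    [],                    # unspliced alignment
]
support = count_junction_support(toy_alignments)
print(filter_junctions(support, min_support=2))  # {('chr1', 100, 500)}
```

Raising the threshold removes junctions seen only once, which are more likely to be alignment artifacts, at the cost of losing true junctions in weakly expressed transcripts.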
Conclusions
GSNAP, GSTRUCT, MapSplice and STAR compared favorably to the other methods
MapSplice seems to be a conservative aligner with respect to mismatch frequency, indel and exon junction calls.
The most significant issue with GSNAP, GSTRUCT and STAR is the presence of many false exon junctions in the output.
Both GSNAP and GSTRUCT require considerable computing time when parameterized for sensitive spliced alignment
Thank you!
Remaining challenges:
Remaining challenges include exploiting gene annotation without introducing bias, correctly placing multimapped reads, achieving optimal yet fast alignment around gaps and mismatches, and reducing the number of false exon junctions reported. Ongoing developments in sequencing technology will demand efficient processing of longer reads with higher error rates and will require more extensive spliced alignment as reads span multiple exon junctions. We expect performance of the aligners evaluated here to improve as current shortfalls are addressed. Differential treatment of these issues will enhance and expand the range of RNA-seq aligners suited to varied computational methodologies and analysis aims.
Some RNA-seq aligners, including GSNAP [5], RUM [6], and STAR [7], map reads independently of the alignments of other reads, which may explain their lower sensitivity for these spliced reads
GSNAP [5] and STAR [7] also make use of annotation, although they use it in a more limited fashion in order to detect splice sites
have shown how suffix arrays (Manber and Myers, 1990), compressed using a Burrows-Wheeler Transform (BWT) (Burrows and Wheeler, 1994), can rapidly map reads that are exact matches or have a few mismatches or short insertions or deletions (indels) relative to the reference.
A third approach, provided by the QPALMA program (Bona et al., 2008), can align individual reads across exon-exon junctions using Smith-Waterman-type alignments and a specifically trained splice site model.
Functional Genomics, SS2014
Monday, 10 March 2014