![Page 1: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/1.jpg)
TOX680 Unveiling the Transcriptome using RNA-seq
Jinze Liu
![Page 2: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/2.jpg)
Outline
• What is the transcriptome?
• Measuring the transcriptome
• Sampling the transcriptome using short reads
• Alignment of reads to a reference genome
• Splice graph representation of RNA-seq data
• Reconstructing the transcriptome
• Differential analysis of the transcriptome
![Page 3: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/3.jpg)
Genome, Transcriptome, Proteome
[Figure: schematic illustration of a eukaryotic cell, showing the cell nucleus, DNA (genome), RNA, and proteins (proteome)]
The transcriptome is all RNA molecules transcribed from DNA
![Page 4: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/4.jpg)
Dynamics of the Transcriptome
• Cells with the same genome may produce a different transcriptome … how?
• Two main mechanisms
  (1) differential gene expression
  (2) differential gene transcription
[Figure: two diagrams of the flow from DNA to proteins, one via mRNA transcripts and one via pre-mRNA and mRNA]
![Page 5: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/5.jpg)
Alternate transcription
• multiple mRNA transcript “isoforms” within one gene
  – proteins with different functions may be produced
  – e.g. skipped exon in CYT-2 isoform of ERBB4 leads to increased cell proliferation
Muraoka-Cook et al. (2009) Mol Cell Biol
CYT-2: deletes 16 amino acids (WW domain binding motif)
![Page 6: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/6.jpg)
Forms of alternative splicing
Castle et al. (2008) Nature Genetics
Gene VEGFA combines multiple alternative splicing forms (not independently!) …
![Page 7: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/7.jpg)
How to measure the transcriptome?
• Ideally, given a sample of RNA
  – which transcripts are present?
  – how much of each?
• Given two samples of RNA
  – which transcripts are differentially expressed?
![Page 8: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/8.jpg)
Microarrays
• Most common technique for measuring the transcriptome
  – hybridized probes detect the presence and abundance of specific known transcripts
• difficult to observe different transcript isoforms
• abundance has limited dynamic range
![Page 9: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/9.jpg)
Differential gene expression
• Identify transcriptome differences between two samples
![Page 10: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/10.jpg)
![Page 11: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/11.jpg)
The RNA-seq protocol
(Figure source: Nature Reviews Genetics)
• Protocol
  – mRNA is reverse transcribed to cDNA
  – cDNA is randomly fragmented
  – adapters are added to the fragments
  – fragments are sequenced using high-throughput (HT) sequencing technology
    • e.g. Illumina: up to a billion 100bp reads sequenced in a single run
• Each sequence is a randomly sampled fragment of the transcriptome
  – identity determined by alignment to a transcript library or to a reference genome
  – the number of alignments to a genomic locus is a measure of abundance
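The last point, counting alignments per locus as a proxy for abundance, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the locus names, coordinates, and read start positions are hypothetical, and reads are assigned by start position only.

```python
from collections import Counter

def count_reads_per_locus(alignment_positions, loci):
    """Count aligned reads whose start position falls inside each locus.

    alignment_positions: iterable of integer read start coordinates.
    loci: dict mapping a locus name to a half-open (start, end) interval.
    """
    counts = Counter({name: 0 for name in loci})
    for pos in alignment_positions:
        for name, (start, end) in loci.items():
            if start <= pos < end:
                counts[name] += 1
    return dict(counts)

# Hypothetical loci and read start positions, for illustration only.
loci = {"geneA": (0, 1000), "geneB": (2000, 2500)}
starts = [10, 250, 990, 2100, 2499, 1500]
counts = count_reads_per_locus(starts, loci)
```

Raw counts favor longer loci and deeper sequencing runs, so in practice they are usually normalized before samples are compared.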
![Page 12: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/12.jpg)
RNA-seq view of transcriptome
• Issues
  – non-random fragmentation
  – sequencing bias
  – DNA or pre-mRNA contamination
• Spliced alignments
  – not a problem if aligning to a transcript library
  – challenging if aligning to the genome
![Page 13: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/13.jpg)
![Page 14: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/14.jpg)
Spliced alignment strategies
• Annotation based discovery
  – contiguous alignment of reads to existing EST/cDNA sequences with known splice junctions
  – contiguous alignment of reads to paired exons from database of known or suspected junctions (Mortazavi et al. 2008, Wang et al. 2008)
• Ab initio discovery by alignment to reference genome
  – QPalma (De Bona et al. 2008)
    • supervised splice site prediction and gapped alignment algorithm for aligning spliced reads
  – TopHat (Trapnell et al. 2009)
    • detect potential junctions based on structural features of introns, e.g. GT–AG dinucleotide sequences flanking the exons
    • test alignment of reads to candidate exon pairs
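To make the GT–AG idea concrete, here is a small sketch that lists candidate introns as genomic intervals beginning with GT and ending with AG. It is not TopHat's implementation; the function name, the length limits, and the toy genome string are assumptions chosen for illustration.

```python
def candidate_introns(genome, min_len=4, max_len=50):
    """Return half-open (start, end) intervals that begin with GT and end with AG."""
    genome = genome.upper()
    donors = [i for i in range(len(genome) - 1) if genome[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(genome) - 1) if genome[i:i + 2] == "AG"]
    introns = []
    for d in donors:
        for a in acceptors:
            end = a + 2  # the interval includes the AG dinucleotide
            if min_len <= end - d <= max_len:
                introns.append((d, end))
    return introns

# Toy genome: exon "CCCC", intron "GTAAAAAG", exon "CCCC".
genome = "CCCCGTAAAAAGCCCC"
found = candidate_introns(genome)
```

Each candidate interval would then be tested by checking whether reads align across the implied exon pair, which is the second TopHat step listed above.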
![Page 15: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/15.jpg)
Improved splice detection
• Issues
  – Cannot easily find non-canonical splices or long-range splices
  – Single long reads may include multiple splice junctions
  – Spurious alignment is a serious problem
• MapSplice: a second generation ab initio method
  – alignment of reads
    • does not depend on any structural features
    • finds multiple candidate alignments
  – splice inference
    • leverages the quality and diversity of read alignments to disambiguate true junctions from spurious junctions
  – efficient and scalable
![Page 16: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/16.jpg)
Finding spliced alignments
[Figure: mRNA tag T split into segments t1, t2, t3, t4 of length k and aligned to a genome with exon 1, exon 2, exon 3 and junctions j1, j2]
• Example: 100 bp tag T is split into 25bp segments
  – segments are tested for (approximate) alignment to the genome
  – unaligned segments implicate splices
  – find splices by searching from neighboring aligned segments
• Theorem: if no exon is shorter than 2k, then at least one segment must align in every pair of consecutive length k segments.
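The theorem can be checked by brute force on a toy transcript. The sketch below works purely in mRNA coordinates and treats a segment as contiguously alignable when it does not cross an exon-exon junction; exact matching and the chosen exon lengths are simplifying assumptions, not part of the slide.

```python
from itertools import accumulate

def segment_is_contiguous(seg_start, k, exon_lengths):
    """True if the mRNA interval [seg_start, seg_start + k) lies inside one exon."""
    junctions = list(accumulate(exon_lengths))[:-1]  # internal exon-exon boundaries
    return not any(seg_start < j < seg_start + k for j in junctions)

def check_theorem(exon_lengths, k, n_segments):
    """Check every read start: each pair of consecutive segments has an aligned one."""
    mrna_len = sum(exon_lengths)
    read_len = k * n_segments
    for read_start in range(mrna_len - read_len + 1):
        flags = [segment_is_contiguous(read_start + i * k, k, exon_lengths)
                 for i in range(n_segments)]
        for left, right in zip(flags, flags[1:]):
            if not (left or right):
                return False
    return True

# 100 bp reads split into four 25 bp segments, as in the example above.
holds = check_theorem([100, 50, 100], k=25, n_segments=4)          # all exons >= 2k
fails_short_exon = check_theorem([100, 10, 100], k=25, n_segments=4)  # 10 bp exon < 2k
```

When an exon is shorter than 2k, two junctions can fall in adjacent segments, so both segments fail to align and the guarantee is lost.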
![Page 17: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/17.jpg)
MapSplice algorithm (1)
INPUTS: set of RNA-Seq reads (T1, T2, …, Ti); reference genome
(1) Segmentation of reads: each tag Ti is split into segments t1, t2, …, tj, …, tn
(2) Segment exonic alignment: contiguous alignment of segments to the genome (5’ to 3’)
(3) Segment spliced alignment of a missed segment tj+1
  – double anchored: the neighboring segments tj and tj+2 are both aligned
  – single anchored: only the neighboring segment tj is aligned
[Figure: alignment cases for segments tj, tj+1, tj+2 on the genome]
![Page 18: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/18.jpg)
MapSplice algorithm (2)
(4) Segment assembly
(5) Junction inference: junctions are classified as high confidence or low confidence using
  1. Alignment quality
  2. Anchor significance
  3. Entropy
(6) Identify best alignment for tags
OUTPUTS: splices and splice coverage; read alignments
[Figure: segments t1, …, tn of tag Ti assembled into candidate alignments on the genome (5’ to 3’)]
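The slide lists entropy as one junction criterion but gives no formula. One plausible reading, sketched below, is the Shannon entropy of the positions at which supporting reads cross the junction: diverse crossing positions score high, while a stack of identical reads scores zero. The function name and the offset lists are hypothetical, and this should not be taken as MapSplice's exact definition.

```python
import math
from collections import Counter

def junction_entropy(offsets):
    """Shannon entropy (in bits) of the junction offsets observed within reads."""
    counts = Counter(offsets)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical offsets of the junction inside each supporting read.
diverse = junction_entropy([10, 25, 40, 55])   # four distinct crossing positions
stacked = junction_entropy([30, 30, 30, 30])   # identical reads, e.g. duplicates
```

A junction supported only by stacked reads is more likely to be spurious, which is why diversity is combined with alignment quality and anchor significance.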
![Page 19: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/19.jpg)
Validating the algorithm
• How can we tell if it is working well?
  – comparison against transcriptome library alignment
  – but how do we know that novel alignments are valid?
    • run on synthetic transcriptome for which we know ground truth!
[Figure: comparison of BWA and MapSplice (MPS) alignments: identically aligned 80.4%; aligned by both 81.4%; BWA aligned only 1.2%; MapSplice aligned only 5.0% / 6.8%; unaligned 10.2%]
![Page 20: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/20.jpg)
Synthetic Transcriptome
1. Sample each gene’s ABUNDANCE from Wang et al. (2008)
2. Choose a DISTRIBUTION across annotated transcript isoforms in RefSeq
3. Randomly pick the START position for each read (& introduce errors)
4. Align reads with MapSplice and analyze performance.
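Steps 1 to 3 can be sketched as a tiny read simulator. Everything concrete here is an assumption for illustration: the gene abundances, isoform fractions, and transcript sequences are made up rather than drawn from Wang et al. (2008) or RefSeq, errors are simple substitutions, and transcript length is ignored when choosing an isoform.

```python
import random

def isoform_weights(gene_abundance, isoform_fraction):
    """Steps 1-2: per-isoform abundance = gene abundance x isoform fraction."""
    return {iso: gene_abundance[gene] * frac
            for gene, fractions in isoform_fraction.items()
            for iso, frac in fractions.items()}

def simulate_reads(transcripts, weights, n_reads, read_len, error_rate, seed=0):
    """Step 3: sample reads with random start positions and substitution errors."""
    rng = random.Random(seed)
    names = sorted(weights)
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        seq = transcripts[name]
        start = rng.randrange(len(seq) - read_len + 1)
        bases = list(seq[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        # Keep the true origin so alignments can be scored against ground truth.
        reads.append((name, start, "".join(bases)))
    return reads

# Hypothetical inputs, for illustration only.
gene_abundance = {"geneA": 100.0, "geneB": 20.0}
isoform_fraction = {"geneA": {"A.1": 0.8, "A.2": 0.2}, "geneB": {"B.1": 1.0}}
transcripts = {"A.1": "ACGTTGCA" * 10, "A.2": "ACGTGGCC" * 8, "B.1": "TTGACCAG" * 12}
weights = isoform_weights(gene_abundance, isoform_fraction)
reads = simulate_reads(transcripts, weights, n_reads=200, read_len=25, error_rate=0.01)
```

Because each read carries its true transcript and start position, step 4 reduces to comparing the aligner's output with these recorded origins.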
![Page 21: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/21.jpg)
MapSplice performance
![Page 22: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/22.jpg)
Improved accuracy from multiple criteria in junction classification
![Page 23: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/23.jpg)
![Page 24: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/24.jpg)
• Transcriptome changes in response to time, disease, etc
• Characteristics of a transcriptome
  – Qualitatively, which transcripts are expressed
  – Quantitatively, what are their expression levels
[Figure: a gene with exons 1–4; transcript α (exons 1, 2, 3, 4) yields Protein α and transcript β (exons 1, 3, 4) yields Protein β; labels: Splicing Ratio, Transcript Abundance, Protein Expression]
![Page 25: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/25.jpg)
• Transcriptome changes in response to time, disease, etc
• Differential Splicing: alternative splicing events that exhibit significantly different splicing ratios between different samples
[Figure: transcripts α (exons 1, 2, 3, 4) and β (exons 1, 3, 4) with Proteins α and β, shown for Normal and Tumor samples; labels: Splicing Ratio, Transcript Abundance, Protein Expression, Differential Splicing]
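As a worked illustration of a splicing ratio, the sketch below computes the fraction of a gene's transcripts that are isoform α in two samples and compares them. The abundance values are hypothetical, and no statistical test is applied, so this shows only the quantity being compared, not how significance is assessed.

```python
def splicing_ratio(abundance_alpha, abundance_beta):
    """Fraction of the gene's transcripts that are isoform alpha."""
    total = abundance_alpha + abundance_beta
    if total == 0:
        raise ValueError("gene is not expressed in this sample")
    return abundance_alpha / total

# Hypothetical transcript abundances for one gene in two samples.
normal = splicing_ratio(abundance_alpha=80.0, abundance_beta=20.0)
tumor = splicing_ratio(abundance_alpha=30.0, abundance_beta=70.0)
ratio_change = tumor - normal
```

Note that total gene expression is identical in the two hypothetical samples (100 units), so a gene-level differential expression test would see no change even though the splicing ratio shifts substantially.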
![Page 26: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/26.jpg)
• Differential Splicing: why important?
  – Understanding of cell differentiation and development
  – Identification of disease biomarkers
[Figure: Normal vs. Tumor differential splicing of transcripts α and β, repeated from the previous slide]
![Page 27: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/27.jpg)
DiffSplice – Unified Graph Representation
[Figure: RNA-seq read alignment to the reference genome (5’ to 3’); observed read coverage for samples A1, A2 (Group A) and B1, B2 (Group B); splice structure with exons E1, E2, E3, E4, E5 and junctions J1, J2, J3, J4, J5]
Unify structural information (exons and junctions) from all samples
![Page 28: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/28.jpg)
DiffSplice – Unified Graph Representation
Unified Expression-weighted Splice Graph (ESG)
Weighted DAG (Directed Acyclic Graph)
• Vertex – Exonic segment
• Edge – Splice junction
• Weight – Expression level
[Figure: splice structure (E1–E5, J1–J5) and the corresponding ESG running from TS through E1–E5 to TE; example weights for the four samples (A1, A2 in Group A; B1, B2 in Group B): E1 = 94.9, 56.1, 83.7, 62.2; J1 = 91, 57, 84, 64; E2 = 95.2, 55.7, 88.1, 65.6]
Differentiate samples by the weights
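A weighted DAG of this kind is easy to hold in plain dictionaries. The sketch below covers only the E1–E3 part of the graph for a single sample; the weights for E1, J1, and E2 are the first values shown on the slide, while the weights for E3, J2, and J3 are placeholders, and the data layout is an assumption rather than DiffSplice's internal representation.

```python
from graphlib import TopologicalSorter

# Vertex = exonic segment, weight = expression level (E3 is a placeholder value).
vertex_weight = {"E1": 94.9, "E2": 95.2, "E3": 93.0}

# Edge = splice junction, stored as (from, to) -> (junction name, expression level).
# J2 and J3 weights are placeholders.
edge_weight = {
    ("E1", "E2"): ("J1", 91.0),
    ("E2", "E3"): ("J2", 92.1),
    ("E1", "E3"): ("J3", 3.0),
}

def topological_order(vertices, edges):
    """Return the vertices in topological order; raises CycleError if not a DAG."""
    predecessors = {v: set() for v in vertices}
    for (u, v) in edges:
        predecessors[v].add(u)
    return list(TopologicalSorter(predecessors).static_order())

order = topological_order(vertex_weight, edge_weight)
```

Keeping one weight dictionary per sample over a shared set of vertices and edges is what lets samples be compared "by the weights" on a common structure.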
![Page 29: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/29.jpg)
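DiffSplice – Alternative Splicing Modules (ASMs)
[Figure: the ESG (TS, E1–E5, TE with junctions J1–J5) decomposed into ASMs, each with a source and a sink]
• ASM1: source E1 (immediate pre-dominator) and sink E3 (immediate post-dominator); contains E1, E2, E3 with J1, J2, J3
• ASM2: source E3 (immediate pre-dominator) and sink TE (immediate post-dominator); contains E3, E4, E5, TE with J4, J5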
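The dominator relationships can be computed directly on a small DAG. The sketch below is one simple reading of the slide, not DiffSplice's code: a vertex with more than one outgoing edge opens an ASM, its immediate post-dominator is the sink, and the pair is kept when the source is also the sink's immediate pre-dominator. The edge list for E3, E4, E5, and TE is an assumed topology, since the slide does not spell out where J4 and J5 connect.

```python
from graphlib import TopologicalSorter

def immediate_dominators(succ, root):
    """Immediate dominator of every vertex of a DAG reachable from root."""
    preds = {v: set() for v in succ}
    for u, targets in succ.items():
        for v in targets:
            preds[v].add(u)
    dom = {}
    for v in TopologicalSorter(preds).static_order():
        if v == root:
            dom[v] = {v}
        else:
            # A vertex is dominated by itself plus whatever dominates all its predecessors.
            dom[v] = {v} | set.intersection(*(dom[p] for p in preds[v]))
    # Dominators of v form a chain; the closest one has the largest dominator set.
    return {v: max(dom[v] - {v}, key=lambda d: len(dom[d]))
            for v in succ if v != root}

def reverse(succ):
    """Reverse every edge, so post-dominators become dominators."""
    rev = {v: [] for v in succ}
    for u, targets in succ.items():
        for v in targets:
            rev[v].append(u)
    return rev

# Assumed ESG topology (the E3/E4/E5/TE edges are a guess at the figure).
esg = {"TS": ["E1"], "E1": ["E2", "E3"], "E2": ["E3"],
       "E3": ["E4", "E5"], "E4": ["TE"], "E5": ["TE"], "TE": []}

pre = immediate_dominators(esg, "TS")
post = immediate_dominators(reverse(esg), "TE")
asms = [(u, post[u]) for u in esg if len(esg[u]) > 1 and pre[post[u]] == u]
```

Under this assumed topology the result matches the two source and sink pairs named on the slide, (E1, E3) and (E3, TE).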
![Page 30: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/30.jpg)
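DiffSplice – Alternative Splicing Modules (ASMs)
[Figure: the ESG (TS, E1–E5, TE with junctions J1–J5) and its ASMs arranged in a hierarchy labeled Level 0 and Level 1]
• ASM1 (E1, E2, E3 with J1, J2, J3): alternative paths path 1 and path 2 from source to sink
• ASM2 (E3, E4, E5, TE with J4, J5): alternative paths path 1 and path 2 from source to sink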
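Within one ASM, the alternative paths are simply all directed paths from source to sink. A minimal depth-first enumeration for ASM1 might look like this; the adjacency-list encoding is an assumption, and the approach is only practical for small modules because the number of paths can grow quickly.

```python
def enumerate_paths(succ, source, sink):
    """All directed paths from source to sink in a DAG, by depth-first search."""
    if source == sink:
        return [[sink]]
    paths = []
    for nxt in succ.get(source, []):
        for tail in enumerate_paths(succ, nxt, sink):
            paths.append([source] + tail)
    return paths

# ASM1: E1 -> E2 -> E3 (junctions J1, J2) and E1 -> E3 (junction J3).
asm1 = {"E1": ["E2", "E3"], "E2": ["E3"], "E3": []}
paths = enumerate_paths(asm1, "E1", "E3")
```

Here the two paths correspond to including or skipping exon E2.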
![Page 31: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/31.jpg)
DiffSplice – Isoform Abundance Estimation
ASM1 in sample A1
[Figure: ASM1 (E1, E2, E3 with J1, J2, J3) annotated with observed expression values (91, 92.1, 93, 94.93, 95.2 as extracted); the estimated expression of path 1 and path 2 is still unknown (?, ?%)]
Model: N, q → path expressions T1, T2 (Poisson dist’n) → observed weights w(E1), w(E2), w(E3), w(J1), w(J2), w(J3) (Normal dist’n)
![Page 32: TOX680 Unveiling the Transcriptome using RNA- seq](https://reader035.vdocuments.net/reader035/viewer/2022062410/56816144550346895dd0bc9f/html5/thumbnails/32.jpg)
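DiffSplice – Isoform Abundance Estimation
ASM1 in sample A1
$$\hat{q} = \arg\max_{q} \prod_{s \in E \cup J} P(w_s \mid q)$$
where the likelihood combines the terms $f_{\mathrm{Normal}}(w_s \mid T_t)$ and $f_{\mathrm{Poisson}}(T_t \mid N, q_t)$ over paths $t \in T$ and segments $s \in E \cup J$
Estimated expression of ASM1: 95.1
Alternative path proportion: path 1 = 92.0 (96.7%), path 2 = 3.1 (3.3%)
[Figure: ASM1 (E1, E2, E3 with J1, J2, J3) shown twice, with observed expression values (91, 92.1, 93, 94.93, 95.2 as extracted) and with the estimated expression of path 1 and path 2]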
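To give a feel for the estimation step, here is a deliberately simplified stand-in: an ordinary least-squares fit of two path expressions to the observed segment weights. This corresponds to maximum likelihood under equal-variance Normal noise only and omits the Poisson layer of the model above, so it is not the DiffSplice estimator and will not reproduce the slide's 92.0 and 3.1. The assignment of the observed numbers to particular exons and junctions is also an assumption.

```python
def estimate_path_expression(observed, path_members):
    """Least-squares fit of two path expressions to observed segment weights.

    observed: dict mapping segment name -> observed weight.
    path_members: pair of sets naming the segments on path 1 and on path 2.
    """
    p1, p2 = path_members
    a11 = a12 = a22 = b1 = b2 = 0.0
    for seg, w in observed.items():
        x1 = 1.0 if seg in p1 else 0.0
        x2 = 1.0 if seg in p2 else 0.0
        a11 += x1 * x1
        a12 += x1 * x2
        a22 += x2 * x2
        b1 += x1 * w
        b2 += x2 * w
    # Solve the 2x2 normal equations by Cramer's rule.
    det = a11 * a22 - a12 * a12
    t1 = (a22 * b1 - a12 * b2) / det
    t2 = (a11 * b2 - a12 * b1) / det
    return t1, t2

# Assumed mapping of observed weights to segments (illustrative only).
observed = {"E1": 94.9, "J1": 91.0, "E2": 95.2, "J2": 92.1, "E3": 93.0, "J3": 3.0}
path1 = {"E1", "J1", "E2", "J2", "E3"}   # E2 included
path2 = {"E1", "J3", "E3"}               # E2 skipped
t1, t2 = estimate_path_expression(observed, (path1, path2))
proportion_path1 = t1 / (t1 + t2)
```

The shared segments E1 and E3 constrain the sum of the two paths, while the path-specific segments (J1, E2, J2 versus J3) separate them; the full model weighs these observations probabilistically instead of equally.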