RNA SEQUENCING AND DATA ANALYSIS
Length of mRNA transcripts in the human genome
[Figure: distribution of mRNA transcript lengths; main panel with inset]
Overview of RNA sequencing protocol
Insert size: ~200 bp
Paired-end sequencing: forward read and reverse read from either end of the insert
Read length: 48-76 bp
Sequencing parameters
Read Depth
Minimum mapped reads: 10 million for quantitative analysis of mammalian transcriptome
More reads needed for splicing variant discovery and differential comparison among samples
Current output: 120-180 million raw reads / lane
Multiplex level: 4-12 libraries / lane recommended
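The lane output and multiplex level above determine how many raw reads each library receives. A minimal sketch of that arithmetic, assuming libraries are pooled equally (the function name is illustrative, and raw reads are an upper bound on mapped reads):

```python
def reads_per_library(raw_reads_per_lane: float, libraries_per_lane: int) -> float:
    """Average raw reads per library when a lane is shared equally (assumption)."""
    if libraries_per_lane <= 0:
        raise ValueError("libraries_per_lane must be positive")
    return raw_reads_per_lane / libraries_per_lane


# Ranges quoted on the slide: 120-180 million raw reads per lane, 4-12 libraries per lane.
for lane_output in (120e6, 180e6):
    for n_libraries in (4, 12):
        per_library = reads_per_library(lane_output, n_libraries)
        print(f"{lane_output / 1e6:.0f}M reads/lane, {n_libraries} libraries: "
              f"{per_library / 1e6:.0f}M raw reads per library")
```

At 12 libraries on a 120-million-read lane, each library gets about 10 million raw reads, so the 10 million mapped-read minimum is only met if nearly every read maps.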
All RNA is not the same
Types of RNA:
Messenger RNA
Micro RNA
Long non-coding RNA
Ribosomal RNA
Methods for RNA enrichment prior to library construction
Poly(A)-RNA selection: by hybridization to oligo-dT beads
  Mature mRNA highly enriched
  Efficient for quantification of gene expression levels
  Limitation: 3’ bias correlating with RNA degradation
rRNA depletion: by hybridization to bead-bound rRNA probes
  rRNA sequence-dependent and species-specific
  All non-rRNA retained: premature mRNA, long non-coding RNA
Small RNA extraction: specific kits required to retain small RNA
  Optional fine size-selection by gel or column
Different methods capture different types of RNA
Methods: Poly(A)-RNA selection, rRNA depletion, Small RNA extraction
Messenger RNA X X
Micro RNA X X
Long non-coding RNA X
Ribosomal RNA X
Paraffin embedded vs fresh frozen
Fresh Frozen
[Figure: read quality]
First step: alignment
Or: assembly, then alignment
Alignment versus assembly
Assembly
Trinity, Cufflinks, ABySS
Particularly useful when no reference genome is available, like in bacterial transcriptomes
Alignment
Bowtie, BWA, MOSAIK
Maximum sensitivity, fewer false positives
RNA sequencing applications
Quantification of transcript expression levels
Detection of splice variation/different isoforms of the same gene
Allele specific expression levels
Strand specific expression levels
Detection of fusion transcripts (such as BCR-ABL in CML)
Detection of sequence variation (limited application)
Validation of DNA sequence variants
RNA-seq expression levels are linear where microarrays get saturated or are insensitive
Expression is measured as ‘reads per kilobase per million’ (RPKM) or ‘fragments per kilobase of exon per million fragments mapped’ (FPKM) to normalize for gene length and library size
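The normalization can be written out directly. A minimal sketch of the standard RPKM formula (the function name and example numbers are illustrative; FPKM uses the same formula with fragments counted instead of reads):

```python
def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp transcript in a 10-million-read library.
value = rpkm(500, 2_000, 10_000_000)
print(f"RPKM = {value:.1f}")

# Same read count on a transcript twice as long gives half the RPKM.
print(f"RPKM = {rpkm(500, 4_000, 10_000_000):.1f}")
```

Dividing by transcript length corrects for longer genes collecting more reads, and dividing by library size makes samples sequenced at different depths comparable.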
In GBM, the gene EGFR is frequently targeted by intragenic deletions
vIII deletion occurs in same domain as point mutations
Detecting EGFR transcript variants using RNA-seq data
SpliceSeq can detect splice variants http://bioinformatics.mdanderson.org/main/SpliceSeq:Overview
Allele-/Strand-specific RNA-seq
Haplotype-specific gene expression by computationally integrating RNA-seq with DNA SNP data
Strand-specific RNA-seq requires specific library preparation protocol
Costs more
Output more accurate, useful for analysis in absence of a reference genome
Identification of fusion transcripts
Popular methods search for
Read pairs that map to two different genes
Need to correct for gene homology
Reads that span fusion junction
Split reads in half and align separate halves
Make a database of all possible fusion junctions and align full reads
PRADA, MapSplice, TopHat
http://sourceforge.net/projects/prada/
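The "database of all possible fusion junctions" approach can be illustrated with a toy example. This is a sketch under stated assumptions, not the code used by PRADA, MapSplice, or TopHat: exon sequences are given as plain strings, gene A is the 5' partner, and the names `build_junction_db` and `flank` are hypothetical.

```python
from itertools import product


def build_junction_db(exons_a: dict, exons_b: dict, flank: int = 40) -> dict:
    """Join the 3' end of every gene A exon to the 5' start of every gene B exon.

    Returns {(exon_a_name, exon_b_name): junction_sequence}. Each junction keeps
    at most `flank` bases from either side, so a full read can align across it.
    """
    if flank <= 0:
        raise ValueError("flank must be positive")
    junctions = {}
    for (name_a, seq_a), (name_b, seq_b) in product(exons_a.items(), exons_b.items()):
        junctions[(name_a, name_b)] = seq_a[-flank:] + seq_b[:flank]
    return junctions


# Toy sequences, for illustration only.
exons_a = {"A_exon1": "ACGTACGTAA", "A_exon2": "TTGGCCAATT"}
exons_b = {"B_exon1": "GGGGCCCCAA", "B_exon2": "CATCATCATC"}

db = build_junction_db(exons_a, exons_b, flank=4)
for pair, sequence in db.items():
    print(pair, sequence)
```

The database grows with the product of the exon counts, which is why this strategy is usually restricted to candidate gene pairs rather than applied genome-wide.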
FGFR3-TACC3 fusion in GBM is the result of a local inversion
Fusion transcripts are often associated with copy number difference and genomic breakpoints
Copy number profile of two FGFR3-TACC3 cases in TCGA
6.4% of GBMs harbor transcript fusions involving EGFR
All fusions fall within the area of the EGFR amplification
[Figure: PRADA pipeline flowchart]
INPUTS: .fastq files [END1 & END2]; Config.txt [location of scripts and reference files]
Processing Module: read alignment, remap alignments, combine two ends, recalibrate quality scores; produces the preprocessing .bam file [PAIRED END]
Expression & QC Module [YES || NO || ONLY]: RNA-SeQC
Fusion Module [YES || NO || ONLY]
GUESS-ft [YES || NO || ONLY]: -geneA -geneB
OUTPUTS: RPKM & QC metrics; fusion candidates; supervised search evidence
Fusion Module
Discordant read pair: each end of the read pair maps uniquely to a distinct protein-coding gene.
Fusion spanning read: chimeric read that maps to a putative junction, while the mate read maps to either gene A or gene B.
[Figure: read pairs across Gene A and Gene B]
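The discordant read pair definition can be sketched as a simple count per gene pair. This is an illustration only, not PRADA's implementation: the `ReadPair` record and its fields are hypothetical, gene assignments are assumed to be already available for each read end, and orientation is ignored.

```python
from collections import Counter
from typing import NamedTuple, Optional


class ReadPair(NamedTuple):
    name: str
    gene_end1: Optional[str]  # protein-coding gene hit by end 1, None if unmapped
    gene_end2: Optional[str]  # protein-coding gene hit by end 2, None if unmapped
    unique_end1: bool         # True if end 1 has a single placement
    unique_end2: bool         # True if end 2 has a single placement


def count_discordant_pairs(pairs) -> Counter:
    """Count read pairs whose ends map uniquely to two different genes."""
    counts = Counter()
    for pair in pairs:
        if pair.gene_end1 is None or pair.gene_end2 is None:
            continue  # one end unmapped: not a discordant pair
        if not (pair.unique_end1 and pair.unique_end2):
            continue  # ambiguous placement
        if pair.gene_end1 == pair.gene_end2:
            continue  # both ends in the same gene: concordant
        counts[tuple(sorted((pair.gene_end1, pair.gene_end2)))] += 1
    return counts


pairs = [
    ReadPair("r1", "FGFR3", "TACC3", True, True),
    ReadPair("r2", "TACC3", "FGFR3", True, True),
    ReadPair("r3", "FGFR3", "FGFR3", True, True),   # concordant
    ReadPair("r4", "FGFR3", "TACC3", True, False),  # end 2 not unique
    ReadPair("r5", "EGFR", None, True, False),      # end 2 unmapped
]
counts = count_discordant_pairs(pairs)
print(counts)
```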
Structural transcript variants in low grade glioma
RNA-seq data from 272 TCGA low grade glioma
Fusion detection accuracy affected by:
PRADA detected 1,843 fusion transcripts
[Figure: number of mapped reads per sample vs. number of detected fusion transcripts per sample]
Filtering out artifacts
Homology E value larger than 0.01 (column Evalue)
No mismatches in junction spanning reads
Count the number of partner genes for each individual gene
Identify genes with fusions mapping to more than 10 different chromosome arms
970/1,843 fusions filtered
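The filtering criteria above can be expressed as a small function. This is a sketch under assumptions, not the filter actually applied to the low grade glioma data: the `Candidate` fields are hypothetical, candidates are kept when the homology E-value exceeds 0.01 and junction spanning reads have no mismatches, and genes whose partners fall on more than 10 chromosome arms are assumed to be artifacts and dropped.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    gene_a: str
    gene_b: str
    arm_a: str                # chromosome arm of gene_a, e.g. "4p"
    arm_b: str                # chromosome arm of gene_b
    evalue: float             # homology E-value between the partner genes
    junction_mismatches: int  # mismatches in junction spanning reads


def filter_artifacts(candidates, min_evalue=0.01, max_arms=10):
    # Keep candidates without significant partner homology and with
    # perfectly matching junction spanning reads.
    kept = [c for c in candidates
            if c.evalue > min_evalue and c.junction_mismatches == 0]

    # Collect partner genes and partner chromosome arms for each gene.
    partner_genes = defaultdict(set)
    partner_arms = defaultdict(set)
    for c in kept:
        partner_genes[c.gene_a].add(c.gene_b)
        partner_genes[c.gene_b].add(c.gene_a)
        partner_arms[c.gene_a].add(c.arm_b)
        partner_arms[c.gene_b].add(c.arm_a)

    # Assumption: a gene fused to partners on many arms is an artifact.
    promiscuous = {g for g, arms in partner_arms.items() if len(arms) > max_arms}
    filtered = [c for c in kept
                if c.gene_a not in promiscuous and c.gene_b not in promiscuous]
    partner_counts = {g: len(p) for g, p in partner_genes.items()}
    return filtered, partner_counts
```

Returning the partner counts alongside the filtered list keeps the per-gene tally available for inspection.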
Validation of predicted transcript fusions
509/970 fusions filtered
Define four tiers of fusion transcripts based on evidence
Tier 1: At least 3 discordant read pairs (DSP), two perfect match junction spanning reads (JSR), and both partner genes only fused to one other partner gene in the same sample
Tier 2: At least 2 DSP and 1 JSR, with a DNA breakpoint within a 100 kb window (using the matching DNA copy number profile)
Tier 3: At least 2 DSP and 1 JSR, unique partner genes, with predicted junction consistent for all
Tier 4: The rest
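The tier definitions can be read as a decision cascade. A minimal sketch, assuming tiers are tested in order and that "1 JSR" means at least one junction spanning read (the `FusionEvidence` fields are hypothetical names for the evidence listed above):

```python
from dataclasses import dataclass


@dataclass
class FusionEvidence:
    discordant_pairs: int              # DSP
    junction_reads: int                # JSR, any quality
    perfect_junction_reads: int        # JSR with no mismatches
    unique_partners: bool              # each gene fused to only one partner in this sample
    dna_breakpoint_within_100kb: bool  # supported by matching DNA copy number profile
    consistent_junction: bool          # all JSR agree on the predicted junction


def assign_tier(e: FusionEvidence) -> int:
    if e.discordant_pairs >= 3 and e.perfect_junction_reads >= 2 and e.unique_partners:
        return 1
    if e.discordant_pairs >= 2 and e.junction_reads >= 1:
        if e.dna_breakpoint_within_100kb:
            return 2
        if e.unique_partners and e.consistent_junction:
            return 3
    return 4


strong = FusionEvidence(5, 3, 2, True, False, True)
dna_supported = FusionEvidence(2, 1, 0, False, True, False)
rna_only = FusionEvidence(2, 1, 1, True, False, True)
weak = FusionEvidence(1, 0, 0, False, False, False)
print([assign_tier(e) for e in (strong, dna_supported, rna_only, weak)])
```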
Validation of RNA fusions using output of BreakDancer
BreakDancer detects DNA rearrangements in low pass sequencing data
Variant detection
Approximately 30% of mutations are covered sufficiently to be detected, at a validation rate of ~80% (from the TCGA clear cell renal cell carcinoma project).
The reverse transcriptase step that converts RNA to cDNA complicates detection of RNA edits and mutations.
RNA sequencing read alignment in PRADA
Transcripts from same gene
Reads are aligned to all possible transcripts
Reads are also aligned to genome
Final and single placement for each read is determined by re-mapping
PRADA alignments – advantages versus disadvantages
Advantage:
Alignment to DNA means mapping of unannotated transcripts
Alignment to transcriptome means mapping across exon-exon junctions
Disadvantage:
More conservative alignment than split-read
PRADA focuses on the analysis of paired-end RNA-sequencing data.
Four modules:
1. Processing
2. Expression and Quality Control
3. Gene fusion
4. GUESS-ft: General User dEfined Supervised Search for fusion transcripts
http://sourceforge.net/projects/prada/
Expression & QC Module
RNA-SeQC provides three types of quality control metrics:
  Read counts
  Coverage
  Correlation
RPKM values at transcript level
  For longest transcript
RNA-SeQC process (Java)
Implementation results
Samples processed: >400 KIRC, >170 GBM
Works well in MDACC HPC* system
PRADA fusion module validation rate: ~85% (53 out of 62)
RNA sequencing in The Cancer Genome Atlas
mRNA: poly-A mRNA purified from total RNA using poly-T oligo-attached magnetic beads
miRNA: Total RNA is mixed with oligo(dT) MicroBeads and loaded into MACS column, which is then placed on a MultiMACS separator. From the flow-through, small RNAs, including miRNAs, are recovered by ethanol precipitation.
Detecting fusion transcripts in GBM
KIRC fusion results
We analyzed 416 RNA-seq samples from clear cell renal carcinoma (ccRCC), available through TCGA.
We identified 80 bona fide fusion transcripts (57 intrachromosomal, 33 interchromosomal) in 62 individual samples.
“Recurrent” fusions:
  SFPQ-TFE3 (n=5, chr1-chrX)
  DHX33-NLRP1 (n=2, chr17)
  TRIP12-SLC16A14 (n=2, chr2)
  TFG-GPR128 (n=4, chr3)
KIRC fusion validation
Sample ID                     5’ Gene   3’ Gene   Discordant Read Pairs  Fusion Span Reads  Fusion Junction(s)  5’ Gene Chr  3’ Gene Chr  Validated?
TCGA-AK-3456-01A-02R-1325-07  TFE3      SFPQ      175                    129                1                   chrX         chr1         Yes
TCGA-AK-3456-01A-02R-1325-07  SFPQ      TFE3      116                    81                 1                   chr1         chrX         Yes
TCGA-A3-3313-01A-02R-1325-07  C6orf106  LRRC1     90                     40                 2                   chr6         chr6         Yes
TCGA-A3-3313-01A-02R-1325-07  CYP39A1   LEMD2     37                     9                  1                   chr6         chr6         Yes
TCGA-B2-4101-01A-02R-1277-07  FAM172A   FHIT      17                     4                  1                   chr5         chr3         Yes
TCGA-AK-3445-01A-02R-1277-07  KIAA0802  LRRC41    14                     6                  1                   chr18        chr1         Yes
TCGA-B0-5095-01A-01R-1420-07  GORASP2   WIPF1     14                     2                  1                   chr2         chr2         Yes
TCGA-A3-3313-01A-02R-1325-07  ZNF193    MRPS18A   11                     3                  1                   chr6         chr6         Yes
TCGA-A3-3313-01A-02R-1325-07  FTSJD2    GPX6      9                      8                  1                   chr6         chr6         Yes
TCGA-B0-4945-01A-01R-1420-07  KIAA0427  GRM4      8                      5                  1                   chr18        chr6         No
TCGA-B8-4143-01A-01R-1188-07  SLC36A1   TTC37     5                      5                  1                   chr5         chr5         No
PRADA fusion module validation rate: ~85% (11 out of 13), by RT-PCR and FISH assays
TFE3-SFPQ was validated in three individual samples
KIRC fusion validation: RT-PCR
FAM172A-FHIT
Figure 2. RT-PCR results for TFE3 fusion validations for sample TCGA-AK-3456 (SFPQ-TFE3 and TFE3-SFPQ).
TFG-GPR128 has been reported in other cancers
TCGA has 1,000s of RNA-seq samples. How can we quickly scan many samples for the presence of this fusion?
Supervised Search Module
GUESS-ft: General User dEfined Supervised Search for fusion transcripts
Input: BAM
Mapped to A or B: use high quality mapping reads only; check that read orientation fulfills the fusion schema; allow up to one mismatch.
A-B discordant reads: the two read ends map to A and B, respectively.
Unmapped reads: parse unmapped reads with the other end mapping to A or B.
Junction DB: map parsed reads to a DB of all possible exon junctions.
Junction spanning reads: list reads with one end mapping to a junction and the other mapping to A or B.
Output: summary report
Note: the workflow includes a time consuming step.
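The junction search with "up to one mismatch" can be sketched as a sliding comparison of each parsed read against each junction sequence. This is an illustration, not GUESS-ft's code (which relies on an aligner): the function names, the `min_anchor` requirement, and the toy sequences are assumptions.

```python
def spans_junction(read: str, junction_seq: str, breakpoint: int,
                   max_mismatches: int = 1, min_anchor: int = 3) -> bool:
    """True if the read matches a window of junction_seq that crosses the
    breakpoint with at least min_anchor bases on each side."""
    n = len(read)
    for start in range(len(junction_seq) - n + 1):
        end = start + n
        if breakpoint - start < min_anchor or end - breakpoint < min_anchor:
            continue  # window does not cross the junction far enough
        window = junction_seq[start:end]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            return True
    return False


def junction_spanning_reads(reads: dict, junction_db: dict) -> list:
    """List (read_name, junction_name) hits.

    reads: {read_name: sequence} for unmapped ends whose mate maps to A or B.
    junction_db: {junction_name: (sequence, breakpoint_index)}.
    """
    hits = []
    for read_name, read_seq in reads.items():
        for junction_name, (seq, breakpoint) in junction_db.items():
            if spans_junction(read_seq, seq, breakpoint):
                hits.append((read_name, junction_name))
    return hits


# Toy junction: 8 bases of gene A joined to 8 bases of gene B.
junction_db = {"A_exon1|B_exon1": ("ACGTACGT" + "TTGGCCAA", 8)}
reads = {
    "exact": "ACGTTTGG",     # crosses the junction, no mismatch
    "one_mm": "ACGTTAGG",    # crosses the junction, one mismatch
    "gene_a_only": "ACGTACGT",
}
hits = junction_spanning_reads(reads, junction_db)
print(hits)
```

Comparing every read with every junction scales with reads times junctions, which is consistent with the workflow flagging a time consuming step.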
Identification of TFG-GPR128 fusion
All available normal samples in CGHub
Subset of tumor samples selected based on RPKM expression pattern
Table. Samples across cancer types
Cancer Type                                    # of normal samples  # of tumor samples
Bladder Urothelial Carcinoma [BLCA]            0 (0%)               2 (3.6%)
Breast invasive carcinoma [BRCA]               1 (0.94%)            13 (1.6%)
Head and Neck squamous cell carcinoma [HNSC]   0 (0%)               6 (2.3%)
Kidney renal clear cell carcinoma [KIRC]       1 (1.5%)             5 (1.2%)
Kidney renal papillary cell carcinoma [KIRP]   0 (0%)               1 (5.9%)
Liver hepatocellular carcinoma [LIHC]          0 (0%)               1 (5.9%)
Lung adenocarcinoma [LUAD]                     0 (0%)               1 (0.79%)
Lung squamous cell carcinoma [LUSC]            0 (0%)               9 (4%)
Prostate adenocarcinoma [PRAD]                 1 (14.3%)            2 (1.9%)
Thyroid carcinoma [THCA]                       0 (0%)               2 (0.89%)
* All performed by PRADA fusion module.
Tumors with the fusion have higher GPR128 expression levels
RPKM expression pattern seen in KIRC tumors
Fusion sample(s)
Higher expression of GPR128 (activation)
TCGA-B0-5703: 1 discordant read pair in the tumor sample, 33 discordant read pairs in the matched normal