rna-seq from a bioinformatics perspective · fusion transcripts snv / indels novel transcripts....

RNA-seq from a bioinformatics perspective Harmen van de Werken Erasmus MC; Cancer Computational Biology Center (CCBC)

Post on 02-Aug-2020


Page 1: RNA-seq from a bioinformatics perspective · Fusion Transcripts SNV / InDels Novel Transcripts. Table 6-1 Molecular Biology of the Cell (© Garland Science 2008) THE HUMAN GENOME

RNA-seq from a bioinformatics perspective

Harmen van de Werken

Erasmus MC; Cancer Computational Biology Center (CCBC)


Outlook

RNA seq software + RNA-seq courses

Alternative splicing & Promoters

Introduction

RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts


Table 6-1 Molecular Biology of the Cell (© Garland Science 2008)


THE HUMAN GENOME

▪ Consensus
~ 22,500 protein-coding genes
~ 9,000 long non-coding RNAs
~ 2,500 – 3,000 small RNAs

▪ miTranscriptome (1)
~ 91,013 genes
~ 58,648 lncRNA genes

(1) Iyer MK et al. Nature Genetics 47, 199–208 (2015)

Transcriptomics of (Cancer) Tissue


Common mRNA-seq Workflow

Dry lab: Bioinformatics

Wet lab


(c)DNA Next Generation Sequencing (NGS)

ThermoFisher Ion Torrent

Personal Genome Machine (PGM)

PacBio RS II

Illumina HiSeq 2000


Illumina HiSeq 2000


Illumina Sequencing


Ion Torrent Platform


RNAseq Data Analysis

Alternative splicing & Promoters

RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts


Detecting Single Nucleotide Variants and small indels


Errors occur at each stage

Primary Analysis
- Incorrect base calling
- Homopolymer errors
- Phasing


Errors occur at each stage

Secondary Analysis: Read mapping
- Incorrect reference sequence
- Pseudogenes
- Indels
- Complex variants


Errors occur at each stage

Secondary Analysis: Variant calling

Variant calling filters are heuristics; they will therefore generate both false negatives and false positives, and are best applied as soft filters.
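
The idea of a soft filter can be sketched in Python as follows. This is a minimal illustration, not the behaviour of any particular variant caller: the field names (`depth`, `qual`, `alt_reads`), the thresholds, and the flag labels are all hypothetical. Each variant keeps its record and gains a filter annotation instead of being discarded.

```python
# Soft filtering sketch: annotate variants with the filters they fail
# instead of deleting them. All field names, thresholds, and flag labels
# are hypothetical and chosen only for illustration.

def soft_filter(variant, min_depth=10, min_qual=30.0, min_alt_fraction=0.05):
    """Return a copy of `variant` with a 'filter' field listing failed checks."""
    flags = []
    if variant["depth"] < min_depth:
        flags.append("LowDepth")
    if variant["qual"] < min_qual:
        flags.append("LowQual")
    # Fraction of reads supporting the alternative allele (0 if no coverage).
    alt_fraction = variant["alt_reads"] / variant["depth"] if variant["depth"] else 0.0
    if alt_fraction < min_alt_fraction:
        flags.append("LowAltFraction")
    annotated = dict(variant)
    annotated["filter"] = ";".join(flags) if flags else "PASS"
    return annotated


variants = [
    {"id": "var1", "depth": 120, "qual": 55.0, "alt_reads": 40},
    {"id": "var2", "depth": 6, "qual": 48.0, "alt_reads": 3},
    {"id": "var3", "depth": 200, "qual": 12.0, "alt_reads": 4},
]
annotated = [soft_filter(v) for v in variants]
for v in annotated:
    print(v["id"], v["filter"])
```

Because flagged variants stay in the output, a call that narrowly fails one threshold can still be reviewed by hand rather than silently becoming a false negative.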


Errors occur at each stage

Tertiary Analysis

- Incorrect gene annotation

- Contamination in reference databases


False Negative: EGFR c.2237_2259del,insCCAACAAGGAA


False Negative BRAF p.V600R; False Positives BRAF p.V600G & p.V600M


RNAseq Data Analysis

Alternative splicing & Promoters

RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts


Differential Gene Expression mRNA-seq Workflow


Fig 1. FastQC report on per-position base quality and overrepresented sequences

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC

Fig 2. FASTQ format of one read
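
To make the four-line FASTQ layout concrete, the read shown above can be decoded with a few lines of Python. This is a sketch under one stated assumption: the quality string uses Phred+33 (Sanger) encoding, so each quality score is the character code minus 33. The function names are illustrative only.

```python
# Decode the FASTQ record from the slide.
# Assumption: Phred+33 (Sanger) encoding, i.e. quality = ord(char) - 33.

record = """@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC"""


def parse_fastq_record(text, offset=33):
    """Split one four-line FASTQ record into (read id, sequence, Phred scores)."""
    header, sequence, separator, quality = text.strip().split("\n")
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    phred = [ord(ch) - offset for ch in quality]
    return header[1:], sequence, phred


def error_probability(q):
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


read_id, sequence, phred = parse_fastq_record(record)
print(read_id)
print("lowest and highest quality:", min(phred), max(phred))
print("error probability at the lowest quality:", error_probability(min(phred)))
```

Under this encoding the character `I` corresponds to Q40 (one expected error in 10,000 calls) and `9` to Q24, which is why quality typically drops toward the end of the read.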


mRNA-seq alignment

Courtesy: Wikipedia


Alignment to transcriptome

Alignment to reference genome


RNA-Seq - Alignment

Alignment algorithms need:
• Reference sequence
• Transcriptome database (optional)

Algorithms commonly used for RNA-Seq alignment:
• TopHat
• STAR
• HISAT2


Visualization of NGS Transcriptomics and Genomics data


RNA-Seq - Alignment/QC


RNA-Seq - Stranded


Differential expression

Rakesh Kaundal et al.


Normalization of RNA-seq

Total count (TC): Gene counts are divided by the total number of mapped reads.

Upper Quartile (UQ): Very similar in principle to TC; the total counts are replaced by the upper quartile of counts.

Median (Med): Also similar to TC; the total counts are replaced by the median counts.

Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR Bioconductor package (Robinson et al., 2010).

Quantile (Q): First proposed in the context of microarray data, this normalization method consists of matching distributions of gene counts across lanes.

Reads Per Kilobase per Million mapped reads (RPKM): This approach quantifies gene expression from RNA-Seq data by normalizing for the total transcript length and the number of sequencing reads.
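
Three of the normalizations listed above (TC, UQ and RPKM) are simple enough to sketch in plain Python. The count table and transcript lengths below are made up, the function names are illustrative, and the upper quartile is computed over non-zero counts with linear interpolation, which is one common convention rather than the only one.

```python
# Toy normalization sketch. counts[g][s] = reads for gene g in sample s.
# All numbers are invented for illustration.

def column(counts, s):
    return [row[s] for row in counts]


def total_count(counts, per=1e6):
    """TC: divide by the sample's total mapped reads, scaled to `per` reads."""
    totals = [sum(column(counts, s)) for s in range(len(counts[0]))]
    return [[c / totals[s] * per for s, c in enumerate(row)] for row in counts]


def upper_quartile(counts):
    """UQ: divide by the 75th percentile of the sample's non-zero counts."""
    factors = []
    for s in range(len(counts[0])):
        nonzero = sorted(c for c in column(counts, s) if c > 0)
        position = 0.75 * (len(nonzero) - 1)  # linear interpolation
        low = int(position)
        high = min(low + 1, len(nonzero) - 1)
        factors.append(nonzero[low] + (position - low) * (nonzero[high] - nonzero[low]))
    return [[c / factors[s] for s, c in enumerate(row)] for row in counts]


def rpkm(counts, lengths_bp):
    """RPKM: reads per kilobase of transcript per million mapped reads."""
    totals = [sum(column(counts, s)) for s in range(len(counts[0]))]
    return [
        [c / (lengths_bp[g] / 1e3) / (totals[s] / 1e6) for s, c in enumerate(row)]
        for g, row in enumerate(counts)
    ]


# Sample 2 was sequenced exactly twice as deeply as sample 1.
counts = [[10, 20], [30, 60], [60, 120]]
lengths_bp = [1000, 2000, 3000]

tc = total_count(counts)
uq = upper_quartile(counts)
rk = rpkm(counts, lengths_bp)
for name, table in (("TC", tc), ("UQ", uq), ("RPKM", rk)):
    print(name, table)
```

In this toy table the two samples differ only in sequencing depth, so every method returns identical values for both columns; RPKM additionally removes the advantage that longer transcripts have in collecting reads.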


Reduce Dimensions PCA /QC

Figure 1. Principal Component Analysis (PCA) of a multivariate Gaussian distribution. PCA is a linear algorithm; it cannot capture complex non-linear (for example polynomial) relationships between features.

Figure 2. Principal Component Analysis (PCA) of Colon and Ovarian Cancer cell lines.
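
As a rough illustration of what PCA computes, the sketch below finds the first principal component of a tiny, invented expression matrix by power iteration on the covariance matrix. This is an assumption-laden teaching example: real analyses would use an SVD routine from a numerical library, and the sample values and function name are hypothetical.

```python
import math


def first_principal_component(samples, iterations=100):
    """Return (loadings, scores) of PC1 via power iteration on the covariance matrix."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Arbitrary non-uniform start vector; power iteration converges to the
    # dominant eigenvector unless the start is exactly orthogonal to it.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the component to get its PC1 score.
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores


# Hypothetical expression values: rows are samples, columns are genes.
# Samples 0-1 and samples 2-3 form two groups; gene 3 does not vary.
samples = [
    [10.0, 1.0, 5.0],
    [11.0, 2.0, 5.0],
    [1.0, 10.0, 5.0],
    [2.0, 11.0, 5.0],
]
loadings, scores = first_principal_component(samples)
print("loadings:", loadings)
print("scores:", scores)
```

The two groups end up on opposite sides of zero along PC1, and the constant gene receives a loading of zero, which is the same kind of separation a PCA plot is used to check during QC.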


Reduce Dimensions t-SNE

Figure 1. t-distributed Stochastic Neighbor Embedding (t-SNE), a non-linear algorithm, applied to Colon and Ovarian Cancer cell lines.


Clustering Analysis/ QC

Fig 1: Example of hierarchical clustering: clusters are successively merged with their nearest neighbouring cluster. The length of the vertical dendrogram lines reflects the nearness. (Jansen et al.)
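
The merging procedure described in the caption can be sketched in a few lines of Python. This is a minimal agglomerative clustering with Euclidean distance and single linkage; both choices are assumptions made for the example (the slide does not say which distance or linkage was used), and the points are invented.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clustering(points):
    """Single-linkage agglomerative clustering.

    Returns the list of merges as (members_a, members_b, distance); the
    distance is the height at which the two branches join in a dendrogram.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Four hypothetical samples forming two tight pairs.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
merges = hierarchical_clustering(points)
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 3))
```

The two tight pairs merge first at a small height, and the final merge between the pairs happens much higher up, which is what long vertical lines in a dendrogram indicate.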


Clustering Analysis/ QC

Figure 1. Hierarchical clustering of Colon and Ovarian Cancer cell lines.


Differential expression DESeq2

A common difficulty in the analysis of read count data is the strong variance of Log Fold Change (LFC) estimates for genes with low read count.
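
A small numeric sketch shows why low counts make LFC estimates unstable. The read counts are invented, and the pseudocount used at the end is only a crude stand-in to show the direction of the effect; it is not how DESeq2 works, which uses an empirical Bayes shrinkage of the LFC estimates.

```python
import math


def log2_fold_change(count_a, count_b, pseudocount=0.0):
    """log2 ratio of condition B over condition A, optionally with a pseudocount."""
    return math.log2((count_b + pseudocount) / (count_a + pseudocount))


# The same absolute difference of two reads in a low and a high count gene.
low_gene = log2_fold_change(1, 3)
high_gene = log2_fold_change(1000, 1002)
print("low-count gene LFC: ", round(low_gene, 3))
print("high-count gene LFC:", round(high_gene, 5))

# Crude damping with a hypothetical pseudocount of 5 reads.
damped = log2_fold_change(1, 3, pseudocount=5)
print("low-count gene LFC with pseudocount:", round(damped, 3))
```

Two extra reads, which can easily arise from sampling noise, triple the expression estimate of the low-count gene but barely move the high-count gene, so unadjusted LFC rankings tend to be dominated by weakly expressed genes.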


Differential expression of genes

Test for differential gene expression, with correction for multiple testing
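
One widely used correction is the Benjamini-Hochberg procedure, which controls the false discovery rate. The slide does not name a specific method, so treating it as the correction here is an assumption; the p-values below are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a differential expression test.
pvalues = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvalues)
for p, q in zip(pvalues, padj):
    print(p, "->", round(q, 4))
```

With thousands of genes tested at once, filtering on the adjusted value rather than the raw p-value keeps the expected share of false discoveries among the reported genes below the chosen cutoff.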


Gene Set Enrichment Analysis: GO and KEGG database


Gene Set Enrichment Analysis: GO and KEGG database


RNAseq Data Analysis

Alternative splicing & Promoters

RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts


Fusion Gene Detection

Fig 1. RNA-seq mapping of short reads over exon-exon junctions; depending on where the two ends map, a junction can be classified as a trans or a cis event. (Wikipedia)


Fusion Gene Detection


Fusion Gene Detection


FusionCatcher Tool

FusionCatcher outperforms other tools by using multiple aligners:

● Bowtie
● Bowtie2
● BLAT
● STAR


RNAseq Data Analysis

Alternative splicing & Promoters

RNAseq data

Differential expression

Read-Through & Fusion Transcripts

SNV / InDels

Novel Transcripts


RNA-seq de novo Assembly

● Define the whole transcriptome without a reference.

● Trinity


RNA-seq Analysis Software


RNA-seq Molmed Courses

● Basic Course on 'R'
● Galaxy for NGS
● Workshop Ingenuity Pathway Analysis (IPA) + CLC Workbench / Ingenuity Variant Analysis
● Gene expression data analysis using R: How to make sense out of your RNA-Seq/microarray data


Take Home Message

Think before you start


Thank you for your attention

Hematology

Mathijs A. Sanders

Remco Hoogenboezem

CCBC

Job van Riet

Wesley van de Geer


[email protected]

ccbc.erasmusmc.nl

@ErasmusMC_CCBC

Harmen van de Werken

Cancer Computational Biology Center (CCBC)


● Negative Binomial
● Models count as ‘binomial successes until a set number of failures’, which better fits the RNA-Seq fragment generation (limited reagent)
● Allows/captures the ‘overdispersion’ seen in RNA-Seq experiments.
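
The overdispersion mentioned above can be checked numerically. The sketch below uses the mean and dispersion parameterization of the negative binomial that is common in RNA-Seq tools, where the variance is mu + alpha * mu^2 (a Poisson model would give a variance equal to mu). The values of `mu` and `alpha` are hypothetical.

```python
import math


def nb_pmf(k, mu, alpha):
    """P(K = k) for a negative binomial with mean mu and dispersion alpha."""
    r = 1.0 / alpha        # "number of failures" (size) parameter
    p = r / (r + mu)       # probability parameter implied by the mean
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


mu, alpha = 10.0, 0.5      # hypothetical mean count and dispersion
ks = range(2000)           # far enough into the tail for these values

mean = sum(k * nb_pmf(k, mu, alpha) for k in ks)
variance = sum((k - mean) ** 2 * nb_pmf(k, mu, alpha) for k in ks)

print("mean:", round(mean, 6))
print("variance:", round(variance, 6))
print("mu + alpha * mu^2:", mu + alpha * mu ** 2)
```

For these values the variance is six times the mean, an excess that a Poisson model cannot represent and that the dispersion parameter is there to absorb.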


Outline


RNA-Seq - Why


RNA-Seq - Differential splicing


Conclusion