rna-seq: general concept, goal and experimental design - part 1

Defining the goal of RNA-seq analysis for differential expression Joachim Jacob 20 and 27 January 2014 This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.


First part of the presentation slides of 'RNA-seq for DE analysis.'. See http://www.bits.vib.be for more information.


Page 1: RNA-seq: general concept, goal and experimental design - part 1

Defining the goal of RNA-seq analysis for differential expression

Joachim Jacob

20 and 27 January 2014


Page 2: RNA-seq: general concept, goal and experimental design - part 1

Great power comes with great responsibility

RNA-seq enables one to

1) get an overview of which genes are active

2) quantify expression of each transcript

3) quantify alternative splicing

… (use your imagination)

Principles of transcriptome analysis and gene expression quantification: an RNA-seq tutorial. http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12109/abstract

Page 3: RNA-seq: general concept, goal and experimental design - part 1

Great power comes with great responsibility

You can't do it all

RNA-seq is powerful, but we have to aim for a specific goal.

Our goal is to detect differential expression at the gene level.

Page 4: RNA-seq: general concept, goal and experimental design - part 1

Differential expression: useful?

What are we looking for? Explanations of observed phenotypes

yeast

GDA

Yeast mutant

GDA + vit C

why?

Page 5: RNA-seq: general concept, goal and experimental design - part 1

The central dogma

yeast

GDA

Yeast mutant

GDA + vit C

?

causes the phenotypic differences

Page 6: RNA-seq: general concept, goal and experimental design - part 1

The central dogma

yeast

GDA

Yeast mutant

GDA + vit C

?

Difference in protein activity causes the phenotypic differences

Page 7: RNA-seq: general concept, goal and experimental design - part 1

The central dogma

yeast

GDA

Yeast mutant

GDA + vit C

?

Presence/concentration of proteins in a cell causes the phenotypic differences

Page 8: RNA-seq: general concept, goal and experimental design - part 1

The central dogma

yeast

GDA

Yeast mutant

GDA + vit C

?

Level of protein production causes the phenotypic differences

Page 9: RNA-seq: general concept, goal and experimental design - part 1

The central dogma

yeast

GDA

Yeast mutant

GDA + vit C

?

Level of templates for protein production causes the phenotypic differences

Page 10: RNA-seq: general concept, goal and experimental design - part 1

The central dogma

yeast

GDA

Yeast mutant

GDA + vit C

?

Level of mRNA copies causes the phenotypic differences

Page 11: RNA-seq: general concept, goal and experimental design - part 1

Does it hold?

Difference in protein activity

Level of mRNA copies

Level of templates for protein production

Level of protein production

Presence/concentration of proteins in a cell

Phenotype

Page 12: RNA-seq: general concept, goal and experimental design - part 1

Problem reduction

We can measure mRNA levels (much easier than protein levels).

So we measure mRNA.

The level of mRNA is a proxy for the level of protein activity causing the aberrant phenotype.

Page 13: RNA-seq: general concept, goal and experimental design - part 1

How to measure mRNA

1. Q-PCR (real-time): a lot of work to measure a few genes, in a relatively wide array of tissues. Very accurate.

2. Microarray: an easier way to measure many predefined genes in a relatively wide array of tissues. Robust.

3. RNA-seq

Page 14: RNA-seq: general concept, goal and experimental design - part 1

RNA-seq protocol in a nutshell

● Get your sample

● Lyse the cells and extract RNA

● Convert the RNA to cDNA

● The cDNA pool gets sequenced

The result is sequence information from scratch. No prior information is needed.

Yeast sample

Comprehensive comparative analysis of strand-specific RNA sequencing methods http://www.nature.com/nmeth/journal/v7/n9/full/nmeth.1491.html

Comparative analysis of RNA sequencing methods for degraded or low-input samples http://www.nature.com/nmeth/journal/v10/n7/full/nmeth.2483.html

Page 15: RNA-seq: general concept, goal and experimental design - part 1

The predecessors of RNA-seq

● ESTs: expressed sequence tags, ideal for discovery of new genes.

● SAGE: serial analysis of gene expression, measurement of number of copies of mRNA

http://www.montana.edu/observatory/people/mcdermottlab.html

Page 16: RNA-seq: general concept, goal and experimental design - part 1

The predecessors of RNA-seq

● ESTs: expressed sequence tags, ideal for discovery of new genes.

● SAGE: serial analysis of gene expression, measurement of number of copies of mRNA

http://www.sagenet.org/findings/index.html

Page 17: RNA-seq: general concept, goal and experimental design - part 1

The predecessors of RNA-seq

● ESTs: expressed sequence tags

● SAGE: serial analysis of gene expression

Low throughput: long sequence information, but for only ~thousands of genes.

Page 18: RNA-seq: general concept, goal and experimental design - part 1

Concept of measuring with RNA-seq

Extract mRNA and turn into cDNA

Fragment, ligate adaptor, amplify.

Put a fraction of the pool on sequencer to read fragments.

One template of protein production

Figure: All things must pass: contrasts and commonalities in eukaryotic and bacterial mRNA decay, Nature Reviews Molecular Cell Biology 11, 467–478

GeneA GeneB GeneC

Page 19: RNA-seq: general concept, goal and experimental design - part 1

RNA-seq protocol in a nutshell

Yeast sample

Page 20: RNA-seq: general concept, goal and experimental design - part 1

So many steps where our assumption can fail

Phenotype

Proteins

mRNA levels

cDNA pool

RNA-seq reads

Represent the cDNA pool we've created

Represent the RNA pool we've extracted

Are a proxy for protein activity

Define the phenotype

Page 21: RNA-seq: general concept, goal and experimental design - part 1

So many steps where our assumption can fail

Phenotype

Proteins

mRNA levels

cDNA pool

RNA-seq reads

Protein activity is regulated: phosphorylation, ubiquitination, ...

mRNA templates have different speeds of protein production: availability of tRNAs, rate of mRNA degradation, alternative splicing events, ...

Loss on RNA extraction; 90% of the RNA in a cell is rRNA; ligation of adapters and conversion to cDNA are not 100% efficient.

Failure to map reads to the correct gene, lane-specific biases on reading cDNA fragments, ...

Page 22: RNA-seq: general concept, goal and experimental design - part 1

Consequence: focus on comparison

Phenotype A

Proteins

mRNA levels

cDNA pool

RNA-seq reads

Phenotype B

Proteins

mRNA levels

cDNA pool

RNA-seq reads

Possibly due to differences in expression

Page 23: RNA-seq: general concept, goal and experimental design - part 1

Consequence: focus on comparison

Phenotype A

Proteins

mRNA levels

cDNA pool

RNA-seq reads

Phenotype B

Proteins

mRNA levels

cDNA pool

RNA-seq reads

DESIGN OF EXPERIMENT

Page 24: RNA-seq: general concept, goal and experimental design - part 1

Comparing number of reads to genes

GeneA GeneB GeneC

sample

RNA-seq

Obviously, the number of reads is dependent on:

1. the expression level of the gene

2. the total number of reads generated

3. the length of the transcript

OUR QUESTION

Normalisation is needed!
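The three dependencies listed above can be put into a toy model. This is a minimal sketch, assuming that a gene's share of the sequenced fragments is proportional to its expression level times its transcript length; the function name `expected_reads` and all numbers are hypothetical and only serve as illustration.

```python
def expected_reads(expression, length, total_reads, pool):
    """Expected reads for one gene in a toy model.

    pool: list of (expression, length) pairs for every gene in the sample.
    Assumption: a gene's share of the fragments is proportional to
    expression * length.
    """
    total_mass = sum(e * l for e, l in pool)
    return total_reads * (expression * length) / total_mass


# Hypothetical genes: (mRNA copies per cell, transcript length in bp)
pool = [(10, 1000), (10, 2000), (40, 1000)]

for expression, length in pool:
    at_20m = expected_reads(expression, length, 20_000_000, pool)
    at_40m = expected_reads(expression, length, 40_000_000, pool)
    print(expression, length, round(at_20m), round(at_40m))
```

In this model, doubling the total number of reads doubles every count, and a transcript that is twice as long collects twice as many reads at the same expression level.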

Page 25: RNA-seq: general concept, goal and experimental design - part 1

Experimental design

Our focus: which genes are differentially expressed between different conditions?

Obviously, the number of reads is dependent on:

1. the expression level of the gene

2. the total number of reads generated

3. the length of the transcript

Which normalisation is needed?

How many reads to sequence?

Page 26: RNA-seq: general concept, goal and experimental design - part 1

Experimental design

Our focus: which genes are differentially expressed between different conditions?

“How can we detect genes for which the counts of reads change between conditions more systematically than as expected by chance”

We must design an experiment in which we can test this deviance from chance.

Oshlack et al. 2010. From RNA-seq reads to differential expression results. Genome Biology 2010, 11:220 http://genomebiology.com/2010/11/12/220

Page 27: RNA-seq: general concept, goal and experimental design - part 1

How many reads to sequence?

In other words: how deep to sequence? What is the required 'depth of sequencing'?

Figure: two samples, each sequenced with RNA-seq, giving read counts for GeneA, GeneB and GeneC.

The final test will look at ratios:

            GeneA   GeneB   GeneC
sample 1        6       5       3
sample 2        5       6       4
ratio         1.2    0.83    0.75

Page 28: RNA-seq: general concept, goal and experimental design - part 1

How many reads to sequence?

The difference between the lowest gene count and the highest gene count is typically a factor of 10^5. This is called the dynamic range.

Linear scale is useless. The logarithmic scale is better.

Wait! Something's not correct here!

Page 29: RNA-seq: general concept, goal and experimental design - part 1

Zero remains zero!

We are working with counts. An observed count is either 0 or >= 1. A gene with zero counts either has not been sequenced yet (not deep enough) or is not expressed in that condition.

It is not a full logarithmic scale. It starts at zero.

0

Page 30: RNA-seq: general concept, goal and experimental design - part 1

So keep all counts above zero?

Assuming equal sequencing depth in the samples, and these counts: do all these genes differ in expression?

Gene     sample 1   sample 2   RATIO
GeneA           5         10       2
GeneB          15         30       2
GeneC          40         80       2
GeneD         100        200       2
GeneE        1000       2000       2
GeneZ           1          2       2
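One way to see that the same ratio does not carry the same evidence is an exact test on each pair of counts. The sketch below assumes only Poisson (technical) noise and equal sequencing depth: given the total of the two counts, the first count then follows a Binomial(n, 0.5) distribution if the gene is not differentially expressed. The function name is hypothetical, and real data need a model that also covers biological variation.

```python
from math import comb


def two_sided_binomial_p(count_a, count_b):
    """Exact two-sided p-value for H0: both counts share one Poisson mean.

    Conditional on n = count_a + count_b, count_a follows Binomial(n, 0.5)
    under H0 (equal sequencing depth assumed).
    """
    n = count_a + count_b
    lower = sum(comb(n, k) for k in range(0, count_a + 1)) / 2**n
    upper = sum(comb(n, k) for k in range(count_a, n + 1)) / 2**n
    return min(1.0, 2 * min(lower, upper))


table = {"GeneZ": (1, 2), "GeneA": (5, 10), "GeneB": (15, 30),
         "GeneC": (40, 80), "GeneD": (100, 200), "GeneE": (1000, 2000)}

for gene, (a, b) in table.items():
    print(f"{gene}: ratio {b / a:.0f}, p = {two_sided_binomial_p(a, b):.3g}")
```

All genes have a ratio of 2, but the p-value shrinks as the counts grow: 1 versus 2 is fully compatible with chance, whereas 1000 versus 2000 is not.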

Page 31: RNA-seq: general concept, goal and experimental design - part 1

So keep everything above zero?

Gene     sample 1   sample 2   RATIO
GeneA          11         10    0.91
GeneB          11         30    2.72
GeneC          60         80    1.33
GeneD          79        200    2.53
GeneE        1150       2000    1.74
GeneZ           5          1    0.20

Is the ratio still 2? Is there a trend in how these numbers change?

Sequencing the result of the same steps again is called a technical replicate.

Page 32: RNA-seq: general concept, goal and experimental design - part 1

Technical replicates

Gene     sample   sample   sample   sample
GeneA        11        5        4        4
GeneB        11       16       14        8
GeneC        60       45       32       38
GeneD        79      102       95      110
GeneE      1150     1023      987     1005
GeneZ         3        0        0        1

We take the same cDNA pool and sequence it several times: technical replicates.

Page 33: RNA-seq: general concept, goal and experimental design - part 1

The Poisson distribution

The counts of technical replicates follow a Poisson distribution (Marioni et al. 2008). The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare.

Figure (from Wikipedia): Poisson distributions for different values of lambda. These can be 3 different genes, each with its own Poisson distribution. Lambda is the mean number of reads of the gene's distribution.

Y-axis: probability of observing that number of reads.
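The curves in such a figure follow from the Poisson probability P(k) = lambda^k * e^(-lambda) / k!. A minimal sketch that tabulates it for three genes; the lambda values are assumptions chosen for illustration, since the slide does not state them.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of observing exactly k reads when the mean is lam."""
    return lam**k * exp(-lam) / factorial(k)


# Hypothetical mean read counts for three genes
for lam in (1, 4, 10):
    row = " ".join(f"{poisson_pmf(k, lam):.3f}" for k in range(0, 13))
    print(f"lambda={lam:>2}: {row}")
```

Each row gives the probability of seeing 0, 1, 2, ... 12 reads for that gene; the probabilities of all possible counts sum to 1.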

Page 34: RNA-seq: general concept, goal and experimental design - part 1

The Poisson distribution

So when we have 4 technical replicates, each sequenced to a great depth (say 10 M reads), we can get these numbers by chance for 3 different genes.

GeneA 0, 0, 1, 3

GeneB 2, 3, 4, 7

GeneC 8, 9, 11, 14
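Numbers like these are easy to generate. The sketch below draws 4 technical replicates per gene from a Poisson distribution, using Knuth's multiplication method so that only the standard library is needed. The expected counts per gene are assumptions for illustration, not values from the slide.

```python
import random
from math import exp


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


rng = random.Random(2014)  # fixed seed so the run is repeatable
# Hypothetical expected counts per gene at the chosen depth
expected = {"GeneA": 1, "GeneB": 4, "GeneC": 10}

for gene, lam in expected.items():
    replicates = sorted(poisson_draw(lam, rng) for _ in range(4))
    print(gene, replicates)
```

Changing the seed gives a different set of counts each time, which is exactly the chance variation between technical replicates.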

Page 35: RNA-seq: general concept, goal and experimental design - part 1

Working the intuition

How many blue balls? How many red balls?

Draw 10

Draw 10 more

Draw 10 more

Estimate how large the fraction is in the set?

Page 36: RNA-seq: general concept, goal and experimental design - part 1

The intuition with the balls

Color 10 draws 20 draws 30 draws 40 draws

Blue

Red

No color
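The table can also be filled in by simulation. A minimal sketch, assuming a hypothetical pool with 60% blue, 10% red and 30% uncoloured balls, drawn with replacement; the pool composition and the seed are made up.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical pool: 60% blue, 10% red, 30% without colour
pool = ["blue"] * 600 + ["red"] * 100 + ["none"] * 300


def estimated_fractions(draws):
    """Fraction of each colour among the balls drawn so far."""
    return {c: draws.count(c) / len(draws) for c in ("blue", "red", "none")}


drawn = []
for step in range(4):
    drawn += [rng.choice(pool) for _ in range(10)]  # draw 10 more, with replacement
    estimate = estimated_fractions(drawn)
    print(len(drawn), {c: round(f, 2) for c, f in estimate.items()})
```

Running this with different seeds shows the pattern of the exercise: the estimate for the large fraction tends to settle after few draws, while the estimate for the rare colour keeps jumping.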

Page 37: RNA-seq: general concept, goal and experimental design - part 1

Conclusion of the experiment

The bigger the fraction in the pool, the sooner (i.e. with less sequencing depth) we are certain about the estimate of that fraction.

For lower counts, the variance is relatively bigger than for higher counts.

estimate = count; variance = count

CV (coefficient of variation) = sqrt(count)/count = 1/sqrt(count)

Genes with lower expression need much deeper sequencing than genes with higher expression levels.
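The CV formula can be tabulated directly. A small sketch, with counts chosen arbitrarily to span the dynamic range:

```python
from math import sqrt


def poisson_cv(count):
    """Coefficient of variation of a Poisson count: sqrt(count) / count."""
    return sqrt(count) / count


for count in (1, 10, 100, 1000, 10000):
    print(f"count {count:>5}: CV = {poisson_cv(count):.3f}")
```

A count of 1 has a CV of 100%, a count of 100 a CV of 10%, and a count of 10,000 a CV of 1%: to reduce the relative error tenfold, the count has to go up a hundredfold.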

Page 38: RNA-seq: general concept, goal and experimental design - part 1

Comparing counts

“Here we show the overlap of Poisson distributions of single measurements at different read counts. Because relative Poisson uncertainty is high at low read counts, a count of 1 versus 2 has very little power to discriminate a true 2X fold change, though at higher counts a 2X fold change becomes significant.

In an actual experiment, the width of the distribution would be greater due to additional biological and technical uncertainty, but the uncertainty to the mean expression would narrow with each additional replicate.”

Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression. Bioinformatics (2013) doi: 10.1093/bioinformatics/btt015
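The overlap described in the quote can be computed for single Poisson measurements. This is a rough sketch, not Scotty's method: it sums, over all possible counts, the probability mass shared by the two distributions. The function names and the chosen count pairs are assumptions.

```python
from math import exp, lgamma, log


def poisson_pmf(k, lam):
    """Poisson probability, computed in log space to stay stable for large k."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


def overlap(lam_a, lam_b):
    """Shared probability mass of two Poisson distributions (0 = none, 1 = identical)."""
    upper = int(max(lam_a, lam_b) * 3 + 50)  # far enough into the tail
    return sum(min(poisson_pmf(k, lam_a), poisson_pmf(k, lam_b))
               for k in range(upper))


for low in (1, 10, 100, 1000):
    print(f"{low} vs {2 * low}: overlap = {overlap(low, 2 * low):.3f}")
```

For 1 versus 2 reads the two distributions share most of their mass, so a single measurement cannot tell them apart; for 100 versus 200 the overlap is practically zero.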

Page 39: RNA-seq: general concept, goal and experimental design - part 1

Comparing technical replicates

Risso et al. “GC-Content Normalization for RNA-Seq Data” BMC Bioinformatics 2011, 12:480

http://www.biomedcentral.com/1471-2105/12/480 - EDASeq package (R)

Figure: mean versus variance, both axes in log2 of the counts. Two lines are annotated: "Correlation between mean and variance according to Poisson" and "Lowess fit through the data".

Page 40: RNA-seq: general concept, goal and experimental design - part 1

But Poisson does not seem to fit

Extending the samples to real biological samples, this mean-variance relationship does not hold...

Plotted using the EDASeq package in R.

Page 41: RNA-seq: general concept, goal and experimental design - part 1

But Poisson does not seem to fit

Extending the samples to real biological samples, this mean-variance relationship does not hold!

Plotted using the EDASeq package in R.

Reasonable fit

Something is going on!

Page 42: RNA-seq: general concept, goal and experimental design - part 1

An extra source of variation

Compared to the Poisson distribution, the counts are 'overdispersed': between biological replicates, the variance is bigger than Poisson predicts for higher counts.

Plotted using the EDASeq package in R.

Something is going on!

Page 43: RNA-seq: general concept, goal and experimental design - part 1

An extra source of variation

For Poisson: CV = std dev / mean => $CV^2 = 1/\mu$

If an additional distribution is involved (also dependent on π, the fraction of the gene in the cDNA pool), we have a mixture of distributions:

$CV^2 = 1/\mu + \phi$

The first term, $1/\mu$, matters for low counts; the second term, $\phi$, is the dispersion.

Generalization of Poisson with this extra parameter: the Negative Binomial Model fits better!

Page 44: RNA-seq: general concept, goal and experimental design - part 1

The negative binomial model

The NB model fits observed expression data of RNA-seq better. It is a generalization of Poisson, and 2 parameters need to be estimated (μ and φ)

The count of gene g in sample j has:

Mean = $\mu_{gj}$

Variance = $\mu_{gj} + \phi_g \mu_{gj}^2$

Biological $CV^2 = \phi_g$

=> Biological $CV = \sqrt{\phi_g}$

Methods differ in how they estimate this dispersion per gene. It can only be measured with true biological replicates.
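The variance formula of the NB model can be checked by simulation. The sketch below builds a negative binomial count as a gamma-Poisson mixture: the true level of each biological replicate varies around the mean following a gamma distribution, and sequencing adds Poisson noise on top. The values of `mu` and `phi`, the seed and the function names are assumptions for illustration.

```python
import random
from math import exp
from statistics import fmean, pvariance


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def nb_draw(mu, phi, rng):
    """Negative binomial count as a gamma-Poisson mixture."""
    # Gamma with shape 1/phi and scale mu*phi: mean mu, variance phi * mu^2
    replicate_level = rng.gammavariate(1 / phi, mu * phi)
    return poisson_draw(replicate_level, rng)


rng = random.Random(42)  # fixed seed so the run is repeatable
mu, phi = 100, 0.1       # hypothetical mean count and dispersion
counts = [nb_draw(mu, phi, rng) for _ in range(20000)]

print("empirical mean    :", round(fmean(counts), 1))
print("empirical variance:", round(pvariance(counts), 1))
print("mu + phi * mu^2   :", mu + phi * mu**2)
```

The empirical variance should land close to mu + phi * mu^2 and far above the Poisson variance of mu, which is the overdispersion seen between biological replicates.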

Page 45: RNA-seq: general concept, goal and experimental design - part 1

Variation summary, intuitively

Total CV² = Technical CV² + Biological CV²

For low counts, the Poisson (technical) variation or the measurement error is dominant.

For higher counts, the Poisson variation gets smaller, and another source of variation becomes dominant, the dispersion or the biological variation. Biological variation does not get smaller with higher counts.
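The two terms can be compared numerically. A minimal sketch, assuming technical CV² = 1/mean (Poisson) and biological CV² = dispersion; the dispersion value is hypothetical.

```python
def cv2_components(mean_count, dispersion):
    """Split total CV^2 into a technical (Poisson) and a biological part."""
    technical = 1 / mean_count
    biological = dispersion
    return technical, biological, technical + biological


phi = 0.05  # hypothetical dispersion, i.e. a biological CV of about 22%
for mean_count in (1, 10, 100, 1000):
    technical, biological, total = cv2_components(mean_count, phi)
    dominant = "technical" if technical > biological else "biological"
    print(f"mean {mean_count:>4}: technical {technical:.4f}, "
          f"biological {biological:.4f}, total {total:.4f} -> {dominant}")
```

With this dispersion, the technical term dominates up to a mean count of about 20 (1/phi); above that, sequencing deeper barely reduces the total variation.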

Page 46: RNA-seq: general concept, goal and experimental design - part 1

Beyond the NB model

It appears from the analysis of many biological replicates (n = 69) that not every gene can be modeled as NB: the Poisson-Tweedie model provides a further generalisation and a better fit for many genes (with an additional shape parameter).

Left figure: raw data shows that about 26% of the genes fit a NB model. Depending on the estimated shape parameter, other distributions fit better.

Esnaola et al. BMC Bioinformatics 2013, 14:254 http://www.biomedcentral.com/1471-2105/14/254

Page 47: RNA-seq: general concept, goal and experimental design - part 1

Consequence for our design

● For low counts: the uncertainty is big due to Poisson variation.

● For high counts: the uncertainty is big due to biological variation (highly expressed genes differ more in their natural variation, regulated by cellular processes, than lowly expressed genes).

● If we focus on the ratios between the conditions: is it reasonable to set a restriction on fold change? Highly expressed genes can have a smaller fold change and be significant. Lowly expressed genes can exceed a fold change of 2 without being significant.

Page 48: RNA-seq: general concept, goal and experimental design - part 1

Consequence on fold change

The fold-change cut-off readily applied in microarray analysis is of no use in RNA-seq.

Blue and red: known DE genes

Volcano plot

These often-applied cut-offs can prevent the detection of DE genes

Page 49: RNA-seq: general concept, goal and experimental design - part 1

Long story to say...

We need to estimate the model behind the count.

Never work without biological replicates.

Never work with 2 biological replicates.

Try to avoid working with 3 biological replicates.

Go for at least 4 biological replicates.

Page 50: RNA-seq: general concept, goal and experimental design - part 1

Break?

Page 51: RNA-seq: general concept, goal and experimental design - part 1

Overview

Figure: six samples (Sample 1 to Sample 6), divided over Condition X and Condition Y. Each sample is sequenced with RNA-seq, giving read counts for GeneA, GeneB and GeneC.

Page 52: RNA-seq: general concept, goal and experimental design - part 1

Summary

Obviously, the number of reads is dependent on:

1. chance
   → Define the count model (NB) from replicates

2. the expression level of the gene
   → Compare the ratios with a test

3. the total number of reads generated

4. the length of the transcript

Page 53: RNA-seq: general concept, goal and experimental design - part 1

The total number of reads generated

GeneA GeneB GeneC

sample

RNA-seq

The number of reads is dependent on the total number of reads generated. If one library is sequenced to 20M reads, and another one to 40M, most genes will ~double their counts.

GeneA GeneB GeneC

sample

More RNA-seq

Page 54: RNA-seq: general concept, goal and experimental design - part 1

Normalization for library size

Naive approach: divide by total library size. This is no longer applied!

Why not? Composition matters!

2 things to remember:

- zero sum system (or “we cannot count what we can't sequence”)

- 5 orders of magnitude

Page 55: RNA-seq: general concept, goal and experimental design - part 1

Normalization for library size

2 things to remember:

- zero sum system

- 5 orders of magnitude

In every sample, a lot of reads are spent on a few extremely highly expressed genes. Which genes these are differs between libraries, and including them distorts the naive library-size normalization.
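A toy calculation shows how this goes wrong. The sketch assumes a zero-sum system in which every gene gets its share of a fixed read budget, and two conditions in which only one highly expressed gene (the hypothetical `GeneX`) changes; all abundances and read totals are made up.

```python
def expected_counts(abundance, total_reads):
    """Expected reads per gene: each gene gets its share of a fixed read budget."""
    total = sum(abundance.values())
    return {gene: total_reads * a / total for gene, a in abundance.items()}


def counts_per_million(counts):
    """Naive normalisation: divide by the library total (scaled to one million)."""
    total = sum(counts.values())
    return {gene: 1_000_000 * c / total for gene, c in counts.items()}


# Hypothetical mRNA abundances: only GeneX differs between the conditions.
condition_a = {"GeneA": 100, "GeneB": 200, "GeneC": 300, "GeneX": 400}
condition_b = {"GeneA": 100, "GeneB": 200, "GeneC": 300, "GeneX": 3_600}

cpm_a = counts_per_million(expected_counts(condition_a, 20_000_000))
cpm_b = counts_per_million(expected_counts(condition_b, 40_000_000))

for gene in condition_a:
    print(f"{gene}: apparent fold change B/A = {cpm_b[gene] / cpm_a[gene]:.2f}")
```

GeneA, GeneB and GeneC did not change, yet after dividing by library size they all appear to be strongly down in condition B, because GeneX takes up most of the reads there.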

Page 56: RNA-seq: general concept, goal and experimental design - part 1

Normalization for library size

Schematically: when normalized on library size (squares represent the number of reads).

Rest of the genes

Rest of the genes

Few genes with enormous counts: there is NO SATURATION of these counts

All counts for library A All counts for library B

Page 57: RNA-seq: general concept, goal and experimental design - part 1

Normalization for library size

Better normalization would be as shown below. DESeq2 and edgeR apply such an approach (see later).

Figure: in each library, the block "Rest of the genes" is set to 100%.
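One robust scheme of this kind is the median-of-ratios size factor described for DESeq; edgeR uses a different robust method (TMM), which is not shown here. The sketch below is a plain-Python illustration of the idea, not the code of either package, and the counts are hypothetical.

```python
from math import exp, log
from statistics import median


def size_factors(libraries):
    """Median-of-ratios size factors for a list of {gene: count} libraries.

    Each library is compared with a per-gene reference (the geometric mean
    over libraries); the median ratio is insensitive to a few extreme genes.
    Genes with a zero count in any library are skipped.
    """
    genes = [g for g in libraries[0] if all(lib[g] > 0 for lib in libraries)]
    reference = {g: exp(sum(log(lib[g]) for lib in libraries) / len(libraries))
                 for g in genes}
    return [median(lib[g] / reference[g] for g in genes) for lib in libraries]


# Hypothetical counts: library B was sequenced twice as deep, and GeneX
# is additionally much more active in B.
library_a = {"GeneA": 10, "GeneB": 50, "GeneC": 200, "GeneD": 1_000, "GeneX": 5_000}
library_b = {"GeneA": 20, "GeneB": 100, "GeneC": 400, "GeneD": 2_000, "GeneX": 90_000}

factor_a, factor_b = size_factors([library_a, library_b])
print("median-of-ratios, B relative to A:", round(factor_b / factor_a, 2))
print("total-count ratio, B relative to A:",
      round(sum(library_b.values()) / sum(library_a.values()), 2))
```

The median-based factor recovers the twofold difference in depth that the bulk of the genes agree on, while the total-count ratio is dominated by GeneX.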

Page 58: RNA-seq: general concept, goal and experimental design - part 1

Gene length influences the count

“Longer transcripts generate more reads”

True! But the transcript length does not differ between samples. Since we are concerned with relative differences between samples, this needs no normalization (this story changes in case of absolute quantification).

Sample A Sample B

Gene A

Gene B

Gene A

Gene B
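That the transcript length drops out of a between-sample comparison can be seen in a toy model where reads scale with expression times length. The function name, lengths and expression levels below are hypothetical.

```python
def expected_reads(expression, length, reads_per_unit):
    """Toy model: reads scale with expression level and transcript length."""
    return expression * length * reads_per_unit


# Hypothetical transcript lengths (bp) and expression levels per sample
lengths = {"Gene A": 500, "Gene B": 4_000}
expression = {"Sample A": {"Gene A": 10, "Gene B": 10},
              "Sample B": {"Gene A": 20, "Gene B": 20}}

for gene, length in lengths.items():
    reads_a = expected_reads(expression["Sample A"][gene], length, 0.01)
    reads_b = expected_reads(expression["Sample B"][gene], length, 0.01)
    print(f"{gene}: {reads_a:.0f} vs {reads_b:.0f} reads, ratio {reads_b / reads_a:.1f}")
```

The long gene collects many more reads than the short one in both samples, but the ratio between the samples is the same for both genes. Comparing Gene A with Gene B within one sample would require length normalization.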

Page 59: RNA-seq: general concept, goal and experimental design - part 1

Between sample variation

Properties of libraries/samples can affect the counts, and lead to variation. This is called between-lane variation. Obvious ones: library size (how many reads are sampled) and library composition.

Different libraries/samples can exhibit increased variation by differing in how gene properties relate to gene counts. This is called within-lane variation.

Page 60: RNA-seq: general concept, goal and experimental design - part 1

GC-content of genes can influence counts

GC-content differs between genes. But it does not change between samples, so there should be no problem for relative expression comparison.

We can visualize the relationship between counts and GC very easily (see right). There is some trend, and it is equal for all samples.

EDASeq (R)

Page 61: RNA-seq: general concept, goal and experimental design - part 1

GC-content of genes can influence counts

Sometimes, samples show different relationships between GC-content of the genes and the counts.

This within-lane (or intra-sample) variation needs to be corrected for, so that the differentially expressed genes in one sample are not all simply the GC-rich ones.

Length can also have this effect.

Page 62: RNA-seq: general concept, goal and experimental design - part 1

What we need to know for our set-up

We want to detect differentially expressed genes between 2 or more conditions.

For this, we need to apply the conditions in a controlled environment (randomisation,...).

For good testing, we need to have some biological replicates per condition.

For cost effectiveness, we determine how deep we will sequence from each sample.

We analyse the reads, get raw counts and do the test!

Page 63: RNA-seq: general concept, goal and experimental design - part 1

Library preparation and lane loading

HiSeq2000: 24 single-index barcodes available. 1 lane gives 150-180 M reads. One lane of 50 bp SE costs approx. €1,500.
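These figures translate into a simple lane-loading calculation. A sketch, using the lower bound of 150 M reads per lane, the 24 barcodes and the lane price of roughly 1500 euro quoted above; the target depth of 20 M reads per sample and the function name are assumptions.

```python
def samples_per_lane(reads_per_lane, reads_per_sample, barcodes=24):
    """How many samples fit on one lane, limited by yield and by barcodes."""
    return min(barcodes, reads_per_lane // reads_per_sample)


lane_yield = 150_000_000   # lower bound of the quoted lane yield
lane_cost = 1500           # euro, one lane of 50 bp single-end
target_depth = 20_000_000  # hypothetical reads wanted per sample

n = samples_per_lane(lane_yield, target_depth)
print(f"{n} samples per lane, about {lane_cost / n:.0f} euro per sample")
```

With these numbers, 7 samples fit on one lane. At shallow target depths the number of available barcodes, not the lane yield, becomes the limit.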

Page 64: RNA-seq: general concept, goal and experimental design - part 1

Bioinformatics analysis will take most of your time

Quality control (QC) of raw reads

Preprocessing: filtering of reads and read parts, to help our goal of differential detection.

QC of preprocessing

Mapping to a reference genome (alternative: to a transcriptome)

QC of the mapping

Count table extraction

QC of the count table

DE test

Biological insight

Page 67: RNA-seq: general concept, goal and experimental design - part 1

Overview

http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html

Page 68: RNA-seq: general concept, goal and experimental design - part 1

The numbers get reduced with every step

20M

25M

15M

Page 69: RNA-seq: general concept, goal and experimental design - part 1

Deeper, or more replicates?

Variance will be lower with more reads: but sequencing another biological replicate is preferred over sequencing deeper, or technical reps.

Doi: 10.1093/bioinformatics/btt015

Page 70: RNA-seq: general concept, goal and experimental design - part 1

There is a tool to help you set up

Page 71: RNA-seq: general concept, goal and experimental design - part 1

Scotty – power analysis

Power: the probability of rejecting the null hypothesis if the alternative is true.

'How many samples and how deep in order to minimize false negatives'.

(a null hypothesis is always a scenario in which there is no difference, hence no differential expression).

Alternative tools:

http://wiki.bits.vib.be/index.php/RNAseq_toolbox

Page 72: RNA-seq: general concept, goal and experimental design - part 1

Help with design

http://wiki.bits.vib.be/index.php/RNAseq_toolbox http://rnaseq.uoregon.edu/exp_design.html

Page 73: RNA-seq: general concept, goal and experimental design - part 1

How many samples to sequence?

→ Scotty exercise

Page 74: RNA-seq: general concept, goal and experimental design - part 1

Keywords

A read count of a gene is dependent on:

1. chance

2. expression level

3. transcript length

4. depth of sequencing

5. GC-content

Poisson distribution

Negative binomial distribution

Condition

Sample

Normalization

Write in your own words what the terms mean

Page 75: RNA-seq: general concept, goal and experimental design - part 1

Reads

All my references available at: https://www.zotero.org/groups/dernaseq/items