
1

Global expression analysis

Monday 10/1: Intro * 1-page Project Overview due; Intro to R lab

Wednesday 10/3: Stats & FDR - * read the paper!

Monday 10/8: Calling differentially expressed genes with baySeq - * read the paper!; baySeq lab for RNA-seq data

Wednesday 10/10: Clustering analysis

Monday 10/15: Clustering analysis; Clustering lab

Wednesday 10/17: Motif analysis

Monday 10/22: Motif analysis; Motif lab

Wednesday 10/24: ChIP/RIP/Nuc/Ect-Seq

2

Global expression analysis

Goal: To measure transcript abundance of every gene in your organism at once …

AND make sense out of it

The power is in organizing genomic expression data to find meaningful patterns & groups of genes

Gasch et al. 2000, 2001

4

What kinds of information can we extract from genomic expression data?

1. Hypothetical functions for uncharacterized genes
-- genes encoding subunits of multi-subunit protein complexes are often highly coregulated
   example: ribosomal protein genes, proteasome genes in yeast
-- genes involved in the same cellular processes are often coregulated

2. New roles for characterized genes

3. Better understanding of the experimental conditions
-- based on expression patterns of characterized genes

4. Implications of gene regulation
-- WT vs. mutants can identify transcription factor targets
-- promoter analysis of coregulated genes = upstream elements
-- gene coregulation with known pathway targets can implicate pathway activity

5. Understanding developmental pathways

6. Defining samples based on expression profiles
   example: comparing tumor samples from patients

5

Technologies for Quantifying & Identifying Nucleic Acids

DNA microarrays
1. Collect RNA
2. Generate fluorescently-labeled cDNA
3. Hybridize to array
4. Detect fluorescence emission with scanning laser
Data: Continuous measurements of relative fluorescence

Deep sequencing
1. Collect RNA
2. Make strand-specific cDNA library
3. Deep sequence short reads
4. Relate sequences back to genome / transcriptome location (or de novo assembly)
Data: Number of sequencing reads per base in the genome = Discrete ‘Counts’

6

[Figure: ORF, mRNA, and array probes]

Tiled-genome arrays cover the entire genome

7

Tiled sequences across each gene / locus

To get relative differences in expression across two samples:
1. Need to normalize array signals across arrays
2. Need to compress measurements to a single score for each gene/transcript

Tiled genomic arrays (Nimblegen, Affymetrix, Agilent)

8

PM = ‘perfect match’ oligo
MM = ‘mismatch’ oligo (central nucleotide is mutated)

‘Robust Multiarray Analysis’ (RMA; Irizarry et al. 2003)
1. On Affy: Throw out elements where MM signal > PM signal … but otherwise ignore MM
2. Local background subtraction from each probe intensity
3. Quantile normalization of arrays to be compared … sets the distribution of probe intensities to be the same
4. Convert intensity values to log2 scale
5. Use a linear model to fit a given probe set and compute one expression value per gene
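The quantile normalization and log2 steps of RMA can be sketched as below. This is a minimal illustration, not the Bioconductor implementation: the function names and the intensity values are hypothetical, ties are broken by probe order rather than averaged, and background subtraction and the probe-set linear model are left out.

```python
import math


def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equal-length lists, one list of probe intensities per array.
    """
    n_probes = len(arrays[0])
    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity
        # (simplification: ties are broken by probe index).
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            # Replace each intensity with the mean intensity at its rank.
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized


def to_log2(values):
    """Convert positive intensities to the log2 scale."""
    return [math.log2(v) for v in values]


# Hypothetical background-corrected intensities for 4 probes on 2 arrays.
array_a = [100.0, 400.0, 200.0, 800.0]
array_b = [300.0, 900.0, 500.0, 1500.0]
norm_a, norm_b = quantile_normalize([array_a, array_b])
log_a = to_log2(norm_a)
log_b = to_log2(norm_b)
```

After normalization both arrays contain exactly the same set of values, so any remaining difference between them is in which probe holds which rank.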

Tiled genomic arrays (Nimblegen, Affymetrix, Agilent)

9

Deep sequencing for gene expression analysis

Starting from mRNA:
Old protocol: make ds cDNA -> sequence
New protocols: 1st strand cDNA (2nd strand with dUTP) -> sequence

Number of sequencing reads per region ~= number of starting transcripts

10

Number of sequencing reads per region ~= number of starting transcripts

BUT … have to account for the length of the gene/transcript: counts per base pair

* But sometimes one lane of sequencing works better than others:
Simple normalization: Avg counts within gene length / Total counts in that lane
RPKM: Reads Per Kb per Million mapped reads

[Figure: total reads in lane, 40 x 10^6 vs. 32 x 10^6]
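As a worked example of RPKM, the sketch below normalizes a read count by both gene length and lane depth. The lane totals (40 x 10^6 and 32 x 10^6) are the ones shown on the slide; the gene length and per-gene read counts are made up for illustration, and `rpkm` is a hypothetical helper name.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    # reads / (length in kb) / (mapped reads in millions)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical 2,000 bp gene sequenced in two lanes of different depth.
lane_1 = rpkm(read_count=800, gene_length_bp=2000, total_mapped_reads=40e6)
lane_2 = rpkm(read_count=640, gene_length_bp=2000, total_mapped_reads=32e6)
```

The raw counts differ (800 vs. 640), but the lane with fewer counts also has proportionally fewer total reads, so both lanes give the same RPKM.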

11

Another challenge: mapping reads to the genome/transcriptome

[Figure: spliced transcript and DNA, with intron]

Should you restrict yourself to ORF annotations?

Can map reads to genome or transcriptome sequence, or assemble de novo.

12

Comparing samples via fold-changes: RPKM across samples reflects differential expression

Usually work in log2 space
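One reason to work in log2 space is that up- and down-regulation become symmetric around zero. A minimal sketch follows; the pseudocount is an assumption added here to avoid dividing by zero for unexpressed genes, not something the slides specify, and the RPKM values are hypothetical.

```python
import math


def log2_fold_change(rpkm_a, rpkm_b, pseudocount=1.0):
    """log2 ratio of sample A over sample B.

    pseudocount is an assumed small offset that keeps the ratio
    defined when either RPKM value is zero.
    """
    return math.log2((rpkm_a + pseudocount) / (rpkm_b + pseudocount))


# Hypothetical RPKM values: a 4-fold increase and a 4-fold decrease.
up = log2_fold_change(31.0, 7.0)
down = log2_fold_change(7.0, 31.0)
```

Here `up` and `down` have the same magnitude and opposite sign, whereas the raw ratios (4 and 0.25) are not symmetric.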

13

Now each sample = list of normalized relative transcript values

ID              Array 1 log ratio   Array 2 log ratio (635/532)
YPL187W         6.36                -0.072
YGR043C         1.82                -0.228
YGL089C         6.439
YCR040W         1.012               0.694
YCR039C         1.147               -0.487
YCL001W         1.934               -0.536
YJR004C         2.76                0.026
YLL005C         2.395               -0.008
YGL101W         2.22                0
YLR040C         2.073               -0.659
upgrade plate   1.863               -0.408
EMPTY           1.755               -0.008
upgrade plate   1.573               0.109
EMPTY           1.529               -0.866
YBL051C         1.419               -0.054
YLR349W         1.382               -0.457
YCL066W         1.338
YLR227W-A       1.335               -0.419
upgrade plate   1.314               -0.401
YDL186W         1.246               0.959
YDR536W         1.183               -0.58
upgrade plate   1.165               0.543
YHR124W         1.163               -0.465
EMPTY           1.127               -0.715
YAL065C         1.091               -1.133
YBR012W-A       1.078               0.676
YCL026C-A       1.046               -0.468
YJL078C         1.045               -0.889
YHR161C         1.033               -0.033
YBR244W         1.028
YGR237C         1                   -0.754
YGL189C         0.997               -0.11
YCL009C         0.989               0.014
YKL185W         0.968
YDR285W         0.95                -0.435
YMR057C         0.949               0.672
Q0250           0.942               -0.219
YOR235W         0.924               1.166
YDR415C         0.922               -0.334
YER072W         0.906               -0.509
EMPTY           0.892               -1.174
EMPTY           0.89                -0.818
YDL013W         0.877
YLR206W         0.874
YML047C         0.874               -0.819
YDR306C         0.858
YDR528W         0.823               0.276
YGL088W         0.8
YBL097W         0.787
YBR013C         0.782               -0.896
YIR019C         0.779
YDR361C         0.772               -1.017
YLR267W         0.769               -0.457
YAL008W         0.746               1.465
YGL128C         0.741               0.027
YDR530C         0.739               2.083

14

Assessing replicates: how well do the data agree overall? Linear regression

Example of good replicates: y = 0.978x + 0.0095, R^2 = 0.8332
[Scatter plot: Array 1 values (x) vs. Array 2 values (y); series "DES460 + 0.2% MMS - 45 min" with linear fit]

Example of bad replicates: y = 0.1104x - 0.0358, R^2 = 0.0205
[Scatter plot: Array 1 values (x) vs. Array 2 values (y)]

Where does the noise come from?
-- can be biological variation
-- can be array artifacts
… should define both types of variation …
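The slope, intercept, and R^2 reported for replicate arrays come from an ordinary least-squares fit of one array's values against the other's. The sketch below shows that calculation on made-up log ratios; `fit_line` is a hypothetical helper, and it assumes genes with a missing value on either array have already been dropped.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R^2 is the squared Pearson correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical log2 ratios for the same five genes on two replicate arrays.
array_1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
array_2 = [-1.9, -1.2, 0.1, 0.9, 2.1]
slope, intercept, r_squared = fit_line(array_1, array_2)
```

Good replicates give a slope near 1, an intercept near 0, and a high R^2; a slope near 0 with a low R^2 means the two arrays barely agree.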

15

Now you have your data, in the form of relative log2 expression differences

Now what?

16

Select differentially expressed genes to focus on

Methods of gene selection:

-- arbitrary fold-expression-change cutoff
   example: genes that change >3X in expression between samples

-- statistically significant change in expression
   requires replicates

[Figure: expression difference between Gene X expression under condition 1 and Gene X expression under condition 2]
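A fold-change cutoff is easy to apply once the data are log2 ratios: a >3X change corresponds to an absolute log2 ratio above log2(3), about 1.585. In the sketch below the first four entries reuse values from the Array 2 column of the example table, and the last entry is invented to show a down-regulated gene. Treating increases and decreases alike is an assumption, and `select_by_fold_change` is a hypothetical name.

```python
import math


def select_by_fold_change(log2_ratios, fold_cutoff=3.0):
    """Return gene IDs whose expression changes by more than fold_cutoff.

    log2_ratios maps gene ID -> log2 ratio, or None for a missing value.
    Increases and decreases are both selected (an assumption).
    """
    threshold = math.log2(fold_cutoff)
    return [
        gene
        for gene, ratio in log2_ratios.items()
        if ratio is not None and abs(ratio) > threshold
    ]


log2_ratios = {
    "YDR530C": 2.083,
    "YAL008W": 1.465,
    "YAL065C": -1.133,
    "YGL089C": None,  # missing measurement on this array
    "HYPOTHETICAL_DOWN": -2.0,  # invented gene for illustration
}
selected = select_by_fold_change(log2_ratios)
```

Note that YAL008W (about 2.8-fold) just misses the cutoff, which illustrates why a fixed threshold is described as arbitrary.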

17


[Figure: Gene X expression under condition 1 vs. Gene X expression under condition 2]





18


Use statistics to compare the mean & variation of 2 (or more) populations





19

Test if the means of 2 (or more) groups are the same or statistically different

The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject (‘accept’) the null hypothesis

Choosing the right test:

parametric test if your data are normally distributed with equal variance

nonparametric test if either of the above is not true

Why do the data need to be normally distributed?

20

Test if the means of 2 groups are the same or statistically different

The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject (‘accept’) the null hypothesis

If your two samples are normally distributed with equal variance, use the t-test:

$$T = \frac{\bar{X}_1 - \bar{X}_2}{\mathrm{SED}}$$

where $\bar{X}_1 - \bar{X}_2$ is the difference in the means and SED is the standard error of the difference in the means.

If T > Tc, where Tc is the critical value for the degrees of freedom & confidence level, then reject H0

Notice that if the data aren’t normally distributed, mean and standard deviation are not meaningful.
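The t statistic can be computed by hand for one gene, as in the sketch below. It assumes the equal-variance (pooled) form of the test, compares |T| against the critical value (a two-sided test, which is an assumption), and uses invented log2 expression values. The critical value 2.776 is the standard two-sided 95% value for 4 degrees of freedom.

```python
import math


def students_t(group_1, group_2):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(group_1), len(group_2)
    mean_1 = sum(group_1) / n1
    mean_2 = sum(group_2) / n2
    ss_1 = sum((v - mean_1) ** 2 for v in group_1)
    ss_2 = sum((v - mean_2) ** 2 for v in group_2)
    df = n1 + n2 - 2
    pooled_var = (ss_1 + ss_2) / df
    # SED: standard error of the difference in the means.
    sed = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean_1 - mean_2) / sed, df


# Hypothetical log2 expression of Gene X, three replicates per condition.
condition_1 = [2.1, 2.4, 1.9]
condition_2 = [0.2, -0.1, 0.3]
t_stat, df = students_t(condition_1, condition_2)

T_CRITICAL = 2.776  # two-sided, 95% confidence, 4 degrees of freedom
reject_h0 = abs(t_stat) > T_CRITICAL
```

With only three replicates per condition the critical value is large, so a gene needs a big difference in means relative to its variance before H0 is rejected.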

21

Differential expression on DNA microarrays: Bioconductor package Limma (ref)

** See previous years’ limma lab for a walk-through example

1. Load your data
2. Provide a ‘target’ file that says which samples are on which arrays
3. Provide a ‘design’ file (and in some cases a ‘contrast matrix’) to specify which samples you want to compare
4. Limma will look at the entire dataset and model the error on the data, to try to overcome measurement error
5. Limma then does a modified T-test to identify genes with significant expression differences across the samples you specified.
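Limma's modified T-test is usually described as a moderated t-statistic: each gene's variance is shrunk toward a prior variance learned from the whole dataset before the t statistic is formed. The sketch below shows only that shrinkage step for one gene in a two-group comparison. It is not limma itself: `prior_df` and `prior_var` are hypothetical fixed inputs here, whereas limma estimates them from all genes, and the expression values are invented.

```python
import math


def moderated_t(group_1, group_2, prior_df, prior_var):
    """Two-group t statistic using a variance shrunk toward prior_var.

    prior_df and prior_var are assumed inputs; with prior_df = 0 this
    reduces to the ordinary equal-variance t statistic.
    """
    n1, n2 = len(group_1), len(group_2)
    mean_1 = sum(group_1) / n1
    mean_2 = sum(group_2) / n2
    residual_df = n1 + n2 - 2
    gene_var = (
        sum((v - mean_1) ** 2 for v in group_1)
        + sum((v - mean_2) ** 2 for v in group_2)
    ) / residual_df
    # Weighted average of the prior variance and this gene's own variance.
    posterior_var = (prior_df * prior_var + residual_df * gene_var) / (
        prior_df + residual_df
    )
    sed = math.sqrt(posterior_var * (1 / n1 + 1 / n2))
    return (mean_1 - mean_2) / sed, prior_df + residual_df


# Hypothetical gene with a small difference and, by chance, tiny variance.
group_1 = [1.00, 1.01, 0.99]
group_2 = [0.90, 0.91, 0.89]
t_plain, df_plain = moderated_t(group_1, group_2, prior_df=0, prior_var=0.0)
t_mod, df_mod = moderated_t(group_1, group_2, prior_df=4, prior_var=0.05)
```

The ordinary statistic is very large only because three replicates happened to agree closely; borrowing variance information from the rest of the dataset pulls it down to an unremarkable value.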
