1
Global expression analysis
Monday 10/1: Intro (* 1-page Project Overview due); Intro to R lab
Wednesday 10/3: Stats & FDR (* read the paper!)
Monday 10/8: Calling differentially expressed genes with baySeq (* read the paper!); baySeq lab for RNA-seq data
Wednesday 10/10: Clustering analysis
Monday 10/15: Clustering analysis; Clustering lab
Wednesday 10/17: Motif analysis
Monday 10/22: Motif analysis; Motif lab
Wednesday 10/24: ChIP/RIP/Nuc/Ect-Seq
2
Global expression analysis
Goal: To measure transcript abundance of every gene in your organism at once …
AND make sense out of it
The power is in organizing genomic expression data to find meaningful patterns & groups of genes
Gasch et al. 2000, 2001
4
What kinds of information can we extract from genomic expression data?
1. Hypothetical functions for uncharacterized genes
-- genes encoding subunits of multi-subunit protein complexes are often highly coregulated
   example: ribosomal protein genes, proteasome genes in yeast
-- genes involved in the same cellular processes are often coregulated
2. New roles for characterized genes
3. Better understanding of the experimental conditions
-- based on expression patterns of characterized genes
4. Implications of gene regulation
-- WT vs. mutants can identify transcription factor targets
-- promoter analysis of coregulated genes = upstream elements
-- gene coregulation with known pathway targets can implicate pathway activity
5. Understanding developmental pathways
6. Defining samples based on expression profiles
   example: comparing tumor samples from patients
5
Technologies for Quantifying & Identifying Nucleic Acids
DNA microarrays
1. Collect RNA
2. Generate fluorescently-labeled cDNA
3. Hybridize to array
4. Detect fluorescence emission with scanning laser
Data: Continuous measurements of relative fluorescence

Deep sequencing
1. Collect RNA
2. Make strand-specific cDNA library
3. Deep sequence short reads
4. Relate sequences back to genome / transcriptome location (or de novo assembly)
Data: Number of sequencing reads for each base in the genome = Discrete ‘Counts’
6
Tiled-genome arrays cover the entire genome
[Figure: array probes tiled along a locus; labels: ORF, mRNA, Array Probes]
7
Tiled genomic arrays (Nimblegen, Affymetrix, Agilent)
Tiled sequences across each gene / locus
To get relative differences in expression across two samples:
1. Need to normalize array signals across arrays
2. Need to compress measurements to a single score for each gene/transcript
8
Tiled genomic arrays (Nimblegen, Affymetrix, Agilent)
PM = ‘perfect match’ oligo
MM = ‘mismatch’ oligo (central nucleotide is mutated)
‘Robust Multiarray Analysis’ (RMA; Irizarry et al. 2003)
1. On Affy: Throw out elements where MM signal > PM signal … but otherwise ignore MM
2. Local background subtraction from each probe intensity
3. Quantile normalization of arrays to be compared … sets the distribution of probe intensities to be the same
4. Convert intensity values to log2 scale
5. Use a linear model to fit a given probe set and compute one expression value per gene
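Steps 3 and 4 above (quantile normalization, then log2) can be sketched in a few lines of Python. This is only an illustration of the idea, not the RMA implementation: the function names and the intensity values are made up, ties are handled naively, and background subtraction and the probe-set model fit are left out.

```python
import math

def quantile_normalize(arrays):
    """Force every array to share the same intensity distribution.

    arrays: list of equal-length lists of probe intensities (hypothetical data).
    Simplification: tied values are not averaged, unlike full implementations.
    """
    n_probes = len(arrays[0])
    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            # Each probe gets the mean intensity for its rank.
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized

def log2_transform(values):
    """Convert positive intensities to log2 scale."""
    return [math.log2(v) for v in values]

# Made-up probe intensities for two arrays.
array_1 = [100.0, 400.0, 200.0, 800.0]
array_2 = [300.0, 150.0, 1200.0, 600.0]

norm_1, norm_2 = quantile_normalize([array_1, array_2])
log_1 = log2_transform(norm_1)
log_2 = log2_transform(norm_2)
```

After normalization both arrays contain the same set of values; only the assignment of values to probes differs, which is what "sets the distribution of probe intensities to be the same" means.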
9
Deep sequencing for gene expression analysis
mRNA
Old protocol: make ds cDNA -> Sequence
New protocols: 1st strand cDNA (2nd strand with dUTP) -> Sequence
Number of sequencing reads per region ~= number of starting transcripts
10
Number of sequencing reads per region ~= number of starting transcripts
BUT … have to account for the length of the gene/transcript: Counts per base pair
* But sometimes one lane of sequencing works better than others
Total reads in lane: 40 x 10^6 vs. 32 x 10^6
Simple normalization: Avg counts within gene length / Total counts in that lane
RPKM: Reads Per Kb per Million mapped reads
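A small worked example of RPKM, as a sketch. The lane totals (40 x 10^6 and 32 x 10^6) come from the slide; the gene length and read counts are made up for illustration, and the function name is hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kb of transcript per Million mapped reads."""
    kilobases = gene_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_of_reads)

# Hypothetical 2 kb gene sequenced in two lanes of different depth.
lane_a = rpkm(read_count=400, gene_length_bp=2_000, total_mapped_reads=40_000_000)
lane_b = rpkm(read_count=320, gene_length_bp=2_000, total_mapped_reads=32_000_000)

print(lane_a, lane_b)
```

The raw counts differ (400 vs. 320), but the RPKM values are equal, because the difference is entirely explained by lane depth.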
11
Another challenge: mapping reads to the genome/transcriptome
[Figure: a spliced transcript aligned to DNA; labels: intron, Spliced transcript, DNA]
Should you restrict yourself to ORF annotations?
Can map reads to genome or transcriptome sequence, or assemble de novo.
12
Comparing samples via fold-changes: the ratio of RPKM values across samples reflects differential expression
Usually work in log2 space
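One reason to work in log2 space is symmetry: a 2-fold increase and a 2-fold decrease give +1 and -1. A minimal sketch follows; the function name and RPKM values are made up, and the optional pseudocount is an assumption added here (not from the slides) to avoid dividing by zero when a gene has no reads.

```python
import math

def log2_ratio(rpkm_sample, rpkm_reference, pseudocount=0.0):
    """log2 fold-change of sample vs. reference.

    pseudocount is a hypothetical convenience for zero-count genes.
    """
    return math.log2((rpkm_sample + pseudocount) / (rpkm_reference + pseudocount))

two_fold_up = log2_ratio(20.0, 10.0)    # expression doubled
two_fold_down = log2_ratio(5.0, 10.0)   # expression halved
unchanged = log2_ratio(10.0, 10.0)

print(two_fold_up, two_fold_down, unchanged)
```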
13
Now each sample = list of normalized relative transcript values

ID              Array 1: Log ratio    Array 2: Log Ratio (635/532)
YPL187W         6.36                  -0.072
YGR043C         1.82                  -0.228
YGL089C         6.439                 (blank)
YCR040W         1.012                 0.694
YCR039C         1.147                 -0.487
YCL001W         1.934                 -0.536
YJR004C         2.76                  0.026
YLL005C         2.395                 -0.008
YGL101W         2.22                  0
YLR040C         2.073                 -0.659
upgrade plate   1.863                 -0.408
EMPTY           1.755                 -0.008
upgrade plate   1.573                 0.109
EMPTY           1.529                 -0.866
YBL051C         1.419                 -0.054
YLR349W         1.382                 -0.457
YCL066W         1.338                 (blank)
YLR227W-A       1.335                 -0.419
upgrade plate   1.314                 -0.401
YDL186W         1.246                 0.959
YDR536W         1.183                 -0.58
upgrade plate   1.165                 0.543
YHR124W         1.163                 -0.465
EMPTY           1.127                 -0.715
YAL065C         1.091                 -1.133
YBR012W-A       1.078                 0.676
YCL026C-A       1.046                 -0.468
YJL078C         1.045                 -0.889
YHR161C         1.033                 -0.033
YBR244W         1.028                 (blank)
YGR237C         1                     -0.754
YGL189C         0.997                 -0.11
YCL009C         0.989                 0.014
YKL185W         0.968                 (blank)
YDR285W         0.95                  -0.435
YMR057C         0.949                 0.672
Q0250           0.942                 -0.219
YOR235W         0.924                 1.166
YDR415C         0.922                 -0.334
YER072W         0.906                 -0.509
EMPTY           0.892                 -1.174
EMPTY           0.89                  -0.818
YDL013W         0.877                 (blank)
YLR206W         0.874                 (blank)
YML047C         0.874                 -0.819
YDR306C         0.858                 (blank)
YDR528W         0.823                 0.276
YGL088W         0.8                   (blank)
YBL097W         0.787                 (blank)
YBR013C         0.782                 -0.896
YIR019C         0.779                 (blank)
YDR361C         0.772                 -1.017
YLR267W         0.769                 -0.457
YAL008W         0.746                 1.465
YGL128C         0.741                 0.027
YDR530C         0.739                 2.083
14
Assessing replicates: how well do the data agree overall?
linear regression

Example of good replicates: y = 0.978x + 0.0095, R^2 = 0.8332
[Scatter plot: Array 2 values vs. Array 1 values; series: DES460 + 0.2% MMS - 45 min, with linear fit]

Example of bad replicates: y = 0.1104x - 0.0358, R^2 = 0.0205
[Scatter plot: Array 2 values vs. Array 1 values]

Where does the noise come from?
-- can be biological variation
-- can be array artifacts
… should define both types of variation …
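The slope, intercept and R^2 quoted for the replicate plots come from an ordinary least-squares fit of Array 2 values against Array 1 values. A minimal sketch of that calculation is below; the function name and the data are made up, and genes with a missing value on either array are simply skipped (an assumption about how blanks are handled).

```python
def fit_replicates(array_1, array_2):
    """Least-squares line array_2 = slope * array_1 + intercept, plus R^2.

    Pairs where either value is None (missing spot) are dropped.
    """
    pairs = [(a, b) for a, b in zip(array_1, array_2)
             if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical log2 ratios for the same genes on two replicate arrays.
rep_1 = [-2.0, -1.0, 0.0, None, 1.0, 2.0]
rep_2 = [-1.9, -1.1, 0.1, 0.4, 0.9, 2.1]

slope, intercept, r_squared = fit_replicates(rep_1, rep_2)
print(slope, intercept, r_squared)
```

Good replicates give a slope near 1, an intercept near 0 and an R^2 near 1, as in the first example on the slide.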
15
Now you have your data, in the form of relative log2 expression differences
Now what?
16
Select differentially expressed genes to focus on
Methods of gene selection:
-- arbitrary fold-expression-change cutoff
   example: genes that change >3X in expression between samples
-- statistically significant change in expression
   requires replicates
[Figure: Expression difference between Gene X expression under condition 1 and Gene X expression under condition 2]
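The fold-change cutoff can be sketched directly on log ratios: a >3X change in either direction corresponds to |log2 ratio| > log2(3), about 1.585. The sketch below assumes the values are log2 ratios; the function name is hypothetical, and the five values are taken from the Array 2 column shown earlier (None marks a blank).

```python
import math

def select_by_fold_change(log2_ratios, fold_cutoff=3.0):
    """Return gene IDs whose expression changes by more than fold_cutoff.

    log2_ratios: dict of gene ID -> log2 ratio, or None if no value.
    """
    threshold = math.log2(fold_cutoff)
    return [gene for gene, value in log2_ratios.items()
            if value is not None and abs(value) > threshold]

array_2 = {
    "YGL089C": None,
    "YGL101W": 0.0,
    "YDR361C": -1.017,
    "YAL008W": 1.465,
    "YDR530C": 2.083,
}

changed_3x = select_by_fold_change(array_2)                   # >3X cutoff
changed_2x = select_by_fold_change(array_2, fold_cutoff=2.0)  # looser cutoff
print(changed_3x, changed_2x)
```

The cutoff is arbitrary: moving it from 3X to 2X changes which genes are selected, and neither choice says anything about statistical significance.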
18
Expression difference
Use statistics to compare the mean & variation of 2 (or more) populations
19
Test if the means of 2 (or more) groups are the same or statistically different
The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject the null hypothesis
Choosing the right test:
parametric test if your data are normally distributed with equal variance
nonparametric test if either of the above is not true
Why do the data need to be normally distributed?
20
Test if the means of 2 groups are the same or statistically different
The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject the null hypothesis
If your two samples are normally distributed with equal variance, use the t-test:
$$T = \frac{\bar{X}_1 - \bar{X}_2}{\mathrm{SED}}$$
where $\bar{X}_1 - \bar{X}_2$ is the difference in the means and SED is the standard error of the difference in the means
If T > Tc, where Tc is the critical value for the degrees of freedom & confidence level, then reject H0
Notice that if the data aren’t normally distributed, mean and standard deviation are not meaningful.
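A worked example of the t-statistic for one gene, as a sketch under the equal-variance assumption (pooled variance). The function name and expression values are made up. The critical value 2.776 is the standard two-sided value for 4 degrees of freedom at the 95% confidence level; the code compares |T| to it because the test here is two-sided.

```python
import math

def pooled_t_statistic(group_1, group_2):
    """T = (mean_1 - mean_2) / SED, assuming equal variance in both groups."""
    n1, n2 = len(group_1), len(group_2)
    mean_1 = sum(group_1) / n1
    mean_2 = sum(group_2) / n2
    ss_1 = sum((v - mean_1) ** 2 for v in group_1)
    ss_2 = sum((v - mean_2) ** 2 for v in group_2)
    dof = n1 + n2 - 2
    pooled_variance = (ss_1 + ss_2) / dof
    # Standard error of the difference in the means.
    sed = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean_1 - mean_2) / sed, dof

# Hypothetical log2 expression of Gene X, three replicates per condition.
condition_1 = [1.0, 2.0, 3.0]
condition_2 = [4.0, 5.0, 6.0]

t_stat, dof = pooled_t_statistic(condition_1, condition_2)

T_CRITICAL = 2.776  # two-sided, 95% confidence, 4 degrees of freedom
reject_h0 = abs(t_stat) > T_CRITICAL
print(t_stat, dof, reject_h0)
```

With only three replicates per condition the critical value is large, which is one reason the selection by statistical significance "requires replicates".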
21
Differential expression on DNA microarrays: Bioconductor package Limma (ref)
** See previous years’ limma lab for a walk-through example
1. Load your data
2. Provide a ‘target’ file that says which samples are on which arrays
3. Provide a ‘design’ file (and in some cases a ‘contrast matrix’) to specify which samples you want to compare
4. Limma will look at the entire dataset and model the error on the data, to try to overcome measurement error
5. Limma then does a modified t-test to identify genes with significant expression differences across the samples you specified.