Q-Vals (and False Discovery Rates) Made Easy
Dennis Shasha
Based on the paper "Statistical significance for genomewide studies" by John Storey and Robert Tibshirani
PNAS August 5, 2003 9440-9445
Challenge
• You test plants/patients/… in two settings (or from different populations).
• You want to know which genes are differentially expressed (alternate).
• You don’t want to make too many mistakes (declaring a gene to be alternate when in fact it’s null – not differentially expressed).
First Idea
• You take p-vals of the differences in expression.
• P-val(g) is the probability that if g is null, it would have a difference at least this large.
• You choose a cutoff, say 0.05.
• You say all genes that differ with p-val <= 0.05 are truly different.
• What’s the problem?
Thought Experiment
• Suppose that no genes are truly differentially expressed.
• You will still call about 5% of all genes significant, even though none of them really are.
• Your false discovery rate (number null among those predicted to be alternate/number predicted to be alternate) = 100%.
• Bad.
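The thought experiment can be simulated in a few lines. This is a minimal sketch, assuming (as the next slide argues) that null p-values are uniform on [0, 1]; the gene count, seed, and function name are made up for illustration.

```python
import random

def all_null_experiment(num_genes=10_000, cutoff=0.05, seed=0):
    """Simulate a study in which no gene is truly differentially expressed."""
    rng = random.Random(seed)
    # Under the null, a p-value is equally likely to fall anywhere in [0, 1].
    pvals = [rng.random() for _ in range(num_genes)]
    called = sum(p <= cutoff for p in pvals)
    # Every gene is null, so every gene called significant is a false discovery.
    false_calls = called
    fdr = false_calls / called if called else 0.0
    return called, fdr

called, fdr = all_null_experiment()
print(f"called significant: {called} of 10000, false discovery rate: {fdr:.0%}")
```

Roughly 5% of the genes pass the cutoff, and all of them are mistakes.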
A Fundamental Insight
• All truly null genes (i.e. not truly differentially expressed) are equally likely to have any p-val.
• That is by construction of the p-value: under the null hypothesis, 1% of the genes will fall in the top percentile, 1% will fall between the 89th and 90th percentiles, and so on. A p-value is just a way of stating a percentile under the null condition.
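A quick check of this uniformity, as a sketch under stated assumptions: each null gene's test statistic is taken to be a standard normal z-score (an assumption, not something the slides specify), and the p-value is its two-sided tail probability.

```python
import math
import random

def two_sided_pvalue(z):
    """P(|Z| >= |z|) for a standard normal Z: the null percentile of |z|."""
    return math.erfc(abs(z) / math.sqrt(2))

rng = random.Random(1)
num_genes = 100_000
# Hypothetical null genes: the test statistic is pure standard-normal noise.
pvals = [two_sided_pvalue(rng.gauss(0.0, 1.0)) for _ in range(num_genes)]

# Count p-values in ten equal-width bins; each should hold about 10% of genes.
bins = [0] * 10
for p in pvals:
    bins[min(int(p * 10), 9)] += 1
fractions = [count / num_genes for count in bins]
print([round(f, 3) for f in fractions])
```

Every bin comes out near 0.1, which is the flat histogram the later slides rely on.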
What Do We Do With That?
• Mixture model: imagine null genes as light blue marbles and truly different genes as red ones.
• If the assay is decent, red marbles should be concentrated at the low p-values.
[Figure: p-value axis running from 0 to 1.]
Method We Can Use
• We don’t of course know the colors of the marbles/we don’t know which genes are true alternates.
• However, we know that null marbles are equally likely to have any p-value.
• So, at the p-value where the height of the marbles levels off, we have primarily light blue marbles/null genes.
• Why?
[Figure: p-value axis from 0 to 1, marking where the flat region starts and the level of the flat region.]
Answer
• Because if all genes/marbles were null, the heights would be about uniform.
• Provided the reds are concentrated near the low p-vals, the flat regions will be primarily light blues.
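The flat region can be turned into an estimate of how many genes are null. The sketch below assumes the flat region starts at p = 0.5 (the parameter `flat_start` is a hypothetical choice; in practice it is read off the histogram) and uses a made-up mixture of 80% nulls and 20% alternates.

```python
import random

def estimate_null_fraction(pvals, flat_start=0.5):
    """Estimate the fraction of null genes from the flat region p > flat_start."""
    in_flat = sum(p > flat_start for p in pvals)
    # Nulls are uniform, so the flat region holds a (1 - flat_start) share of them.
    return min(1.0, in_flat / (len(pvals) * (1.0 - flat_start)))

rng = random.Random(2)
# Hypothetical mixture: 8000 null genes (uniform p-values) plus 2000 alternates
# whose p-values are drawn from Beta(1, 30), i.e. concentrated near zero.
pvals = [rng.random() for _ in range(8000)]
pvals += [rng.betavariate(1, 30) for _ in range(2000)]

null_fraction = estimate_null_fraction(pvals)
print(f"estimated null fraction: {null_fraction:.3f} (true value 0.8)")
```

If some alternates land in the flat region, the estimate comes out too high, which errs on the conservative side.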
Example: all null
• Consider the all null case.
• All marbles are light blue.
• The false discovery rate in the region to the left of the flat region is the estimated number of light blue marbles there (based on the flat region) divided by the number of marbles to the left of the flat region.
• This will be close to 100%
[Figure: p-value axis from 0 to 1, marking where the flat region starts and the level of the flat region.]
Example: all non-null
• Consider the all non-null case.
• All marbles are red and they are highly skewed.
• Flat region is essentially zero.
• The false discovery rate in the region to the left of the flat region is the estimated number of light blue marbles there (based on the flat region) divided by the number of marbles to the left of the flat region.
• This will be close to 0.
[Figure: p-value axis from 0 to 1, marking where the flat region starts.]
Example: mixed case
• Get a distribution of p-values.
• Find flat region.
• Estimate number of nulls in the left-of-flat region by extending the flat line.
• This gives the false discovery rate.
[Figure: number of genes having each p-value, over p-values from 0 to 1, with a flat line marking the base level of nulls and a possible p-value threshold.]
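The mixed-case recipe can be sketched as follows. The data are simulated with known labels so the estimate can be compared against the actual false discovery rate; the mixture proportions, the Beta(1, 30) shape for alternates, and `flat_start=0.5` are all illustrative assumptions.

```python
import random

def estimate_fdr(pvals, threshold, flat_start=0.5):
    """Estimated FDR when every gene with p <= threshold is called significant."""
    # Height of the flat line: null genes per unit of p-value.
    flat_count = sum(p > flat_start for p in pvals)
    nulls_per_unit = flat_count / (1.0 - flat_start)
    # Extend the flat line to the left of the threshold.
    expected_nulls = nulls_per_unit * threshold
    called = sum(p <= threshold for p in pvals)
    if called == 0:
        return 0.0
    return min(1.0, expected_nulls / called)

rng = random.Random(3)
# Hypothetical data with known labels so the estimate can be checked.
genes = [(rng.random(), "null") for _ in range(8000)]
genes += [(rng.betavariate(1, 30), "alternate") for _ in range(2000)]
pvals = [p for p, _ in genes]

threshold = 0.05
estimated = estimate_fdr(pvals, threshold)
called = [label for p, label in genes if p <= threshold]
actual = called.count("null") / len(called)
print(f"estimated FDR {estimated:.3f}, actual FDR {actual:.3f}")
```

In real data the labels are unknown, so only the estimate is available; the simulation just shows that extending the flat line gives a sensible number.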
Example: mixed case
• What would you estimate the false discovery rate to be in the case that we declare the entire area to the left of the possible p-value threshold to be significant?
• 10%, 25%, 50%?
[Figure: number of genes having each p-value, over p-values from 0 to 1, with a flat line marking the base level of nulls and a possible p-value threshold.]
Obtaining q-values from False Discovery Rate
• Suppose we order genes from least p-value to greatest.
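• Each p-value, used as a threshold, marks off a significance region on one of these histograms.
• The q-value of a gene having p-value p is the False Discovery Rate if the declared significance region had a threshold of p. (Storey and Tibshirani take the minimum False Discovery Rate over all thresholds at or above p, so that q-values never decrease as p-values increase.)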
[Figure: number of genes having each p-value, over p-values from 0 to 1, with a flat line marking the base level of nulls.]
The q-value of a gene having this p-value is the FDR if this p-value is the significance threshold.
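A minimal sketch of turning p-values into q-values. It uses each gene's own p-value as the threshold, estimates the FDR there, and then takes a running minimum from the largest p-value down, which is the monotonicity step from the Storey and Tibshirani paper rather than something stated on the slide. The `null_fraction` parameter is an assumption: it would normally come from the flat region, and the default of 1.0 is the conservative choice (with it, the result matches Benjamini-Hochberg adjusted p-values).

```python
def q_values(pvals, null_fraction=1.0):
    """Return one q-value per p-value, in the same order as the input."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        # Expected nulls below this p-value, divided by genes called at it.
        fdr = null_fraction * m * pvals[i] / rank
        running_min = min(running_min, fdr)
        qvals[i] = running_min
    return qvals

example = [0.9, 0.01, 0.5, 0.03, 0.02]
print([round(q, 3) for q in q_values(example)])
```

Genes with the three smallest p-values share the same q-value here, because calling all three costs no more in estimated FDR than calling only the first.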
Lessons for Research
• Mushy p-values (large error bars/few replicates) may force us to the far left in order to get a low False Discovery Rate.
• This may eliminate genes of interest.
• If testing out a gene is not too expensive, then we can accept a higher False Discovery Rate – nothing magical about 0.01.
[Figure: number of genes having each p-value, over p-values from 0 to 1, with a flat line marking the base level of nulls.]
Better p-values avoid the loss of genes at a small False Discovery Rate.
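These lessons can be illustrated with a small simulation. It is a sketch under assumptions that are not in the slides: z-score test statistics, 20% truly different genes, and effect sizes of 4 ("sharp", small error bars) versus 2 ("mushy", large error bars) in units of standard error.

```python
import math
import random

def simulate_pvalues(effect, rng, num_null=8000, num_alt=2000):
    """Two-sided p-values for z-scores: nulls ~ N(0, 1), alternates ~ N(effect, 1)."""
    zs = [rng.gauss(0.0, 1.0) for _ in range(num_null)]
    zs += [rng.gauss(effect, 1.0) for _ in range(num_alt)]
    return [math.erfc(abs(z) / math.sqrt(2)) for z in zs]

def count_discoveries(pvals, target_fdr, flat_start=0.5):
    """Number of genes that can be called while keeping estimated FDR <= target."""
    m = len(pvals)
    in_flat = sum(p > flat_start for p in pvals)
    null_fraction = min(1.0, in_flat / (m * (1.0 - flat_start)))
    best = 0
    for rank, p in enumerate(sorted(pvals), start=1):
        # Estimated FDR if this p-value is the significance threshold.
        if null_fraction * m * p / rank <= target_fdr:
            best = rank
    return best

rng = random.Random(4)
sharp_p = simulate_pvalues(4.0, rng)
mushy_p = simulate_pvalues(2.0, rng)
sharp = count_discoveries(sharp_p, 0.05)
mushy = count_discoveries(mushy_p, 0.05)
mushy_relaxed = count_discoveries(mushy_p, 0.20)
print(f"FDR 0.05: sharp {sharp}, mushy {mushy}; mushy at FDR 0.20: {mushy_relaxed}")
```

With the same number of truly different genes, the noisier data yield far fewer calls at a strict False Discovery Rate, and relaxing the rate recovers some of them at the price of more false leads.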