AffyDEComp: towards a benchmark for differential expression methods

Post on 19-Jan-2016

AffyDEComp: towards a benchmark for differential expression methods

Richard Pearson

School of Computer Science

University of Manchester

Overview

Why benchmark DE methods?

The Golden Spike data set

AffyDEComp

Conclusions

Recommendations

The need for benchmarks

Microarray analysis has many stages

Competing methods at each stage

Methodologists good at showing superiority

Results can appear contradictory

Confused end users’ choice driven by…

What they are familiar with

What colleagues use

What was used in their favourite paper

…and not by a scientific comparison

Benchmarking requirements

Methods: a set we wish to compare

Benchmark data: where truth is known

Metrics: by which to compare methods

Affycomp

Methods: summarisation methods

Benchmark data: various spike-in studies

Metrics: various, e.g. area under the ROC curve for a fold change classifier

Affycomp doesn’t compare DE methods

A benchmark for DE methods

Methods:

DE methods depend on summarisation

Compare summarisation/DE combinations

Benchmark data:

Affycomp spike-ins have few DE genes

Golden spike data has many DE genes, but also a few “issues”!

Metrics:

Based around areas under ROC curves
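
The slides do not say how the areas under the ROC curves are computed. As a minimal sketch, assuming each probeset has a score where larger means "more likely DE" and a known TP/TN label, the area can be obtained from the rank-sum (Mann-Whitney) identity. The function name `roc_auc` is hypothetical and is not taken from AffyDEComp.

```python
def roc_auc(scores, is_true_positive):
    """Area under the ROC curve via the Mann-Whitney rank-sum identity.

    scores: one number per probeset, larger = stronger evidence of DE.
    is_true_positive: one bool per probeset (True = TP, False = TN).
    Tied scores share the average of the ranks they span.
    """
    pairs = sorted(zip(scores, is_true_positive), key=lambda p: p[0])
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the group of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one TP and one TN")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUC of 1 means every TP is ranked above every TN, 0.5 is what random ranking gives on average, and values below 0.5 mean the TN are being ranked ahead of the TP.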

The Golden Spike data

3 “sample”, 3 “control” arrays

Many RNAs “spiked-in” at known levels

“DE”, “Equal” and “Empty” probesets.

Controversial data set

Non-uniform null p-value distributions - use ROC

Spike-in concentrations high - unrepresentative

“DE” spike-ins all up-regulated - unrepresentative

Concentrations and FC confounded - loess

Different FC between “Equal” and “Empty”

“Empty” > FC than “Equal”

Most analyses have treated both Empty and Equal as True Negatives - to what effect?

“Empty” > FC than “Equal”

To illustrate how analysis choices affect results, I’ll treat Empty and Equal as true negatives (TN) and DE<=1.2 as true positives (TP)

2-sided test

Large apparent difference between methods

Can you guess which paper used this chart?

2-sided test

Large apparent difference between methods

Are TP correctly identified as up-regulated?

1-sided test of up-regulation

Probesets identified as up-regulated not TP

1-sided test of down-regulation

DE probesets are mostly being identified as down-regulated, despite the fact that they are in truth up-regulated

We appear to be identifying TP as down-regulated
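
The difference between the 2-sided and 1-sided tests comes down to how a signed statistic is turned into a ranking. The sketch below assumes each probeset has one signed statistic (for example a t-statistic or a log fold change, positive meaning "higher in sample than control"). The names `ranking_score` and `top_calls` are hypothetical helpers for illustration only.

```python
def ranking_score(statistic, direction):
    """Convert a signed DE statistic into a score where larger = called first."""
    if direction == "two-sided":
        return abs(statistic)   # any large change counts, whatever its sign
    if direction == "up":
        return statistic        # only increases count as evidence
    if direction == "down":
        return -statistic       # only decreases count as evidence
    raise ValueError("direction must be 'two-sided', 'up' or 'down'")


def top_calls(stat_by_probeset, direction, n):
    """Return the n probeset names ranked highest for the given direction."""
    ranked = sorted(
        stat_by_probeset,
        key=lambda name: ranking_score(stat_by_probeset[name], direction),
        reverse=True,
    )
    return ranked[:n]
```

A 2-sided ranking cannot tell whether a highly ranked TP was called in the right direction, which is why running the two 1-sided tests separately exposes the problem described on these slides.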

DE <=1.2 lower than Empty

TP are identified as down-regulated because most TN are “Empty” which have higher FC than DE <= 1.2

Remove “empty” probesets

We can remedy this by using just Equal probesets as our TN…

…bearing in mind that this makes the data somewhat atypical
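
The effect of the TN choice can be reproduced on a toy example. The log2 fold changes below are made up purely for illustration and are not taken from the Golden Spike data; they only mimic the pattern described above, where Empty probesets sit above the low-fold-change DE probesets and Equal probesets sit near zero. `pairwise_auc` and `auc_for` are hypothetical helper names.

```python
def pairwise_auc(tp_scores, tn_scores):
    """AUC as the fraction of (TP, TN) pairs where the TP scores higher."""
    wins = 0.0
    for tp in tp_scores:
        for tn in tn_scores:
            if tp > tn:
                wins += 1.0
            elif tp == tn:
                wins += 0.5  # ties count as half a win
    return wins / (len(tp_scores) * len(tn_scores))


# Made-up log2 fold changes (assumption: illustrative values only).
de_low_fc = [0.10, 0.15, 0.20, 0.25]            # truly up-regulated, small FC
equal = [-0.05, 0.00, 0.02, 0.05]               # spiked in at equal levels
empty = [0.28, 0.30, 0.33, 0.35, 0.40, 0.45]    # not spiked in


def auc_for(direction, tn_scores):
    """AUC of a 1-sided test that ranks by signed log fold change."""
    sign = 1.0 if direction == "up" else -1.0
    return pairwise_auc(
        [sign * s for s in de_low_fc],
        [sign * s for s in tn_scores],
    )


results = {
    ("up", "Equal+Empty"): auc_for("up", equal + empty),
    ("down", "Equal+Empty"): auc_for("down", equal + empty),
    ("up", "Equal only"): auc_for("up", equal),
    ("down", "Equal only"): auc_for("down", equal),
}
```

With these made-up numbers the test of up-regulation scores below 0.5 when Empty is included in TN, while the test of down-regulation scores above 0.5, even though every TP is up-regulated. Restricting TN to Equal reverses both results.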

Up-regulation - Empty in TN

Probesets identified as up-regulated generally not TP when using Empty in TN

Up-regulation - TN Equal

Probesets identified as up-regulated more likely to be TP when using only Equal as TN

Down-regulation - Empty in TN

DE probesets are mostly being identified as down-regulated, despite the fact that they are in truth up-regulated

We appear to be identifying TP as down-regulated when including Empty in TN

Down-regulation - TN Equal

We generally don’t identify TP as down-regulated when excluding Empty in TN

“Recommended” test

We recommend using just Equal as TN, and all DE as TP

Recommended Up-reg

Using our recommendations, tests of up-regulation generally find TP, as expected

Recommended Down-reg

Using our recommendations, tests of down-regulation generally don’t find TP, as expected

Analysis decisions to make

Summarisation method

DE method

Direction of DE (recommend up)

Choice of true negatives (equal only)

Choice of true positives (all DE)

Post-summarisation normalisation (loess using equal only)

Type of ROC chart (standard ROC)

Proportion of x-axis to display (all)
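
One way to keep these decisions explicit in an analysis script is to record them in a single settings object whose defaults are the recommendations listed above. This is only a sketch: the class name, field names and string values are hypothetical and do not correspond to any AffyDEComp interface. The slides give no recommended summarisation or DE method, so those two fields have no default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkChoices:
    """Analysis decisions for one benchmark run (hypothetical structure)."""

    summarisation_method: str                   # no recommendation given
    de_method: str                              # no recommendation given
    direction: str = "up"                       # recommended: up
    true_negatives: tuple = ("Equal",)          # recommended: Equal only
    true_positives: str = "all DE"              # recommended: all DE
    normalisation: str = "loess using Equal only"
    roc_type: str = "standard"
    x_axis_proportion: float = 1.0              # recommended: show all

    def __post_init__(self):
        # Reject values that fall outside the choices named on the slide.
        if self.direction not in ("up", "down", "two-sided"):
            raise ValueError("direction must be 'up', 'down' or 'two-sided'")
        if not 0.0 < self.x_axis_proportion <= 1.0:
            raise ValueError("x_axis_proportion must be in (0, 1]")


# Placeholder method names; substitute the combination being benchmarked.
choices = BenchmarkChoices("placeholder_summarisation", "placeholder_de_method")
```

Making the object immutable means a reported AUC can always be traced back to the exact set of choices that produced it.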

AffyDEComp - charts

AffyDEComp - comparison

AUCs - recommended choices

Conclusions

First step towards a reliable benchmark for DE

Golden Spike data has some value if use of empty probesets is revisited

Certain combinations of summarisation/DE methods seem poor

Keep it open (Bioconductor) - because science should be reproducible!

Recommendations

Create a new spike-in data set where

Spike-in concentrations are realistic

DE spike-ins both up- and down-regulated

Concentrations and FC not confounded

Larger number of arrays

Benchmarks using regulatory information

Benchmarks for Illumina data

Benchmarks for SNP chips (GWA studies)

manchester.ac.uk/bioinformatics/affydecomp
