
Rank-based Diagnostic Biomarkers

Mario Lauria, The Microsoft Research – University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy

PART 1

Overview of the COSBI method

THE PROBLEM WE ARE TRYING TO SOLVE

- The goal: to devise a computational method to classify clinical samples based on transcriptomics data.

- many diseases/conditions have a recognizable signature at the transcriptional level

- Motivation: to define a new way of diagnosing several diseases.

- high accuracy,

- early detection,

- first diagnostic test for several conditions

- Current issues:

- lack of repeatable results;

- a high-profile scandal has seriously undermined trust in new results

THE CANCER BIOMARKER SCANDAL AT DUKE

THE TECHNICAL ISSUES IN SIGNATURE IDENTIFICATION

Observed problems

- low degree of overlap between comparable studies

- lack of consensus on the size of a signature

Causes

- differences in lab protocols, normalization procedures, and confounding role of the “batch effect” (noise issue)

- misguided insistence on identifying the smallest number of “champions” (size issue)

Our solution

- use of a composite, large signature (100s of genes)

- signatures based on gene ranks, not expression values

- a similarity map as output, illustrating signature-to-signature distance

SKETCH OF OUR APPROACH

• Our signatures are based on rank => lower data quality requirements

• Output of our method is a similarity map of profiles => neighborhood inspection adds a measure of robustness

[Figure labels: expression profile of patient A; ranked list of genes; composite signature; full network; network with threshold applied to distances]
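The pipeline above (profile, ranked list, composite signature, signature-to-signature distance) can be illustrated with a minimal sketch. The slides do not state which distance the COSBI method uses, so the measure below is an assumption chosen for simplicity: it checks how close a signature's "up" genes sit to the top, and its "down" genes to the bottom, of the other profile's ranked list. All function names and the toy profiles are hypothetical, and both profiles are assumed to cover the same genes.

```python
from typing import Dict, List, Tuple

Signature = Tuple[List[str], List[str]]  # (up genes, down genes)


def rank_genes(profile: Dict[str, float]) -> List[str]:
    """Order genes from highest to lowest expression (ties broken by name)."""
    return sorted(profile, key=lambda g: (-profile[g], g))


def make_signature(ranked: List[str], n_up: int, n_down: int) -> Signature:
    """Composite signature: top n_up and bottom n_down genes of a ranked list."""
    return ranked[:n_up], ranked[len(ranked) - n_down:]


def mean_position(genes: List[str], ranked: List[str]) -> float:
    """Average normalized position of genes in a ranked list (0 = top, 1 = bottom)."""
    position = {g: i for i, g in enumerate(ranked)}
    return sum(position[g] for g in genes) / (len(genes) * (len(ranked) - 1))


def one_way_distance(sig: Signature, ranked_other: List[str]) -> float:
    """Assumed distance: small when up genes rank high and down genes rank low."""
    up, down = sig
    agreement = mean_position(down, ranked_other) - mean_position(up, ranked_other)
    return (1.0 - agreement) / 2.0  # agreement lies in [-1, 1]


def distance(a: Dict[str, float], b: Dict[str, float], n_up: int, n_down: int) -> float:
    """Symmetric distance: average of the two one-way distances."""
    ra, rb = rank_genes(a), rank_genes(b)
    d_ab = one_way_distance(make_signature(ra, n_up, n_down), rb)
    d_ba = one_way_distance(make_signature(rb, n_up, n_down), ra)
    return (d_ab + d_ba) / 2.0


# Toy profiles (hypothetical values). B has the same gene order as A on a
# different scale, as a batch effect might produce; C has the reverse order.
patient_a = {"g1": 10.0, "g2": 8.0, "g3": 6.0, "g4": 4.0, "g5": 2.0, "g6": 1.0}
patient_b = {"g1": 100.0, "g2": 90.0, "g3": 50.0, "g4": 30.0, "g5": 20.0, "g6": 5.0}
patient_c = {"g1": 1.0, "g2": 2.0, "g3": 4.0, "g4": 6.0, "g5": 8.0, "g6": 10.0}

d_ab = distance(patient_a, patient_b, n_up=2, n_down=2)
d_ac = distance(patient_a, patient_c, n_up=2, n_down=2)
```

Because only ranks enter the computation, rescaling a profile (A versus B) leaves the distance unchanged, which is the property the slides rely on for lower data quality requirements.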

ROBUSTNESS TO BATCH EFFECTS

• Signature analysis of miRNA expression data for a subset of immune cell types (data set GSE28489 from Allantaz et al., PLoS ONE 2012)

• Conventional map of rank-based signatures (below) shows separation by processing date

Clustering performed using PCA (from Allantaz et al. PLoS ONE 2012)

Clustering performed using rank-based signature method

- Our algorithm has no problem correctly clustering by cell types regardless of the date (right)

EXAMPLE OF MULTI-SET ANALYSIS

• Signature analysis of miRNA expression data computed on two sets from different labs

- data sets GSE28489 from HUG and GSE28487 from Roche [Allantaz et al. PLoS ONE 2012]

• Result: no obvious separation between HUG and Roche samples (see …_repn and …_REPn labels, respectively)

3RD PARTY VERIFICATION OF PERFORMANCE

• The SBV IMPROVER challenge is a crowdsourcing approach to the problem of defining an effective diagnostic signature

• It is a joint initiative of IBM Research and Philip Morris International

• Challenge participants were asked to establish predictive signatures on unlabeled gene expression data sets for four diseases:

• Psoriasis

• Multiple Sclerosis

• Chronic Obstructive Pulmonary Disease

• Lung Cancer

• The submitted predictions and signatures were subsequently scored by an Independent Scoring Committee against the Gold Standard

SBV IMPROVER CHALLENGE RESULTS

- Our purpose in entering the competition was to find out how well a method based on ranks and no prior knowledge would do

- result: we placed 1st in the Multiple Sclerosis sub-challenge, and 2nd overall out of 52 teams

DETAILS OF RESULTS

• Respectable performance across the board

• low performance on COPD probably points to a weakness of our method with respect to differences in genetic background

sub-challenge   rank   acc      auroc    aupr
COPD            24     0.5625   0.5820   0.6636
                                0.5820   0.4588
Lung Cancer     7      0.4800   0.7280   0.4524
                                0.6366   0.3753
                                0.6592   0.4436
                                0.6327   0.4389
MS Diag         1      0.8833   0.8973   0.8439
                                0.8973   0.9047
Psoriasis       11     0.9839   0.9857   0.9643
                                0.9857   0.9938
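For readers unfamiliar with the column headings, the sketch below computes two of them, accuracy (acc) and area under the ROC curve (auroc), using their standard textbook definitions. How the challenge's scoring committee computed its metrics is not stated in these slides, so this is only an illustration; the labels and confidence scores are made-up toy values.

```python
from typing import List


def accuracy(labels: List[int], scores: List[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auroc(labels: List[int], scores: List[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): 1 = disease, 0 = control
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.2]

toy_acc = accuracy(toy_labels, toy_scores)
toy_auroc = auroc(toy_labels, toy_scores)
```

In a two-class problem the AUROC is the same whichever class is treated as positive, whereas the area under the precision-recall curve generally differs between the two.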

PART 2

The MS Diagnostic sub-challenge

THE MS DIAGNOSTIC SUBCHALLENGE

• OBJECTIVE: predicting relapsing-remitting multiple sclerosis (RRMS) or Control patients, based on the Peripheral Blood Mononuclear Cells (PBMC) transcriptome

• METHOD: two-step procedure:

• we used the full E-MTAB-69 public dataset as training set,

• we then built a combined map of the SBV IMPROVER samples plus a subset of E-MTAB-69 samples

• RESULT: the combined map (see next slide) produced two clusters of IMPROVER samples, which were later identified as MS/control by inspecting the differential expression of well-known MS-associated genes

• COSBI algorithm parameters:

• signature of size 200/300 (up/down)

• top 20% distances used for the map
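The distance threshold can be sketched as follows. The sketch assumes that "top 20% distances" means the 20% of sample pairs with the smallest distances are kept as edges of the map; the slides do not spell this out. The function name, sample names and distance values are hypothetical.

```python
from typing import Dict, Set, Tuple


def threshold_map(
    distances: Dict[Tuple[str, str], float], keep_fraction: float = 0.20
) -> Dict[str, Set[str]]:
    """Keep the closest keep_fraction of pairs and return each sample's neighbors."""
    ordered = sorted(distances.items(), key=lambda item: item[1])
    n_keep = max(1, round(keep_fraction * len(ordered)))
    neighbors: Dict[str, Set[str]] = {}
    for (a, b), _ in ordered[:n_keep]:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


# Toy pairwise distances between five samples (hypothetical values)
toy_distances = {
    ("s1", "s2"): 0.05, ("s1", "s3"): 0.60, ("s1", "s4"): 0.70, ("s1", "s5"): 0.50,
    ("s2", "s3"): 0.65, ("s2", "s4"): 0.75, ("s2", "s5"): 0.55, ("s3", "s4"): 0.10,
    ("s3", "s5"): 0.45, ("s4", "s5"): 0.40,
}

toy_map = threshold_map(toy_distances, keep_fraction=0.20)
```

With 10 pairs and a 20% threshold, only the two closest pairs survive, so samples without a close neighbor (here `s5`) drop out of the map. The resulting edge list is the kind of input a network tool such as Cytoscape could then cluster.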

MS DIAGNOSIS SUBCHALLENGE: THE SAMPLES MAP

• Map of samples from two datasets:

• E-MTAB-69 (red, green, blue)

• SBV Improver datasets (pink nodes)

• Clustering performed using GLay algorithm in Cytoscape

[Figure panels: before clustering; after clustering]

SENSITIVITY OF ALGORITHM PARAMETERS

• Low sensitivity to both signature length and distance threshold

[Figure: map obtained with Nup = 200 / Ndown = 300, top 20% of edges]



RELATED WORK

• Rank-based methods are not new

• k-TSP is a signature definition method based on small (size k) signatures [Tan et al. 2005]

• The combined use of ranks and maps was proposed by Iorio [Iorio et al. 2010] for analyzing Connectivity Map (cMAP) datasets
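For contrast with the large signatures used here, the pairwise score at the heart of top-scoring-pair methods can be sketched as below. It follows the usual textbook description (the score of a gene pair is the absolute difference, between the two classes, in how often the first gene is expressed below the second); k-TSP, as commonly described, then combines the k best disjoint pairs by majority vote. The function names and toy data are hypothetical, and this is not the reference implementation.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def tsp_score(x_i: List[float], x_j: List[float], labels: List[int]) -> float:
    """|P(x_i < x_j | class 0) - P(x_i < x_j | class 1)| estimated from samples."""
    def fraction_below(cls: int) -> float:
        idx = [k for k, y in enumerate(labels) if y == cls]
        return sum(x_i[k] < x_j[k] for k in idx) / len(idx)
    return abs(fraction_below(0) - fraction_below(1))


def top_scoring_pair(expr: Dict[str, List[float]], labels: List[int]) -> Tuple[str, str]:
    """Return the gene pair with the highest score."""
    return max(
        combinations(sorted(expr), 2),
        key=lambda pair: tsp_score(expr[pair[0]], expr[pair[1]], labels),
    )


# Toy expression values: three genes, six samples (hypothetical)
toy_expr = {
    "gA": [1.0, 2.0, 1.0, 5.0, 6.0, 7.0],
    "gB": [3.0, 4.0, 5.0, 2.0, 1.0, 3.0],
    "gC": [2.0, 1.0, 4.0, 6.0, 4.0, 2.0],
}
toy_classes = [0, 0, 0, 1, 1, 1]

best_pair = top_scoring_pair(toy_expr, toy_classes)
```

Like the rank-based signatures above, the score depends only on the relative order of genes within a sample, not on their absolute expression values.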

CONCLUSIONS AND FUTURE APPLICATIONS

• Our method is intuitive, quite general and completely oblivious of the underlying biology of the disease

• The method in its simplest form already performs well

• It can be applied to any high-dimensional data, therefore many applications are conceivable beyond gene expression signatures

• Current/future work:

• algorithm improvements: selection of signature size, other methods of gene selection

• new applications: very encouraging results on profiles of circulating miRNA as diagnostic biomarkers

ACKNOWLEDGMENTS

• Francesco Iorio (formerly TIGEM, now EBI) for the insightful discussions

• PMI for the funding

• Corrado Priami and others at COSBI for insightful discussions and support

THANK YOU

LARGE MS MAP
