Rank-based Diagnostic Biomarkers
Mario Lauria, The Microsoft Research – University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
PART 1
Overview of the COSBI method
THE PROBLEM WE ARE TRYING TO SOLVE
- The goal: to devise a computational method to classify clinical samples based on transcriptomics data.
- many diseases/conditions have a recognizable signature at the transcriptional level
- Motivation: to define a new way of diagnosing several diseases.
- high accuracy,
- early detection,
- first diagnostic test for several conditions
- Current issues:
- lack of repeatable results;
- a high profile scandal has seriously undermined trust in new results
THE CANCER BIOMARKER SCANDAL AT DUKE
THE TECHNICAL ISSUES IN SIGNATURE IDENTIFICATION
Observed problems
- low degree of overlap between comparable studies
- lack of consensus on the size of a signature
Causes
- differences in lab protocols, normalization procedures, and confounding role of the “batch effect” (noise issue)
- misguided insistence on identifying the smallest number of “champions” (size issue)
Our solution
- use of a composite, large signature (100s of genes)
- signatures based on gene ranks, not expression values
- a similarity map as output, illustrating signature-to-signature distance
SKETCH OF OUR APPROACH
• Our signatures are based on rank => lower data quality requirements
• Output of our method is a similarity map of profiles => neighborhood inspection adds a measure of robustness
[Figure labels: expression profile of patient A; ranked list of genes; composite signature; full network; network with threshold applied to distances]
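The steps named on this slide (expression profile, ranked list of genes, composite signature, signature-to-signature distance) can be illustrated with a minimal Python sketch. It is not the COSBI implementation: the function names and toy data are hypothetical, and the Jaccard-based distance is a stand-in chosen for brevity, since the slides do not specify the distance function. The toy data also show why ranks lower the data quality requirements: a monotone rescaling of a profile leaves its signature unchanged.

```python
import random

def rank_signature(profile, n_up, n_down):
    """Indices of the n_up highest- and n_down lowest-expressed genes."""
    order = sorted(range(len(profile)), key=lambda g: profile[g])  # ascending
    down = frozenset(order[:n_down])
    up = frozenset(order[len(order) - n_up:])
    return up, down

def jaccard(x, y):
    """Overlap of two index sets, in [0, 1]."""
    union = x | y
    return len(x & y) / len(union) if union else 1.0

def signature_distance(sig_a, sig_b):
    """Stand-in distance (an assumption): 1 - mean Jaccard overlap of up and down sets."""
    (up_a, down_a), (up_b, down_b) = sig_a, sig_b
    return 1.0 - 0.5 * (jaccard(up_a, up_b) + jaccard(down_a, down_b))

def distance_matrix(profiles, n_up, n_down):
    """Symmetric matrix of pairwise signature distances."""
    sigs = [rank_signature(p, n_up, n_down) for p in profiles]
    n = len(sigs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = signature_distance(sigs[i], sigs[j])
    return dist

# Toy data (hypothetical): one profile, a rescaled copy of it, an unrelated profile.
random.seed(0)
base = [random.gauss(0, 1) for _ in range(1000)]
rescaled = [3.0 * x + 10.0 for x in base]  # monotone transform, ranks unchanged
unrelated = [random.gauss(0, 1) for _ in range(1000)]
dist = distance_matrix([base, rescaled, unrelated], n_up=200, n_down=300)
```

In this toy case the distance between `base` and `rescaled` is zero, while the unrelated profile sits far from both.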
ROBUSTNESS TO BATCH EFFECTS
• Signature analysis of miRNA expression data for a subset of immune cell types (data set GSE28489, from Allantaz et al., PLoS ONE 2012)
• Conventional PCA-based clustering (below) shows separation by processing date
Clustering performed using PCA (from Allantaz et al. PLoS ONE 2012)
Clustering performed using rank-based signature method
- Our algorithm has no problem correctly clustering by cell types regardless of the date (right)
EXAMPLE OF MULTI-SET ANALYSIS
• Signature analysis of miRNA expression data computed on two sets from different labs
- data sets GSE28489 from HUG and GSE28487 from Roche [Allantaz et al. PLoS ONE 2012]
• Result: no obvious separation between HUG and Roche samples (see …_repn and …_REPn labels, respectively)
3RD PARTY VERIFICATION OF PERFORMANCE
• The SBV IMPROVER challenge is a crowdsourcing approach to the problem of defining an effective diagnostic signature
• It is a joint initiative of IBM Research and Philip Morris International
• Challenge participants were asked to establish predictive signatures on unlabeled gene expression data sets for four diseases:
- Psoriasis
- Multiple Sclerosis
- Chronic Obstructive Pulmonary Disease
- Lung Cancer
• The submitted predictions and signatures were subsequently scored by an Independent Scoring Committee against the Gold Standard
SBV IMPROVER CHALLENGE RESULTS
- Our purpose in entering the competition was to find out how well a method based on ranks and zero previous knowledge would do
- result: we placed 1st in the Multiple Sclerosis sub-challenge, and 2nd overall out of 52 teams
DETAILS OF RESULTS
• Respectable performance across the board
• low performance on COPD probably points to a weakness of our method with respect to differences in genetic background
Sub-challenge  rank  acc     auroc   aupr
COPD           24    0.5625  0.5820  0.6636
                             0.5820  0.4588
Lung Cancer    7     0.4800  0.7280  0.4524
                             0.6366  0.3753
                             0.6592  0.4436
                             0.6327  0.4389
MS Diag        1     0.8833  0.8973  0.8439
                             0.8973  0.9047
Psoriasis      11    0.9839  0.9857  0.9643
                             0.9857  0.9938
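For readers unfamiliar with the column headers, the sketch below shows one standard way to compute accuracy, AUROC and an AUPR estimate for a two-class problem. The slides do not say how the challenge's scoring committee computed these values, so the definitions here are assumptions (AUPR in particular has several conventions; average precision is used here), and the function names and toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """One common AUPR estimate: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for k, (_, t) in enumerate(ranked, start=1):
        if t == 1:
            hits += 1
            total += hits / k
    return total / hits

# Toy example (hypothetical data): 1 = disease, 0 = control.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
roc = auroc(y_true, scores)
ap = average_precision(y_true, scores)
```

With complementary scores, a two-class AUROC is the same whichever class is treated as positive, whereas AUPR is not; this is consistent with the repeated auroc values and differing aupr values in the two-class rows of the table.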
PART 2
The MS Diagnostic sub-challenge
THE MS DIAGNOSTIC SUBCHALLENGE
• OBJECTIVE: predicting relapsing-remitting multiple sclerosis (RRMS) or Control patients, based on the Peripheral Blood Mononuclear Cells (PBMC) transcriptome
• METHOD: two-step procedure:
- we used the full E-MTAB-69 public dataset as training set,
- we then built a combined map of the SBV IMPROVER samples plus a subset of E-MTAB-69 samples
• RESULT: the combined map (see next slide) produced two clusters of IMPROVER samples, which were later identified as MS/control by inspecting the differential expression of well-known MS-associated genes
• COSBI algorithm parameters:
- signature of size 200/300 (up/down)
- top 20% distances used for the map
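The distance threshold mentioned in the parameters can be sketched as follows. This is an illustration under stated assumptions, not the COSBI code: it assumes that "top 20% distances" means the 20% smallest pairwise distances (the most similar pairs) are kept as edges of the map, and the function name `threshold_edges` and the toy matrix are hypothetical.

```python
def threshold_edges(dist, keep_fraction=0.20):
    """Keep the given fraction of sample pairs with the smallest distances.

    dist is a symmetric matrix (list of lists); returns (i, j, distance)
    tuples sorted from most to least similar.
    """
    n = len(dist)
    pairs = [(dist[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort()
    n_keep = max(1, round(keep_fraction * len(pairs)))
    return [(i, j, d) for d, i, j in pairs[:n_keep]]

# Toy distance matrix (hypothetical): five samples placed on a line.
positions = [0, 1, 3, 7, 15]
toy = [[abs(a - b) for b in positions] for a in positions]

edges_20 = threshold_edges(toy, keep_fraction=0.20)  # 2 of the 10 pairs
edges_10 = threshold_edges(toy, keep_fraction=0.10)  # 1 of the 10 pairs
```

The resulting edge list is what a network tool would take as input for layout and clustering.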
MS DIAGNOSIS SUBCHALLENGE: THE SAMPLES MAP
• Map of samples from two datasets:
• E-MTAB-69 (red, green, blue)
• SBV IMPROVER datasets (pink nodes)
• Clustering performed using the GLay algorithm in Cytoscape
After clustering
Before clustering
SENSITIVITY OF ALGORITHM PARAMETERS
• Low sensitivity to both signature length and distance threshold
Nup = 200 / Ndown = 300, top 20% of edges
Nup = 200 / Ndown = 300, top 10% of edges
Nup = 250 / Ndown = 250, top 20% of edges
RELATED WORK
• Rank-based methods are not new
• k-TSP is a signature definition method based on small (size k) signatures [Tan et al 2005]
• The combined use of rank + maps was proposed by Iorio [Iorio et al 2010] for analyzing Connectivity Map (cMAP) datasets
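For contrast with the large signatures used in this talk, the idea behind top-scoring pairs can be sketched in a few lines: each gene pair is scored by how much the probability of one gene being expressed below the other differs between the two classes, and the k best disjoint pairs form the signature. This is a simplified illustration, not the published k-TSP code; the function names and toy data are hypothetical, and tie-breaking and the final voting classifier are omitted.

```python
from itertools import combinations

def tsp_scores(samples, labels):
    """Score each gene pair (i, j) by |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|."""
    n_genes = len(samples[0])
    groups = {c: [s for s, y in zip(samples, labels) if y == c] for c in (0, 1)}
    scores = {}
    for i, j in combinations(range(n_genes), 2):
        p = {c: sum(s[i] < s[j] for s in g) / len(g) for c, g in groups.items()}
        scores[(i, j)] = abs(p[0] - p[1])
    return scores

def top_k_pairs(scores, k):
    """Greedily pick the k best-scoring pairs that share no gene."""
    chosen, used = [], set()
    for (i, j), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if i not in used and j not in used:
            chosen.append((i, j))
            used.update((i, j))
        if len(chosen) == k:
            break
    return chosen

# Toy data (hypothetical): genes 0 and 1 swap order between the two classes.
samples = [[1, 2, 5, 6], [0, 3, 5, 6], [2, 1, 5, 6], [3, 0, 5, 6]]
labels = [0, 0, 1, 1]

scores = tsp_scores(samples, labels)
best = top_k_pairs(scores, k=1)
```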
CONCLUSIONS AND FUTURE APPLICATIONS
• Our method is intuitive, quite general and completely oblivious of the underlying biology of the disease
• The method in its simplest form already performs well
• It can be applied to any high-dimensional data; many applications are therefore conceivable beyond gene expression signatures
• Current/future work:
- algorithm improvements: selection of signature size, other methods of gene selection
- new applications: very encouraging results on profiles of circulating miRNA as diagnostic biomarkers
ACKNOWLEDGMENTS
• Francesco Iorio (formerly TIGEM, now EBI) for the insightful discussions
• PMI for the funding
• Corrado Priami and others at COSBI for insightful discussions and support
THANK YOU
LARGE MS MAP