
Discovering Combinatorial Biomarkers

Vipin Kumar

[email protected]

http://www.cs.umn.edu/~kumar

Department of Computer Science and Engineering

ICCABS, Feb 2012

High-throughput technologies

DNA Methylation, Proteins

Gene Expression & non-coding RNA

SNP

Structural Variation, Metabolites

Clinical Data e.g. brain imaging

Adapted from E. Schadt


Data mining offers a potential solution for the analysis of these large-scale datasets
• Novel associations between genotypes and phenotypes
• Biomarker discovery for complex diseases
• Personalized medicine: automated analysis of patients' histories for customized treatment

Biomarker Discovery and its Impact

Biomarkers:

Genes: BRCA1 (breast cancer)

Protein variants: IVS5-13insC (type 2 diabetes)

Pathways/networks: P53 (cancers)

Clinical Impact: Diagnosis, Prognosis, Treatment

fMRI: Schizophrenia vs. controls (Lim et al.)

Miki et al. 1994; Chiefari et al. 2011; Oren et al. 2010

SNP as an illustration


NHGRI GWA Catalog www.genome.gov/GWAStudies

Published Genome-wide Associations through 06/2010: 904 published GWA at p ≤ 5×10^-8 for 165 traits


Published Genome-wide Associations through 06/2011: 1,449 published GWA at p ≤ 5×10^-8 for 237 traits


50% increase in one year

High coverage but low odds ratio (1.2)

High odds ratio (15.9) but low coverage (7%)

No significant associations


Challenge: Limitations of Single-locus Association Test

Many other studies
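The trade-off between odds ratio and coverage noted above can be made concrete with a small calculation. This is a minimal sketch under stated assumptions: the carrier counts are made up for illustration and come from no cited study, the function names are hypothetical, and "coverage" is assumed to mean the fraction of cases that carry the variant.

```python
def odds_ratio(case_carriers, case_total, ctrl_carriers, ctrl_total):
    """Odds ratio of carrying a variant in cases versus controls (2x2 table)."""
    a = case_carriers                # cases with the variant
    b = case_total - case_carriers   # cases without the variant
    c = ctrl_carriers                # controls with the variant
    d = ctrl_total - ctrl_carriers   # controls without the variant
    return (a * d) / (b * c)


def coverage(case_carriers, case_total):
    """Fraction of cases that carry the variant (assumed meaning of coverage)."""
    return case_carriers / case_total


# Hypothetical counts: a common variant with a weak effect and a rare
# variant with a strong effect.
common = dict(case_carriers=60, case_total=100, ctrl_carriers=50, ctrl_total=100)
rare = dict(case_carriers=8, case_total=100, ctrl_carriers=1, ctrl_total=100)

for name, t in (("common", common), ("rare", rare)):
    print(name,
          "odds ratio = %.2f" % odds_ratio(**t),
          "coverage = %.0f%%" % (100 * coverage(t["case_carriers"], t["case_total"])))
```

With these made-up counts the common variant covers most cases but barely separates them from controls, while the rare variant separates well but explains few cases.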


• Given a SNP data set of Myeloma patients, find SNPs that are associated with short vs. long survival.

• 3404 SNPs selected from various regions of the chromosome

• 70 cases (patients who survived less than 1 year)

• 73 controls (patients who survived longer than 3 years)

[Figure: data matrix of cases and controls over the 3404 SNPs]

An Example Where a Single-locus Test Led to No Significant Associations

Top-ranked SNP: -log10(P-value) = 3.8; Odds Ratio = 3.7

Van Ness et al 2008
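One way to see why the top-ranked SNP does not count as significant is to compare it with a multiple-testing threshold. The sketch below assumes a Bonferroni correction at a family-wise error rate of 0.05; the slide does not say which correction was used, so both choices are assumptions. Only the SNP count and the top -log10(P-value) are taken from the slide.

```python
import math

n_snps = 3404            # number of SNPs tested (from the slide)
alpha = 0.05             # assumed family-wise error rate
top_neg_log10_p = 3.8    # top-ranked SNP (from the slide)

# Bonferroni: each test must reach alpha / n_snps.
bonferroni_p = alpha / n_snps
threshold = -math.log10(bonferroni_p)
is_significant = top_neg_log10_p >= threshold

print(f"Bonferroni threshold: -log10(p) >= {threshold:.2f}")
print(f"Top SNP significant: {is_significant}")
```

Under these assumptions the threshold is about 4.83, so a top score of 3.8 falls short.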

The Myeloma SNP data signal the need to discover combinations of SNPs

Myeloma Survival Data


Single-locus Tests Ignore Genetic Interaction

Ripke et al. 2011

Costanzo et al. 2010; Scholl et al. 2009; Ruzankina et al. 2009; Kamath, 2003

Extensively observed in model organisms, e.g. yeast, C. elegans, fly.

Non-additive effect: “Genetic Interaction”

A synthetic pattern

The focus of this talk: Higher-order Combinatorial Biomarkers


Complex biological system ... Complex human diseases

Higher-order genetic buffering

[Figure: Disease vs. Control]

Triple mutations only exist in disease subjects

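[Figure: itemset lattice over five features A to E]

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

[Figure: the same lattice, from null through A B C D E to ABCDE, with one itemset marked "Disqualified"]

Prune all the supersets

Millions of users, thousands of items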

Discovering High-order Combinatorial Biomarkers

Challenge I: Computational Efficiency

Given n features, there are 2^n candidates!

How to effectively handle the combinatorial search space? Brute-force search, e.g. MDR, can only handle 10 to 100 SNPs [Ritchie et al. 2001].

The Apriori framework for efficient search of exponential space

[Agrawal et al. 1994]

Support based pruning
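Support-based pruning can be sketched as a level-wise search: once an itemset falls below the support threshold, none of its supersets is ever counted. The code below is a minimal, unoptimized illustration of the general Apriori idea, not the implementation used in this work; the function name `apriori` and the toy transactions are made up for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset meeting min_support."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {item for t in transactions for item in t}
    singles = count({frozenset([i]) for i in items})
    level = {c: s for c, s in singles.items() if s >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: discard candidates that have any infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = {c: s for c, s in count(candidates).items() if s >= min_support}
        frequent.update(level)
        k += 1
    return frequent


# Toy transactions; item D is rare, so no itemset containing D is ever counted.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]
result = apriori(transactions, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Pruning is safe here because support is anti-monotonic: adding an item to an itemset can never increase the number of transactions that contain it.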


[Fang et al. TKDE 2010]

A novel anti-monotonic objective function designed for mining low-support discriminative patterns from dense and high-dimensional data

• Traditional Apriori-based pattern mining techniques
  • Designed for sparse data

• Unique challenges of genomic datasets
  • High density
    • A SNP dataset has a density of 33.33%
    • Three binary columns per SNP, one for each of the three genotypes
  • High dimensionality
    • Makes the search more challenging
  • Disease heterogeneity
    • Each combination is supported by a small fraction of subjects
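The 33.33% density follows directly from the encoding: each SNP becomes three binary columns, and exactly one of them is set per subject. A minimal sketch, assuming genotypes are coded as the strings "AA", "AB" and "BB" and that no genotype is missing; the coding and the function names are assumptions made for illustration.

```python
GENOTYPES = ("AA", "AB", "BB")  # assumed genotype coding


def binarize(genotype_rows):
    """One-hot encode each SNP into three binary columns, one per genotype."""
    return [[int(g == code) for g in row for code in GENOTYPES]
            for row in genotype_rows]


def density(binary_rows):
    """Fraction of cells equal to 1 in the binary matrix."""
    cells = [v for row in binary_rows for v in row]
    return sum(cells) / len(cells)


# Two hypothetical subjects genotyped at three SNPs.
subjects = [
    ["AA", "AB", "BB"],
    ["AB", "AB", "AA"],
]
binary = binarize(subjects)
print(binary[0])
print(f"density = {density(binary):.4f}")
```

Because every group of three columns contains exactly one 1, the density is 1/3 regardless of the genotype frequencies.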

Targeting patterns with better association than their subsets reduces the number of hypothesis tests

[Figure: itemset lattice from null through A B C D E to ABCDE, with labels "Subsets having higher association" and "Subsets having lower association"]

• Computational challenges can be addressed by
  • Better algorithm design, e.g. Apriori-based
  • High-performance computing

• Statistical challenges call for additional efforts
  • Limited sample size
  • Huge number of hypothesis tests


[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness, Kumar, PLoS ONE, 2012]

Discovering High-order Combinatorial Biomarkers

Challenge II: Statistical Power

Many combinations are trivial extensions of their subsets

[Figure panels: Myeloma Survival Data; Kidney Rejection Data; Lung Cancer Data]
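The idea that a combination may be a trivial extension of its subsets can be illustrated by comparing a pattern's score with the best score among its subsets. The sketch below is only an illustration: it assumes the score is the difference in support between cases and controls, and it assumes "jump" means the pattern's score minus the best score of any proper non-empty subset. Neither is claimed to be the definition used in the cited papers, and the data are made up.

```python
from itertools import combinations


def support(pattern, rows):
    """Fraction of rows (sets of features) that contain every feature in pattern."""
    return sum(1 for r in rows if pattern <= r) / len(rows)


def diff_support(pattern, cases, controls):
    """Assumed association score: support in cases minus support in controls."""
    return support(pattern, cases) - support(pattern, controls)


def jump(pattern, cases, controls):
    """Score of pattern minus the best score of its proper non-empty subsets.

    Assumed definition; pattern must contain at least two features.
    """
    subsets = [frozenset(s) for k in range(1, len(pattern))
               for s in combinations(pattern, k)]
    best_subset = max(diff_support(s, cases, controls) for s in subsets)
    return diff_support(pattern, cases, controls) - best_subset


# Hypothetical subjects, each described by the set of features they carry.
cases = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
controls = [{"A"}, {"B", "C"}, {"A", "C"}, {"B"}]

print("jump(AB)  =", jump(frozenset("AB"), cases, controls))
print("jump(ABC) =", jump(frozenset("ABC"), cases, controls))
```

In this toy data AB improves on A and B alone, while ABC adds nothing beyond AB; skipping patterns like ABC is what keeps the number of hypothesis tests down.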


High-order Combinatorial Biomarkers: an example

[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness and Kumar, PLoS ONE, 2012]

[Fang, Pandey, Wang, Gupta, Steinbach and Kumar, IEEE TKDE, 2012]

[Figure: Patients vs. Control, all heavy smokers; best size-1, best size-2, best size-3, best size-4, and size-5 patterns; Jump]

Data from Church et al. 2010

www.ingenuity.com

Lung Cancer Data

The five genes are functionally related


Insights on High-order Functional Interactions

Patterns with positive Jump are functionally more coherent

[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness and Kumar, PLoS ONE, 2012]

[Figure: Lung cancer vs. Control on the lung cancer dataset; best size-1, best size-2, best size-3, best size-4, and size-5 patterns; Jump]

[Figure panels: Kidney Rejection Data; Lung Cancer Data; Combined]


High-order Combinations Discovered from Different Types of Data

[Fang, Haznadar, Wang, Yu, Steinbach, Church, Oetting, Van Ness and Kumar, PLoS ONE, 2012]

[Fang, Pandey, Wang, Gupta, Steinbach and Kumar, IEEE TKDE, 2012]

[Figure panels]

SNP: acute kidney rejection (Rejection vs. No-rejection). Data from Oetting et al. 2008

mRNA: Breast Cancer (Survived (5-year) vs. Control). Data from Vijver et al. 2002

Metabolites: COPD (AECOPD vs. Stable COPD). Data from Wendt et al. 2010

The proposed framework is general enough to handle different types of data


Biomarker Discovery using Error-tolerant Patterns

[Figure: binary data matrix with √ and X markers]

True patterns are fragmented due to noise and variability

Possible solution: Error-tolerant patterns

• These patterns differ in the way errors/noise in the data are tolerated

[Yang et al 2001]; [Pei et al 2001]; [Seppanen et al 2004]; [Liu et al 2006]; [Cheng et al 2006]; [Gupta et al., KDD 2008]; [Poernomo et al 2009]

See Gupta et al KDD 2008 for a survey
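How noise fragments a pattern, and how tolerating a few errors recovers it, can be shown with a toy count. The sketch below uses one simple relaxation, a row supports the pattern if it is missing at most a fixed number of the pattern's items. The cited methods differ in how they tolerate errors, so this is not any one of them; the gene names and rows are hypothetical.

```python
def exact_support(pattern, rows):
    """Number of rows that contain every item of the pattern."""
    return sum(1 for r in rows if pattern <= r)


def tolerant_support(pattern, rows, max_missing):
    """Number of rows missing at most max_missing items of the pattern."""
    return sum(1 for r in rows if len(pattern - r) <= max_missing)


# Hypothetical samples; each set lists the genes that are "on" in that sample.
rows = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g4"},   # g3 dropped, e.g. by measurement noise
    {"g1", "g3", "g4"},   # g2 dropped
    {"g1"},               # unrelated sample
]
pattern = frozenset({"g1", "g2", "g3", "g4"})

print("exact support:", exact_support(pattern, rows))
print("support tolerating 1 missing item:", tolerant_support(pattern, rows, max_missing=1))
```

Exact matching sees the pattern in one sample only, while tolerating a single missing item recovers the two noisy samples without admitting the unrelated one.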


Greater fraction of error-tolerant patterns enrich at least one gene set (higher precision)

Greater fraction of gene sets are enriched by at least one error-tolerant pattern (higher recall)

Four breast cancer gene-expression data sets are used for experiments: GSE7390 + GSE6532 + GSE3494 + GSE1456 (158 cases, 433 controls)

Cases: patients with metastasis within 5 years of follow-up; Controls: patients with no metastasis within 8 years of follow-up

Discriminative (case/control) error-tolerant and traditional association patterns are discovered and evaluated by enrichment analysis using MSigDB gene sets

[Figure legend: Error-tolerant patterns; Traditional patterns]

Error-tolerant pattern vs. Traditional association patterns

Gupta et al. BICoB 2010; Gupta et al. BMC Bioinformatics 2011


• Differential Expression (DE)
  – Traditional analysis targets changes of expression level

• Differential Coexpression (DC)
  – Changes of the coherence of gene expression

• Combinatorial Search

• Genetic Heterogeneity
  – calls for subspace analysis

[Silva et al., 1995], [Li, 2002], [Kostka & Spang, 2005], [Rosemary et al., 2008], [Cho et al. 2009] etc.

[Eisen et al. 1999] [Golub et al., 1999], [Pan 2002], [Cui and Churchill, 2003] etc.
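The contrast between differential expression and differential coexpression can be illustrated with a toy score: the average pairwise correlation of a gene set in one class minus the same quantity in the other class. This is a minimal sketch with made-up expression values and hypothetical function names. It uses all samples in each class, whereas subspace analysis would restrict the comparison to a subset of samples, which is not shown here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_coexpression(expr, genes):
    """Average pairwise correlation among genes; expr maps gene -> values per sample."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(expr[a], expr[b]) for a, b in pairs) / len(pairs)


# Hypothetical data: the genes move together in controls but not in cases,
# even though each gene has a similar overall level in both classes.
controls = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [1, 3, 5, 7]}
cases = {"g1": [1, 2, 3, 4], "g2": [4, 1, 3, 2], "g3": [2, 4, 1, 3]}

genes = ["g1", "g2", "g3"]
dc_score = mean_coexpression(controls, genes) - mean_coexpression(cases, genes)
print(f"coexpression in controls: {mean_coexpression(controls, genes):.2f}")
print(f"coexpression in cases:    {mean_coexpression(cases, genes):.2f}")
print(f"differential coexpression score: {dc_score:.2f}")
```

A gene set like this would be missed by an analysis that only looks for changes in expression level, since the change lies in how coherently the genes vary.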

Differential Coexpression Patterns

Subspace Differential Coexpression Analysis

[Figure annotations: ≈ 60%; ≈ 10%]

Enriched with the TNF-α/NFkB signaling pathway (6/10 overlap with the pathway, corrected p-value: 1.4×10^-3)

Suggests that the dysregulation of TNF-α/NFkB pathway may be related to lung cancer

Selected for highlight talk, RECOMB SB 2010; Best Network Model award, Sage Congress, 2010

[Fang, Kuang, Pandey, Steinbach, Myers and Kumar, PSB 2010]

Three lung cancer datasets [Bhattacharjee et al. 2001], [Stearman et al. 2005], [Su et al. 2007]


Combinatorial Biomarkers: Summary

• Higher-order combinations
  • Important for understanding complex human diseases

• A novel framework
  • Improved computational efficiency
  • Enhanced statistical power
  • Naturally handles disease heterogeneity
  • Error-tolerance
  • Different types of differentiation: coexpression

• General enough to handle different types of data
  • SNP
  • Gene expression
  • Metabolomic data
  • Brain imaging data (e.g. fMRI)


References

• G. Fang, R. Kuang, G. Pandey, M. Steinbach, C.L. Myers, and V. Kumar. Subspace differential coexpression analysis: problem definition and a general approach. Pacific Symposium on Biocomputing, 15:145-156, 2010.

• G. Fang, G. Pandey, W. Wang, M. Gupta, M. Steinbach, and V. Kumar. Mining low-support discriminative patterns from dense and high-dimensional data. IEEE TKDE, 24(2):279-294, 2012.

• G. Fang, Majda Haznadar, Wen Wang, Haoyu Yu, Michael Steinbach, Tim Church, William Oetting, Brian Van Ness, and Vipin Kumar. High-order SNP Combinations Associated with Complex Diseases: Efficient Discovery, Statistical Power and Functional Interactions. PLoS ONE, in press, 2012.

• R. Gupta, N. Rao, and V. Kumar. Discovery of error-tolerant biclusters from noisy gene expression data. BMC Bioinformatics, 12(S12):S1, 2011.

• R. Gupta, Smita Agrawal, Navneet Rao, Ze Tian, Rui Kuang, Vipin Kumar, "Integrative Biomarker Discovery for Breast Cancer Metastasis from Gene Expression and Protein Interaction Data Using Error-tolerant Pattern Mining", In Proc. of the International Conference on Bioinformatics and Computational Biology (BICoB), 2010

• Gowtham Atluri, Rohit Gupta, Gang Fang, Gaurav Pandey, Michael Steinbach and Vipin Kumar, Association Analysis Techniques for Bioinformatics Problems, Proceedings of the 1st International Conference on Bioinformatics and Computational Biology (BICoB), pp 1-13, 2009.

• S. Landman, Vipin Kumar, Michael Steinbach, and Haoyu Yu. Identification of Co-occurring Insertions in Cancer Genomes Using Association Analysis. International Journal of Data Mining and Bioinformatics, in press, 2012.

• M. Steinbach, H. Yu, G. Fang, and V. Kumar. Using constraints to generate and explore higher order discriminative patterns. Advances in Knowledge Discovery and Data Mining, pages 338-350, 2011.

• S. Dey, Gowtham Atluri, Michael Steinbach, Angus MacDonald, Kelvin Lim, and Vipin Kumar. A pattern mining based integrative framework for biomarker discovery. Tech report, Department of Computer Science, University of Minnesota, (002), 2012.

• G. Pandey, C. Myers, and V. Kumar. Incorporating functional inter-relationships into protein function prediction algorithms. BMC bioinformatics, 10(1):142, 2009.

• G. Pandey, B. Zhang, A.N. Chang, C.L. Myers, J. Zhu, V. Kumar, and E.E. Schadt. An integrative multi-network and multi-classifier approach to predict genetic interactions. PLoS computational biology, 6(9):e1000928, 2010 (Cited as one of the major computational biology breakthroughs of 2010 by a Nature Biotechnology feature article).

• J. Bellay, G. Atluri, T.L. Sing, K. Toufighi, M. Costanzo, P.S.M. Ribeiro, G. Pandey, J. Baller, B. VanderSluis, M. Michaut, et al. Putting genetic interactions in context through a global modular decomposition. Genome Research, 21(8):1375-1387, 2011.

Acknowledgement

Kumar Lab, Data Mining: Gang Fang, Wen Wang, Vanja Paunic, Yi Yang, Benjamin Oatley, Xiaoye Liu, Sanjoy Dey, Gowtham Atluri, Gaurav Pandey, Michael Steinbach

Myers Lab, FuncGenomics: Jeremy Bellay, Chad Myers

Kuang Lab, Compbio: TaeHyun Hwang, Rui Kuang

Wendt Lab, Lung Disease: Chris Wendt

Masonic Cancer Center: Tim Church, Bill Oetting

McDonald Lab, Behavior: Angus McDonald

Mayo Clinic-IBM-UMR fellowship, Walter Barnes Lang fellowship, NSF: #IIS0916439, UMII seed grant, BICB seed grant,

Computations enabled by the Minnesota Supercomputing Institute. BioMedical Genomics Center at University of Minnesota,

International Myeloma Foundation. Etiology and Early Marker Study program of the Prostate Lung Colorectal and Ovarian Cancer Screening Trial

Van Ness Lab, Myeloma: Brian Van Ness

Lim Lab, Brain Imaging: Kelvin Lim

Thanks!

