Combinatorial Methods for Disease Association
Search and Susceptibility Prediction
Alexander Zelikovsky
joint work with Dumitru BrinzaDepartment of Computer Science
2
Outline
SNPs, Haplotypes and Genotypes Disease Association Search
Genome-wide association search challenges Problem formulation Exhaustive & Combinatorial Search Optimization formulation & complimentary greedy search
Predicting susceptibility to complex diseases Problem formulation/cross-validation Previous methods: SVM, RF, LP Optimum clustering and prediction via model-fitting
Conclusions
3
Length of Human Genome 3 109 #Single nucleotide polymorphism (SNPs) 1 107 SNPs are mostly biallelic, e.g., AC Minor allele frequency should be considerable e.g. >.1% Difference b/w ALL people 0.25% (b/w any 2 0.1%) Diploid = two different copies of each chromosome Haplotype = description of a single copy (expensive)
example: 00110101 (0 is for major, 1 is for minor allele) Genotype = description of the mixed two copies
example 01122110 (0=00, 1=11, 2=01) International Hapmap project: www.hapmap.org
SNP, Haplotypes, Genotypes
4
Challenges of Disease Association Monogenic disease
A mutated gene is entirely responsible for the disease . Typically rare in population: < 0.1%.
Complex disease Interaction of multiple genes
2-SNP interaction analysis for a genome-wide scan with 1 million SNPs (3 kb coverage) has 1012 pairwise tests
Multiple independent causes Each cause explains < 10-20% of cases
Common: > 0.1%. In NY city, 12% of the population has Type 2 Diabetes
Multiple testing adjustment Reason for non-reproducible findings
5
Disease Association Search Problem
01012010201022100220110210120021020012001222111000200110022121011101202020100110012012001010001102102200020211120021011000212120
-1-1-1-11111
Disease Status
Non-diseased genotypes: H
Sample population Sof individual genotypes
Risk/resistance factor = multi-SNP combination (MSC) a subset of SNP-columns of S the values of these SNPs, 0, 1, or 2
Cluster C= subset of S with an MSC, d(C) = diseased, h(C) = non-diseased
PROBLEM: Find all MSCs significantly associated with the disease
Diseased genotypes: D
SNPs
6
Significance of Risk/Resistance Factors Measured by
Relative risk (RR) Odds ratio (OR) Their p-values
Unadjusted p-value: Probability of case/control distribution among exposed to risk factor, computed by binomial distribution
Multiple-testing adjustement: Bonferroni
easy to compute overly conservative
Randomization computationally expensive more accurate
7
Exhaustive & Combinatorial Search Exhaustive search is infeasible
sample with n genotypes/m SNPs requires O(n3m)
Combinatorial search Definition: Disease-closure of a multi-SNP combination C is a multi-
SNP combination C’, with maximum number of SNPs, which consists of the same set of disease individuals and minimum number of nondisease individuals.
Searches only closed clusters Closure of cluster C = C’ d(C’)=d(C) and h(C’) is minimized
Avoids checking of trivial MSCs Small d(C) implies not looking in subclusters
Finds faster associated MSCs but still too slow
Tagging: compress S by extracting most informative SNPs restore other SNPs from tag SNPs multiple regression method
8
MLR Tagging
Stepwise Index SNP Algorithm: Choose as a tag the SNP which best predicts all other SNPs Choose the next one which together with a first best predicts all
other SNPs and so on. Prediction method is based on Multiple Linear Regression
(MLR)
So far beats in quality other methods (STAMPA)
9
Data Sets Crohn's disease (Daly et al ): inflammatory bowel disease (IBD). Location: 5q31 Number of SNPs: 103 Population Size: 387 case: 144 control: 243
Autoimmune disorders (Ueda et al) : Location: containing gene CD28, CTLA4 and ICONS Number of SNPs: 108 Population Size: 1024 case: 378 control: 646
Tick-borne encephalitis dataset of (Barkash et al) : Location: containing gene TLR3, PKR, OAS1, OAS2, and OAS3. Number of SNPs: 41 Population Size: 75 case: 21 control: 54
10
Disease association search results IES(30):
exhaustive search 30 indexed SNPs with
MLR based tagging method
ICS(30): combinatorial
search 30 indexed SNPs
with MLR based tagging method.
11
Disease Association Search Optimum Association Search
Problem: Find MSC that is the most associated
with the disease Measure: positive predictive value
= find (non-)diseased-free cluster of maximum size
Bad news: Generalization of max independent set NP complete and cannot be well
approximated Hope: sample S is not arbitrary
12
Complimentary Greedy Algorithm Algorithm
Start with C=S (resp. MCS is empty) Repeat until h(C)=0 (non-diseased-free)
Find 1-SC s maximizing (h(C)-h(C {s})) / (d(C) – d(C {s})) = minimize payment with diseased for removal of non-diseased
Add s to SNPs of C’s MSC
Analogy: finding independent set by greedy removing highest degree vertecies
Extremely fast but inaccurate Can be used in susceptibility prediction
13
Most disease-associated & disease-resistant MSC
Comparison of three methods for searching the disease-associated and disease-resistant multi-SNPs combinations with the largest PPV. The starred values refer toresults of the runtime-constrained exhaustive search
14
Genetic Susceptibility Prediction
Given: Genotypes of diseased and non-diseased individuals, Genotype of a testing person.Find: The disease status of the testing person
01012010201022100220110210120021020012001222111000200110022121011101202020100110012012001010001102102200020211120021011000212120
-1-1-1-11111
Genotype Disease Status
healthy
sick
testing - gt 0110211101211201 s(gt)
15
Cross-validation Leave-one-out test: The disease status of each genotype
in the data set is predicted while the rest of the data is regarded as the training set.
-10101201020102210022011021012002102001200122211100020011002212101
Leave-many-out test: Repeat randomly picking 2/3 of the population as training set and predict the other 1/3.
-1-11
Genotype Real Disease Status
-1-1
1
PredictedDisease Status
1
0020011002212101 1 1
Accuracy = 80%
16
Quality Measures of Prediction (confusion table)
Sensitivity: The ability to correctly detect disease. sensitivity = TP/(TP+FN) Specificity: The ability to avoid calling normal as disease.
specificity = TN/(FP+TN) Accuracy = (TP +TN)/(TP+FP+FN+TN) Risk Rate: Measurements for risk factors.
PredictionDisease
+ -
Test
+True Positive False Positive
(TP) (FP)
-False Negative True Negative
(FN) (TN)
17
Prediction Methods Support vector machine Random forest LP-based prediction
18
Prediction via Clustering Drawback of the prediction problem formulation =
need of cross-validation no optimization Clustering P = partition into clusters defined by
MSC’s Minimizing number of errors S.t. bounded information entropy –∑(Si/S) log(Si/S)
Model-fitting prediction Set status of testing genotype to diseased Find number of errors Set status of testing genotype to diseased Find number of errors Predict status that implies lesser number of errors
19
Leave-1-out cross-validation results
Leave-one-out cross-validation for combinatorial search-based prediction (CSP) and complimentary greedy search-based prediction (CGSP) are given when 20, 30, or all SNPs are chosen as informative SNPs.
20
ROC curve
Comparison of 5 prediction methods on (Barkash et. al,2006 ) data on all SNPs.Area under the CSP’s curve is 0.81 vs 0.52 under the SVM’s curve.
21
Conclusions
Combinatorial search is able to find statistically significant multi-gene interactions, for data where no significant association was detected before
Complimentary greedy search can be used in susceptibility prediction
Optimization approach to prediction New susceptibility prediction is by 8%
higher than the best previously known MLR-tagging efficiently reduces the
datasets allowing to find associated multi-SNP combinations and predict susceptibility
22
International Symposium on Bioinformatics Research and Applications
May 6-9, 2007, Georgia State University, Atlanta, Georgia http://www.cs.gsu.edu/ISBRA/
Submissions must and must not exceed 12 pages in Springer LNCS style The proceedings of ISBRA 2007 will be published in LNBI
Important Dates Submission Deadline Notification of Acceptance Final Version Submission
December 20, 2006 January 31, 2007 February 21, 2007
Symposium Organizers • General Chairs: Dan Gusfield (University of California, Davis) and Yi Pan (Georgia State University) • Program Chairs: Ion Mandoiu (University of Connecticut) and
Alexander Zelikovsky (Georgia State University)
ISBRA provides a forum for the exchange of ideas and results among researchers, developers, and practitioners working on all aspects of bioinformatics and computational biology and their applications
23
History of ISBRA ISBRA is the successor of the International Workshop on Bioinformatics Research and Applications (IWBRA), held on
- May 22-25, 2005 in Atlanta, GA and
- May 28-31, 2006 in Reading, UK
in conjunction with the International Conference on Computational Science
The two editions of IWBRA have enjoyed a great success, with special issues devoted to full versions of selected papers in
Springer LNCS Transactions on Computational Systems Biology and
IEEE/ACM Transactions on Computational Biology and Bioinformatics
24
Support Vector Machine (SVM) Algorithm Learning Task
Given: Genotypes of patients and healthy persons. Compute: A model distinguishing if a person has the
disease. Classification Task
Given: Genotype of a new patient + a learned model Determine: If a patient has the disease or not.
Linear SVM Non-Linear SVM
25
Random Forest Algorithm
Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down to each tree in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Growing Tree, Split selection and Prediction.
Random sub-sample of training data, Random splitter selection.
26
LP-based Prediction Algorithm Model:
Certain haplotypes are susceptible to the disease while others are resistant to the disease.
The genotype susceptibility is assumed to be a sum of susceptibilities of its two haplotypes.
Assign a positive weight to susceptible haplotypes and a negative weight to resistant haplotypes such that for any control genotype the sum of weights of its haplotypes is negative and for any case genotype it is positive.
For each vertex-haplotype hi assign the weight pi,
such that for any genotype-edge ei,j =(hi,,hj )
where s(ei,j ) {-1,1} is the disease status of genotype ei,j. The sum of absolute values of genotype weights is maximized.