Eric Xing
© Eric Xing @ CMU, 2006-2010 1
Machine Learning
Application I: Computational Genomics
Eric Xing
Lecture 18, August 16, 2010
Biological Data Analysis
Dynamic, noisy, heterogeneous, high-dimensional data
• “High-resolution” inference
• Parsimonious
• Scalability
• Stability
• Sample complexity
• Confidence bound
Healthy
Sick
ACTCGTACGTAGACCTAGCATTTACGCAATAATGCGA
ACTCGAACCTAGACCTAGCATTTACGCAATAATGCGA
TCTCGTACGTAGACGTAGCATTTACGCAATTATCCGA
ACTCGAACCTAGACCTAGCATTTACGCAATTATCCGA
ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA
TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA
ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA
Single nucleotide polymorphism (SNP)
Causal (or "associated") SNP
Genetic Basis of Diseases
A . . . . T . . G . . . . . C . . . . . . T . . . . . A . . G .
A . . . . A . . C . . . . . C . . . . . . T . . . . . A . . G .
T . . . . T . . G . . . . . G . . . . . . T . . . . . T . . C .
A . . . . A . . C . . . . . C . . . . . . T . . . . . T . . C .
A . . . . T . . G . . . . . G . . . . . . A . . . . . A . . G .
T . . . . T . . C . . . . . G . . . . . . A . . . . . A . . C .
A . . . . A . . C . . . . . C . . . . . . A . . . . . T . . C .
Data
Genotype Phenotype
• Cancer: Dunning et al. 2009.
• Diabetes: Dupuis et al. 2010.
• Atopic dermatitis: Esparza-Gordillo et al. 2009.
• Arthritis: Suzuki et al. 2008.
Standard Approach
causal SNP
a univariate phenotype: e.g., disease/control
Genetic Association Mapping
Healthy
Cancer
ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA
ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA
TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA
ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA
ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA
TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA
ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA
Causal SNPs
Genetic Basis of Complex Diseases
Genetic Basis of Complex Diseases
Healthy
Cancer
ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA
ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA
TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA
ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA
ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA
TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA
ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA
Causal SNPs
Biological mechanism
Intermediate Phenotype
Genetic Basis of Complex Diseases
Healthy
Cancer
ACTCGTACGTAGACCTAGCATTACGCAATAATGCGA
ACTCGAACCTAGACCTAGCATTACGCAATAATGCGA
TCTCGTACGTAGACGTAGCATTACGCAATTATCCGA
ACTCGAACCTAGACCTAGCATTACGCAATTATCCGA
ACTCGTACGTAGACGTAGCATAACGCAATAATGCGA
TCTCGTACCTAGACGTAGCATAACGCAATAATCCGA
ACTCGAACCTAGACCTAGCATAACGCAATTATCCGA
Causal SNPs
Clinical records
Gene expression
Association to intermediate phenotypes
Association with Phenome
ACGTTTTACTGTACAATT
Multivariate complex syndrome (e.g., asthma): age at onset, history of eczema
genome-wide expression profile
ACGTTTTACTGTACAATT
Traditional Approach
causal SNP
a univariate phenotype: gene expression level
Structured Association
Structured Association: a New Paradigm
Standard Approach (phenotypes): consider one phenotype at a time
vs.
New Approach (phenome): consider multiple correlated phenotypes jointly
Genetic Association for Asthma Clinical Traits
Subnetworks for lung physiology
Subnetwork for quality of life
TCGACGTTTTACTGTACAATT
Statistical challenge: How to find associations to a multivariate entity in a graph?
Gene Expression Trait Analysis
TCGACGTTTTACTGTACAATT
Genes
Samples
Statistical challenge: How to find associations to a multivariate entity in a tree?
Sparse Associations
Pleiotropic effects
Epistatic effects
Sparse Learning
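Linear Model: $y = X\beta + \epsilon$
Lasso (Sparse Linear Regression): $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
Why sparse solution?
penalizing $\lambda \|\beta\|_1$ is equivalent to
constraining $\|\beta\|_1 \le t$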
[R.Tibshirani 96]
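A minimal sketch of how the lasso produces exact zeros, assuming the objective ||y - Xb||^2 + lam * ||b||_1 and a plain coordinate-descent solver. The solver choice, the function names `soft_threshold` and `lasso_cd`, and the toy data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything with |z| <= t becomes 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_b ||y - X b||^2 + lam * ||b||_1."""
    b = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            # Residual with the contribution of coordinate j added back
            r_j = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return b

# Hypothetical toy data: 10 inputs, only inputs 1 and 4 matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
b_true = np.zeros(10)
b_true[[1, 4]] = [2.0, -3.0]
y = X @ b_true + 0.1 * rng.normal(size=100)
b_hat = lasso_cd(X, y, lam=40.0)
```

The soft-thresholding step is what sets small coefficients exactly to zero, which is why the L1 penalty yields sparse solutions while a squared (ridge) penalty only shrinks them.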
Multi-Task Extension
Multi-Task Linear Model:
Coefficients for k-th task:
Coefficient Matrix:
Input:
Output:
Coefficients for a variable (2nd)
Coefficients for a task (2nd)
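The formulas on this slide did not survive extraction. One common way to write a multi-task linear model is sketched below; the symbols N (samples), J (inputs), K (tasks), and E (noise) are assumptions chosen to match the β notation used on later slides.

```latex
Y = XB + E, \qquad
X \in \mathbb{R}^{N \times J}, \quad
Y \in \mathbb{R}^{N \times K}, \quad
B = [\beta_1, \dots, \beta_K] \in \mathbb{R}^{J \times K}
```

Under this convention, column k of B holds the coefficients for the k-th task, and row j holds the coefficients of input variable j across all tasks.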
Outline
• Background: Sparse multivariate regression for disease association studies
• Structured association: a new paradigm
  • Association to a graph-structured phenome: Graph-guided fused lasso (Kim & Xing, PLoS Genetics, 2009)
  • Association to a tree-structured phenome: Tree-guided group lasso (Kim & Xing, ICML 2010)
Multivariate Regression for Single-Trait Association Analysis
[Figure: Trait y (e.g., 2.1) = Genotype X × Association Strength β, with β unknown (?)]
Multivariate Regression for Single-Trait Association Analysis
Many non-zero associations: Which SNPs are truly significant?
[Figure: Trait (e.g., 2.1) = Genotype × Association Strength]
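A small sketch of why unpenalized regression leaves this question open, assuming hypothetical genotypes coded as minor-allele counts (0/1/2) and a single truly causal SNP. All names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_snps = 200, 20

# Hypothetical genotypes coded as minor-allele counts 0/1/2
X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
beta_true = np.zeros(n_snps)
beta_true[3] = 1.5  # only one SNP truly affects the trait
y = X @ beta_true + rng.normal(size=n_samples)

# Ordinary least squares on centered data (centering absorbs the intercept)
Xc = X - X.mean(axis=0)
yc = y - y.mean()
beta_ols, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

# Every SNP receives some non-zero association strength
n_nonzero = int(np.count_nonzero(beta_ols))
```

Least squares assigns a non-zero estimate to every SNP, so the estimates alone do not separate the causal SNP from noise.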
Lasso for Reducing False Positives (Tibshirani, 1996)
Lasso penalty for sparsity: $+ \sum_{j=1}^{J} |\beta_j|$
Many zero associations (sparse results), but what if there are multiple related traits?
[Figure: Trait (e.g., 2.1) = Genotype × Association Strength]
Multivariate Regression for Multiple-Trait Association Analysis
How to combine information across multiple traits to increase the power?
Association strength between SNP j and Trait i: βj,i
Lasso penalty: $+ \sum_{j,i} |\beta_{j,i}|$
[Figure: Traits (3.4, 1.5, 2.1, 0.9, 1.8) = Genotype × Association Strength. Trait labels: Allergy for cats, Allergy for roaches, Allergy in spring (Allergy); FEV, FEF (Lung physiology). Genotype annotations: LD, Synthetic lethal.]
Multivariate Regression for Multiple-Trait Association Analysis
Association strength between SNP j and Trait i: βj,i
Lasso penalty: $+ \sum_{j,i} |\beta_{j,i}|$
We introduce a graph-guided fusion penalty
[Figure: Traits (3.4, 1.5, 2.1, 0.9, 1.8) = Genotype × Association Strength, with traits grouped into Allergy and Lung physiology.]
Multiple-trait Association: Graph-Constrained Fused Lasso
Step 1: Thresholded correlation graph of phenotypes
ACGTTTTACTGTACAATT
Step 2: Graph-constrained fused lasso
Lasso Penalty
Graph-constrained fusion penalty
Fusion
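Step 1 can be sketched as follows, assuming traits are stored as columns of a matrix Y and an edge is kept when the absolute Pearson correlation exceeds a threshold. The function name, the threshold parameter `rho`, and the toy data are assumptions for illustration.

```python
import numpy as np

def thresholded_correlation_graph(Y, rho):
    """Return edges (m, l, r_ml) for trait pairs with |correlation| > rho."""
    R = np.corrcoef(Y, rowvar=False)  # traits are the columns of Y
    K = R.shape[0]
    return [(m, l, float(R[m, l]))
            for m in range(K) for l in range(m + 1, K)
            if abs(R[m, l]) > rho]

# Hypothetical traits: 0 and 1 share a common factor, 2 and 3 are independent
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
Y = np.hstack([base + 0.1 * rng.normal(size=(300, 2)),
               rng.normal(size=(300, 2))])
edges = thresholded_correlation_graph(Y, rho=0.5)
```

The resulting edge list is the graph that Step 2 uses to decide which pairs of traits have their association strengths fused.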
Fusion Penalty
Fusion Penalty: | βjk - βjm |
For two correlated traits (connected in the network), the association strengths may have similar values.
ACGTTTTACTGTACAATT
SNP j
Trait m
Trait k
Association strength between SNP j and Trait k: βjk
Association strength between SNP j and Trait m: βjm
ACGTTTTACTGTACAATT
Overall effect
Graph-Constrained Fused Lasso
• Fusion effect propagates to the entire network
• Association between SNPs and subnetworks of traits
ACGTTTTACTGTACAATT
Multiple-trait Association: Graph-Weighted Fused Lasso
Subnetwork structure is embedded as densely connected nodes with large edge weights
Edges with small weights are effectively ignored
Overall effect
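The constrained and weighted variants can be compared by writing out the objective they penalize. The sketch below assumes a squared-error loss, the lasso penalty, and a fusion penalty summed over graph edges. The sign(r) factor (so that negatively correlated traits fuse toward opposite-signed strengths) and the choice of |r| as the edge weight are assumptions based on the published GFlasso formulation, not on the slide text.

```python
import numpy as np

def gflasso_objective(X, Y, B, edges, lam, gamma, weighted=False):
    """Squared-error loss + lasso penalty + graph-guided fusion penalty.

    B[j, k] is the association strength between SNP j and trait k;
    edges is a list of (m, l, r_ml), with r_ml the trait correlation.
    """
    loss = np.sum((Y - X @ B) ** 2)
    lasso = lam * np.sum(np.abs(B))
    fusion = 0.0
    for m, l, r in edges:
        # Graph-constrained: every edge counts equally.
        # Graph-weighted: edges with small |r| contribute little.
        w = abs(r) if weighted else 1.0
        fusion += w * np.sum(np.abs(B[:, m] - np.sign(r) * B[:, l]))
    return loss + lasso + gamma * fusion
```

With `weighted=True`, weakly correlated trait pairs add almost nothing to the penalty, which matches the remark that edges with small weights are effectively ignored.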
Estimating Parameters
Quadratic programming formulation
• Graph-constrained fused lasso
• Graph-weighted fused lasso
Many publicly available software packages for solving convex optimization problems can be used
Improving Scalability
Iterative optimization
• Update βk
• Update djk’s, djml’s
Original problem
Equivalently
Using a variational formulation
Previous Works vs. Our Approach
PCA-based approach (Weller et al., 1996; Mangin et al., 1998)
• Previous approach: implicit representation of trait correlations; hard to interpret the derived traits
• Our approach: explicit representation of trait correlations
Extension of module network for eQTL study (Lee et al., 2009)
• Previous approach: average traits within each trait cluster; loss of information
• Our approach: original data for traits are used
Network-based approach (Chen et al., 2008; Emilsson et al., 2008)
• Previous approach: separate association analysis for each trait (no information sharing); single-trait associations are combined in light of trait network modules
• Our approach: joint association analysis of multiple traits
Simulation Results
• 50 SNPs taken from HapMap chromosome 7, CEU population
• 10 traits
[Figure panels: Trait Correlation Matrix; Thresholded Trait Correlation Network; True Regression Coefficients; Single SNP-Single Trait Test (significant at α = 0.01); Lasso; Graph-guided Fused Lasso. Axes: SNPs × Phenotypes. Color scale: no association to high association.]
Simulation Results
Asthma Association Study
543 severe asthma patients from the Severe Asthma Research Program (SARP)
Genotypes: 34 SNPs in IL4R gene
• 40kb region of chromosome 16
• Impute missing genotypes with PHASE (Li and Stephens, 2003)
Traits: 53 asthma-related clinical traits
• Quality of Life: emotion, environment, activity, symptom
• Family history: number of siblings with allergy, does the father have asthma?
• Asthma symptoms: chest tightness, wheeziness
Asthma and IL4R Gene
T cell
In normal individuals: allergen ignored by the immune system
Allergen (ragweed, grass, etc.)
Asthma and IL4R Gene
T cell
In asthma patients
Allergen (ragweed, grass, etc.)
Th2 cell
B cell
Immunoglobulin E (IgE) antibody
Inflammation!
• Airway in lung narrows
• Difficult to breathe
• Asthma symptoms
Interleukin-4 Receptor regulates IgE antibody production
IL4R Gene
Chr16: 27,325,251-27,376,098
IL4R
SNPs in intron
SNPs in exon
SNPs in promoter region
exons, introns
Asthma Trait Network
Subnetwork for quality of life
Subnetwork for lung physiology
Phenotype Correlation Network
Subnetwork for Asthma symptoms
Results from Single-SNP/Trait Test
[Figure panels: Trait Network (Phenotypes × Phenotypes); Single-Marker Single-Trait Test (SNPs × Phenotypes) with permutation test at α = 0.05 and at α = 0.01. Color scale: no association to high association.]
Lung physiology-related traits I
• Baseline FEV1 predicted value: MPVLung
• Pre FEF 25-75 predicted value
• Average nitric oxide value: online
• Body Mass Index
• Postbronchodilation FEV1, liters: Spirometry
• Baseline FEV1 % predicted: Spirometry
• Baseline predrug FEV1, % predicted
• Baseline predrug FEV1, % predicted
Q551R SNP
• Codes for amino-acid changes in the intracellular signaling portion of the receptor
• Exon 11
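The permutation test behind the single-marker single-trait baseline can be sketched as follows. The test statistic (absolute Pearson correlation between one SNP and one trait), the number of permutations, and the function name are assumptions; the slides do not state which statistic was used.

```python
import numpy as np

def permutation_pvalue(snp, trait, n_perm=2000, seed=0):
    """Permutation p-value for the association of one SNP with one trait."""
    rng = np.random.default_rng(seed)
    snp = np.asarray(snp, dtype=float)
    trait = np.asarray(trait, dtype=float)
    observed = abs(np.corrcoef(snp, trait)[0, 1])
    exceed = 0
    for _ in range(n_perm):
        # Shuffling the trait breaks any real SNP-trait link
        permuted = rng.permutation(trait)
        if abs(np.corrcoef(snp, permuted)[0, 1]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive
    return (exceed + 1) / (n_perm + 1)
```

A SNP-trait pair would then be called significant when this p-value falls below the chosen level, such as α = 0.05 or α = 0.01.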
Comparison of GFlasso with Others
[Figure panels: Trait Network (Phenotypes × Phenotypes); Single-Marker Single-Trait Test; Lasso; Graph-guided Fused Lasso (SNPs × Phenotypes). Color scale: no association to high association.]
Lung physiology-related traits I
• Baseline FEV1 predicted value: MPVLung
• Pre FEF 25-75 predicted value
• Average nitric oxide value: online
• Body Mass Index
• Postbronchodilation FEV1, liters: Spirometry
• Baseline FEV1 % predicted: Spirometry
• Baseline predrug FEV1, % predicted
• Baseline predrug FEV1, % predicted
Q551R SNP
• Codes for amino-acid changes in the intracellular signaling portion of the receptor
• Exon 11
Linkage Disequilibrium Structure in IL-4R gene
SNP Q551R
SNP rs3024660
SNP rs3024622
r² = 0.64
r² = 0.07
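For reference, the r² measure of linkage disequilibrium between two biallelic SNPs can be computed from phased haplotypes as sketched below. The 0/1 allele coding and the function name are assumptions; the slides do not say how the reported values were computed.

```python
import numpy as np

def ld_r2(hap_a, hap_b):
    """r^2 between two biallelic SNPs from 0/1 haplotype allele vectors."""
    a = np.asarray(hap_a, dtype=float)
    b = np.asarray(hap_b, dtype=float)
    p_a, p_b = a.mean(), b.mean()
    d = (a * b).mean() - p_a * p_b  # D = P(AB) - P(A) P(B)
    return d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
```

Values near 1 mean the two SNPs are almost always inherited together, so an association signal at one is hard to distinguish from a signal at the other; values near 0 mean they vary nearly independently.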
IL4R Gene
Chr16: 27,325,251-27,376,098
IL4R
SNPs in intron
SNPs in exon
SNPs in promoter region
exons, introns
Q551R
rs3024660
rs3024622
Association to a Tree-structured Phenome
TCGACGTTTTACTGTACAATT
TCGACGTTTTACTGTACAATT
Tree-guided Group Lasso
Why tree?
• Tree represents a clustering structure
• Scalability to a very large number of phenotypes
  • Graph: O(|V|^2) edges
  • Tree: O(|V|) edges
• Expression quantitative trait mapping (eQTL): agglomerative hierarchical clustering is a popular tool
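The scalability argument is simple arithmetic. The sketch below assumes the worst case of a fully connected trait graph and a binary tree of the kind produced by agglomerative hierarchical clustering; both function names are illustrative.

```python
def complete_graph_edges(n_traits):
    """Edges in a fully connected graph over n_traits nodes (worst case)."""
    return n_traits * (n_traits - 1) // 2

def binary_tree_edges(n_leaves):
    """Edges in a binary clustering tree with n_leaves leaves.

    n_leaves - 1 merges create n_leaves - 1 internal nodes, so the tree has
    2 * n_leaves - 1 nodes and one edge fewer than that.
    """
    return 2 * n_leaves - 2
```

For 1,000 phenotypes this gives 499,500 possible graph edges against 1,998 tree edges, which is why the tree scales to genome-wide expression data.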
Tree-Guided Group Lasso
In a simple case of two genes
• Low height h: tight correlation, joint selection
• Large height h: weak correlation, separate selection
[Figure: two-gene trees of height h over genotypes]
Tree-Guided Group Lasso
In a simple case of two genes
Select the child nodes jointly or separately?
Tree-guided group lasso
• L1 penalty: lasso penalty, separate selection
• L2 penalty: group lasso, joint selection
• Elastic net
[Figure: tree joining two genes at height h]
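For the two-gene case, one way to combine the two penalties is to weight them by the tree height. The sketch below assumes h is normalized to [0, 1], with weight h on the L1 (separate selection) part and 1 - h on the L2 (joint selection) part. This weighting is an assumption consistent with the low-height and large-height cases described for two genes, not a formula recovered from the slide.

```python
import math

def two_gene_tree_penalty(beta1, beta2, h):
    """Penalty on one SNP's coefficients for two genes merged at height h."""
    separate = abs(beta1) + abs(beta2)  # L1 part: lasso, separate selection
    joint = math.hypot(beta1, beta2)    # L2 part: group lasso, joint selection
    return h * separate + (1.0 - h) * joint
```

At h = 0 the penalty reduces to the group lasso on the pair, and at h = 1 it reduces to the plain lasso; intermediate heights interpolate between joint and separate selection.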
h2
h1
Select the child nodes jointly or separately?
Joint selection
Separate selection
Tree-Guided Group Lasso
For a general tree
Tree-guided group lasso
Previously, in Jenatton, Audibert & Bach, 2009
Balanced Shrinkage
Estimating Parameters
Second-order cone program
Many publicly available software packages for solving convex optimization problems can be used
Also, variational formulation
Illustration with Simulated Data
[Figure panels: True association strengths; Lasso; Tree-guided group lasso. Axes: SNPs × Phenotypes. Color scale: no association to high association.]
Yeast eQTL Analysis
[Figure panels: Hierarchical clustering tree; Single-Marker Single-Trait Test; Tree-guided group lasso. Axes: SNPs × Phenotypes. Color scale: no association to high association.]
Structured Association
Phenome Structure
• Graph: Graph-guided fused lasso (Kim & Xing, PLoS Genetics, 2009)
• Tree: Tree-guided group lasso (Kim & Xing, Submitted)
• Dynamic Trait: Temporally smoothed lasso (Kim, Howrylak, Xing, Submitted)
Genome Structure
• Linkage Disequilibrium: Stochastic block regression (Kim & Xing, UAI, 2008)
• Population Structure: Multi-population group lasso (Puniyani, Kim, Xing, Submitted)
• Epistasis: Group lasso with networks (Lee, Kim, Xing, Submitted)
Computation Time
Proximal Gradient Descent
Original Problem:
Approximation Problem:
Gradient of the Approximation:
Geometric Interpretation
Smooth approximation
Uppermost Line: Nonsmooth
Uppermost Line: Smooth
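The formulas for the smooth approximation did not survive extraction. One standard construction that fits the geometric picture is Nesterov-style smoothing of the absolute value: |x| is the upper envelope of the lines a * x for |a| <= 1, and subtracting a small quadratic in a rounds off the kink. Whether this matches the slide exactly is an assumption, and the smoothing parameter name `mu` is illustrative.

```python
import numpy as np

def smoothed_abs(x, mu):
    """max over |a| <= 1 of (a * x - 0.5 * mu * a**2), a smooth surrogate of |x|."""
    x = np.asarray(x, dtype=float)
    a = np.clip(x / mu, -1.0, 1.0)  # the maximizing a
    return a * x - 0.5 * mu * a ** 2

def smoothed_abs_grad(x, mu):
    """Gradient of smoothed_abs with respect to x, equal to the maximizing a."""
    return np.clip(np.asarray(x, dtype=float) / mu, -1.0, 1.0)
```

The surrogate is quadratic for |x| <= mu and equals |x| - mu / 2 elsewhere, so it stays within mu / 2 of |x| while having a gradient that plain gradient methods can use.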
Convergence Rate
Theorem: If we require and set , the
number of iterations is upper bounded by:
Remarks: the state-of-the-art IPM method for SOCP converges at a rate
Multi-Task Time Complexity
Pre-compute:
Per-iteration Complexity (computing gradient)
Tree:
Graph:
IPM for SOCP
Proximal-Gradient
IPM for SOCP
Proximal-Gradient
Proximal-Gradient:
• Independent of sample size
• Linear in the number of tasks
Experiments
Multi-task Graph Structured Sparse Learning (GFlasso)
Experiments
Multi-task Tree-Structured Sparse Learning (TreeLasso)
Conclusions
• Novel statistical methods for joint association analysis to correlated phenotypes
  • Graph-structured phenome: graph-guided fused lasso
  • Tree-structured phenome: tree-guided group lasso
• Advantages
  • Greater power to detect weak association signals
  • Fewer false positives
  • Joint association to multiple correlated phenotypes
• Other structures
  • In phenotypes: dynamic trait
  • In genotypes: linkage disequilibrium, population structure, epistasis
References
• Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of Royal Statistical Society, Series B 58:267–288.
• Weller J, Wiggans G, Vanraden P, Ron M (1996) Application of a canonical transformation to detection of quantitative trait loci with the aid of genetic markers in a multi-trait experiment. Theoretical and Applied Genetics 92:998–1002.
• Mangin B, Thoquet B, Grimsley N (1998) Pleiotropic QTL analysis. Biometrics 54:89–99.
• Chen Y, Zhu J, Lum P, Yang X, Pinto S, et al. (2008) Variations in DNA elucidate molecular networks that cause disease. Nature 452:429–35.
• Lee SI, Dudley A, Drubin D, Silver P, Krogan N, et al. (2009) Learning a prior on regulatory potential from eQTL data. PLoS Genetics 5:e1000358.
• Emilsson V, Thorleifsson G, Zhang B, Leonardson A, Zink F, et al. (2008) Genetics of gene expression and its effect on disease. Nature 452:423–28.
• Kim S, Xing EP (2009) Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics 5(8): e1000587.
• Kim S, Xing EP (2008) Sparse feature learning in high-dimensional space via block regularized regression. In Proceedings of the 24th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 325-332. AUAI Press.
• Kim S, Xing EP (2010) Exploiting a hierarchical clustering tree of gene-expression traits in eQTL analysis. Submitted.
• Kim S, Howrylak J, Xing EP (2010) Dynamic-trait association analysis via temporally-smoothed lasso. Submitted.
• Puniyani K, Kim S, Xing EP (2010) Multi-population GWA mapping via multi-task regularized regression. Submitted.
• Lee S, Kim S, Xing EP (2010) Leveraging genetic interaction networks and regulatory pathways for joint mapping of epistatic and marginal eQTLs. Submitted.
• Dunning AM et al. (2009) Association of ESR1 gene tagging SNPs with breast cancer risk. Hum Mol Genet. 18(6):1131-9.
• Esparza-Gordillo J et al. (2009) A common variant on chromosome 11q13 is associated with atopic dermatitis. Nature Genetics 41:596-601.
• Suzuki A et al. (2008) Functional SNPs in CD244 increase the risk of rheumatoid arthritis in a Japanese population. Nature Genetics 40:1224-1229.
• Dupuis J et al. (2010) New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nature Genetics 42:105-116.