Learning the cis regulatory code by predictive modeling of gene regulation
(MEDUSA)
Christina Leslie
Center for Computational Learning Systems
Columbia University, NY, USA
http://www.cs.columbia.edu/compbio/medusa
Transcriptional Regulation
[Figure: nuclear membrane; binding site/motif CCG__CCG; genome-wide mRNA transcript data (e.g. microarrays)]
Learning problems:
• Understand which regulators control which target genes
• Discover motifs representing regulatory elements
Previous work: Clustering
• Cluster-first motif discovery
– Cluster genes by expression profile, annotation, … to find potentially coregulated genes
– Find overrepresented motifs in promoter sequences of similar genes (algorithms: MEME, Consensus, Gibbs sampler, AlignACE, …)
(Spellman et al. 1998)
Previous work: “Structure learning”
• Graphical models (and other methods)
– Learn structure of “regulatory network”, “regulatory modules”, etc.
– Fit interpretable model to training data
– Model small number of genes or clusters of genes
– Many computational and statistical challenges; often used for qualitative hypotheses rather than prediction
(Segal et al. 2003, 2004)
(Pe’er et al. 2001)
Our work: “Predictive modeling”
• MEDUSA = Motif Element Discrimination Using Sequence Agglomeration
What is the prediction problem?
– Predict up/down regulation of target genes under different experimental conditions
Key ideas:
– Learn motifs and identify regulators that predict differential expression in different contexts (mechanistic inputs)
– Obtain single model for all genes and all experiments: context-specific, no clusters, no parameter tuning
– Accurate predictions on test data
M. Middendorf, A. Kundaje, M. Shah, Y. Freund, C. Wiggins, C. Leslie. Motif Discovery through Predictive Modeling of Gene Regulation. RECOMB 2005.
MEDUSA: Different view of training data
Learn regulatory program that makes genome-wide, context-specific predictions for differential (up/down) expression of target genes
Boosting (Freund & Schapire 1995)
[Figure: distribution over training data → weak rule → updated weights]
• Minimize exponential loss function:
  $Z_t = \sum_{g,e} w_{ge}^{t} \exp(-\alpha_t y_{ge} h_t(x_{ge}))$
• Updated weights:
  $w_{ge}^{t+1} = w_{ge}^{t} \exp(-\alpha_t y_{ge} h_t(x_{ge})) / Z_t$
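The weight update on the boosting slide can be sketched in a few lines of Python. This is a minimal illustration, not the MEDUSA implementation: the function name `boosting_round`, the toy data, the restriction to weak rules that output only +1/-1, and the standard AdaBoost choice of alpha are all assumptions made for the example.

```python
import math

def boosting_round(weights, labels, predictions):
    """One AdaBoost-style reweighting step (hypothetical helper).

    weights     -- current weights w^t over (gene, experiment) examples, summing to 1
    labels      -- true labels y in {+1, -1}
    predictions -- weak-rule outputs h_t(x) in {+1, -1}
    Returns (alpha_t, Z_t, new_weights).
    """
    # Weighted error of the weak rule under the current distribution.
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    # Standard AdaBoost choice of alpha (assumption); requires 0 < err < 1.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified examples are scaled up, correct ones scaled down.
    unnormalized = [w * math.exp(-alpha * y * h)
                    for w, y, h in zip(weights, labels, predictions)]
    z = sum(unnormalized)  # the normalizer Z_t, i.e. the exponential loss
    new_weights = [u / z for u in unnormalized]
    return alpha, z, new_weights

# Toy data: four examples, the weak rule gets the last one wrong.
weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
predictions = [+1, +1, -1, +1]
alpha, z, new_weights = boosting_round(weights, labels, predictions)
```

After the update the single misclassified example carries half of the total weight, so the next weak rule is pushed to get it right.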
MEDUSA’s weak learner
Promoter sequence:
…AGCTATGCCATCGACTGCTCCAGTCGCACACACAAAGATTTGAG
GCTATAGCTACTTTATAAAGGGGCTACGGCAAATT…
k-mers (k≤7): AGCTATG, GCTATGC, CTATGCC
dimers (gapped elements): TTT_AAA, GCTA_GCTA
Regulator expression
• Is AGCTATG present and USV1 up?
• Is AGCTATG present and USV1 down?
• Is GCTATGC present and USV1 up?
• Is GCTATGC present and TPK1 up?
• …
Try all motif-regulator pairs as weak rules and keep the one that minimizes boosting loss, e.g.:
Is GCTATGC present and USV1 up?
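The search over motif-regulator pairs can be sketched as follows. This is a toy illustration under stated assumptions, not MEDUSA's code: the helper names, the toy promoters, regulator states and labels are invented, only 7-mers are enumerated (no dimers), and the loss uses the standard formula for an abstaining weak rule, Z = W0 + 2·sqrt(W+ · W−), which the slides do not spell out.

```python
import math

def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def rule_loss(motif, regulator, state, promoters, reg_expr, labels, weights):
    """Exponential loss Z of the rule 'motif present AND regulator in state'.

    The rule abstains when its condition is false. W+ / W- is the weight of
    covered examples labelled +1 / -1 and W0 is the abstained weight.
    """
    w_pos = w_neg = w_abstain = 0.0
    for (gene, exp), y in labels.items():
        w = weights[(gene, exp)]
        fires = motif in promoters[gene] and reg_expr[(regulator, exp)] == state
        if not fires:
            w_abstain += w
        elif y > 0:
            w_pos += w
        else:
            w_neg += w
    return w_abstain + 2.0 * math.sqrt(w_pos * w_neg)

def best_weak_rule(promoters, reg_expr, labels, weights, k=7):
    """Try every (motif, regulator, state) triple and return the lowest-loss one."""
    motifs = set().union(*(kmers(s, k) for s in promoters.values()))
    regulators = {r for r, _ in reg_expr}
    candidates = [(m, r, s) for m in sorted(motifs)
                  for r in sorted(regulators) for s in (+1, -1)]
    return min(candidates,
               key=lambda c: rule_loss(*c, promoters, reg_expr, labels, weights))

# Hypothetical toy data: two genes, two regulators, three experiments.
promoters = {"g1": "AGCTATGCC", "g2": "TTTTAAAGG"}
reg_expr = {("USV1", "e1"): +1, ("USV1", "e2"): +1, ("USV1", "e3"): -1,
            ("TPK1", "e1"): +1, ("TPK1", "e2"): -1, ("TPK1", "e3"): -1}
labels = {("g1", "e1"): +1, ("g1", "e2"): +1, ("g1", "e3"): -1,
          ("g2", "e1"): -1, ("g2", "e2"): +1, ("g2", "e3"): -1}
weights = {key: 1.0 / len(labels) for key in labels}

best = best_weak_rule(promoters, reg_expr, labels, weights)
best_loss = rule_loss(*best, promoters, reg_expr, labels, weights)
```

In this toy setup the winning rule pairs a 7-mer from the first promoter with "USV1 up", because it covers two up-regulated examples and no down-regulated ones. Several 7-mers from the same promoter tie on loss; `min` simply returns the first in sorted order.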
Hierarchical sequence agglomeration
[Figure: candidate weak rules arranged along a boosting-loss axis]
• Is GCTATGC present and USV1 up?
• Is GCAATGC present and USV1 up?
• Is TCTATGC present and USV1 up?
• Is GCTTTGC present and USV1 up?
• …
Agglomerate: GCTATGC, GCAATGC, GGTATGC, CCTAAGC, GCTATTT, …, GGTATGG, … → PSSMs
Optimize over offsets when merging k-mers/PSSMs:
  - - GCTATGC
  GCTATTT - -
New candidate weak rules, one per PSSM (PSSM images not preserved):
• Is [PSSM] present and USV1 up?
• Is [PSSM] present and USV1 up?
• Is [PSSM] present and USV1 up?
• …
Minimize boosting loss → final weak rule
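The offset optimization when merging two k-mers can be illustrated with a short sketch. The slides do not say which similarity score drives the merge, so the match-count score below is purely an assumption, as are the function names, the `max_shift` parameter and the second example sequence (a 7-mer taken from the promoter shown on the weak-learner slide).

```python
from collections import Counter

def best_offset(a, b, max_shift=2):
    """Shift of b relative to a, in [-max_shift, max_shift], with the most matches."""
    def matches(shift):
        return sum(1 for i, ch in enumerate(b)
                   if 0 <= i + shift < len(a) and a[i + shift] == ch)
    return max(range(-max_shift, max_shift + 1), key=matches)

def merge_to_counts(a, b, max_shift=2):
    """Align b to a at the best offset and return per-column nucleotide counts.

    The list of Counters is a crude stand-in for a PSSM: normalizing each
    column would give position-specific frequencies.
    """
    shift = best_offset(a, b, max_shift)
    start = min(0, shift)                  # leftmost column of the alignment
    end = max(len(a), shift + len(b))      # one past the rightmost column
    columns = [Counter() for _ in range(end - start)]
    for i, ch in enumerate(a):
        columns[i - start][ch] += 1
    for i, ch in enumerate(b):
        columns[i + shift - start][ch] += 1
    return shift, columns

# GCTATGC and TATGCCA overlap best when the second is shifted right by 2:
#   GCTATGC - -
#   - - TATGCCA
shift, columns = merge_to_counts("GCTATGC", "TATGCCA")
```

Repeating such merges on the closest pair at each step, and merging count matrices the same way, would build up a hierarchy of PSSMs from the initial k-mers.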
MEDUSA strong rule
• Combine weak rules into a tree structure
• Alternating decision tree = margin-based generalization of decision trees [Freund & Mason 1999]
• Lower nodes are conditionally dependent on higher nodes, which can possibly reveal combinatorial interactions
• Able to reveal motifs specific to subsets of target genes
• Able to learn any boolean function
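How an alternating decision tree turns weak rules into one prediction can be sketched as below: the score is the sum of the prediction values on every path the example can reach, and its sign is the up/down call. The class name `Splitter`, the feature encoding of an example and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Splitter:
    """A weak rule plus the prediction values and child rules hanging off it."""
    condition: Callable[[dict], bool]
    yes_value: float
    no_value: float = 0.0
    yes_children: List["Splitter"] = field(default_factory=list)
    no_children: List["Splitter"] = field(default_factory=list)

def adt_score(root_value, splitters, x):
    """Sum prediction values over all reachable nodes; sign(score) is the call."""
    score = root_value
    stack = list(splitters)
    while stack:
        node = stack.pop()
        if node.condition(x):
            score += node.yes_value
            stack.extend(node.yes_children)   # children only count if the parent fired
        else:
            score += node.no_value
            stack.extend(node.no_children)
    return score

def rule(motif, regulator, state):
    """Weak rule 'motif present AND regulator in the given state' (state: 'up'/'down')."""
    return lambda x: motif in x["motifs"] and regulator in x[state]

# Hypothetical tree: one rule nested under another, plus a second root-level rule.
nested = Splitter(rule("TTT_AAA", "TPK1", "down"), yes_value=-0.5)
tree = [Splitter(rule("GCTATGC", "USV1", "up"), yes_value=0.8, yes_children=[nested]),
        Splitter(rule("AAATTT", "USV1", "down"), yes_value=-0.6)]

x1 = {"motifs": {"GCTATGC"}, "up": {"USV1"}, "down": set()}
x2 = {"motifs": {"GCTATGC", "TTT_AAA"}, "up": {"USV1"}, "down": {"TPK1"}}
x3 = {"motifs": {"TTT_AAA"}, "up": set(), "down": {"TPK1"}}
scores = [adt_score(0.1, tree, x) for x in (x1, x2, x3)]
```

The nested rule contributes for `x2` but not for `x3`, even though its own condition holds in both cases. This is the conditional dependence of lower nodes on higher nodes that the slide mentions.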
Yeast Environmental Stress Response
• Gasch et al. (2000) dataset, 173 microarrays, 13 environmental stresses
• ~5500 target genes, 475 regulators (237 TF + 250 SM)
• 500bp upstream promoter sequences
• Binning into +1/0/-1 expression levels based on wildtype vs. wildtype noise
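The binning step might look roughly like this. The slide only says the bins are based on wildtype vs. wildtype noise; the use of a quantile of absolute log ratios as the threshold, the 0.95 default, the function names and the toy numbers are all assumptions for the sake of a concrete sketch.

```python
def noise_threshold(wt_vs_wt_log_ratios, quantile=0.95):
    """Threshold below which a log ratio is treated as noise (assumed rule)."""
    ranked = sorted(abs(v) for v in wt_vs_wt_log_ratios)
    index = min(len(ranked) - 1, int(quantile * len(ranked)))
    return ranked[index]

def discretize(log_ratio, threshold):
    """Map an expression log ratio to +1 (up), -1 (down) or 0 (baseline)."""
    if log_ratio > threshold:
        return 1
    if log_ratio < -threshold:
        return -1
    return 0

# Hypothetical wildtype vs. wildtype log ratios: any spread here is pure noise.
noise = [-0.2, 0.1, 0.05, -0.3, 0.25, -0.1, 0.15, 0.0, -0.05, 0.2]
threshold = noise_threshold(noise)
binned = [discretize(v, threshold) for v in (1.2, -0.8, 0.1, 0.3)]
```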
Statistical validation
• 10-fold cross-validation (held-out experiments), ~60,000 (gene,experiment) training examples, 700 iterations
• (N_k-mers + N_dimers + N_PSSMs) × N_reg × 2 ≈ 10^7 possible weak rules at every node
• MEDUSA’s motifs give better prediction accuracy on held-out experiments than database motifs
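Cross-validation with held-out experiments means that every (gene, experiment) example from a held-out microarray goes to the test set together. A minimal sketch follows, assuming a simple shuffled round-robin assignment of experiments to folds; the function names, seed and experiment identifiers are hypothetical.

```python
import random

def experiment_folds(experiments, n_folds=10, seed=0):
    """Partition experiment ids into n_folds groups of near-equal size."""
    shuffled = list(experiments)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

def split_examples(examples, held_out):
    """Send every (gene, experiment) pair of a held-out experiment to the test set."""
    held = set(held_out)
    train = [(g, e) for g, e in examples if e not in held]
    test = [(g, e) for g, e in examples if e in held]
    return train, test

# 173 microarrays as in the Gasch et al. dataset; gene names are placeholders.
experiments = [f"exp{i}" for i in range(173)]
folds = experiment_folds(experiments)
examples = [(g, e) for g in ("g1", "g2") for e in experiments]
train, test = split_examples(examples, folds[0])
```

Splitting by experiment rather than by individual example keeps all measurements from one array on the same side of the split, so test accuracy reflects prediction under unseen conditions.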
Yeast ESR: Biological Validation
Important regulators identified by MEDUSA
Cellular localization of MSN2/4
Segal et al. 2003
Universal stress repressor
Visualizing MEDUSA motifs
[Figure: motif images, panels numbered 1, 2, 3, 5, 8, 14, 16; two panels are labeled AAATTT and TAAGGG]
Biological validation – Context-specific analysis
• Restrict regulatory program to particular target genes T, experimental conditions E → smaller model
• Further statistical pruning of features using margin-based score: $\sum_{g \in T, e \in E} y_{ge} (F(x_{ge}) - F_{-f}(x_{ge}))$
• Identify most significant context-specific regulators and motifs for target set
• Example: oxygen sensing and regulation in yeast (collaborator: Li Zhang)
Biological validation – Network inference
• Regulator-motif associations in nodes can have different meanings:
[Figure: three diagrams labeled Direct binding, Indirect effect, Co-occurrence]
• Need other data to confirm binding relationship between regulator and target (e.g. ChIP chip)
• Still, can determine statistically significant regulator-target relationships from regulation program
Discussion: What does “predictive” mean?
At least 2 usages:
• Makes accurate quantitative predictions
– Can assess predictions statistically, i.e. on test data
– Gives us confidence that model contains biologically relevant information
vs.
• Generates biological hypotheses
– Without statistical validation, can only evaluate quality of hypotheses through experiments
– Issues: How much of model is correct? How many false positives? Is a network “edge” a meaningful prediction? (Cf. DREAM initiative)
Discussion: “Predictive” modeling
• “Manifesto”
– We’re interested in hypothesis generation, but still must give statistical validation on test data, i.e. show that the model is not overfitting
– Not enough to show that model is non-random, e.g. good p-values for functional enrichment
• Possible goal: move towards making useful predictions for actual wet-lab experiments (e.g. fewer input variables in model)
• MEDUSA: statistically predictive model that can still be interpreted to extract biological hypotheses
Ongoing MEDUSA-related projects
• Oxygen sensing and regulation in yeast (collaborator: Li Zhang, Public Health @ Columbia)
• Regulation of and by microRNAs in humans (collaborators: Sander group, Sloan Kettering)
• Sequence information controlling tissue-specific alternative splicing (collaborator: Larry Chasin, Biology @ Columbia)
• Integration of phosphorylation (“kinome”) data to reconstruct signaling pathways
• New Java MEDUSA software package – soon to be released
http://www.cs.columbia.edu/compbio/medusa
Thanks
• Manuel Middendorf (Physics)
• Anshul Kundaje (CS)
• David Quigley (DBMI)
• Steve Lianoglou (CS)
• Xuejing Li (Physics)
• Mihir Shah (CS)
• Marta Arias (CCLS)
• Chris Wiggins (APAM)
• Yoav Freund (CS@UCSD)
Funding: NIH (MAGNet NCBC grant)
Biological validation – Binding data
• ChIP chip: genome-wide protein-DNA binding data, i.e. which promoters are bound by a TF?
• Investigate regulatory network model: use ChIP chip data in place of motifs (no motif discovery)
– Features: (regulator, TF-occupancy) pairs
[Figure labels: TF, P1, P2]
Biological validation – Target gene analysis
• Restrict to target genes = protein chaperones; experiments = heat shock, hypo/hyper-osmolarity
– CMK2 with HSF1 occupancy (CaMKII mammalian ortholog interacts with HSF1)
Biological validation – Signaling molecules
• Find all SMs that associate as regulators with a particular TF’s ChIP occupancy in ADT features
• e.g. Hsf1 with Gac1, Gip1, Sds22 (Glc7 phosphatase complex) [Figure legend: TF, SM, mRNA]
• Hypothesis: Glc7 phosphatase complex interacts with Hsf1 in regulation of Hsf1 targets (interaction supported in literature)
Update: Protein fold recognition
• SVM classifiers with string kernels for remote homology detection, fold recognition
[Figure: protein sequence (YPNTDIGDPSYPHIGIDIKSVRSKKTAKWNMQNGK) → profile → k-mer based kernel computation → SVM → prediction of structural class]
R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, C. Leslie. Remote homology detection and motif extraction using profile-based string kernels. JBCB 2005.