Drosophila modENCODE Data Integration
Manolis Kellis on behalf of:
modENCODE Analysis Working Group (AWG)
modENCODE Data Analysis Center (DAC)
MIT Computer Science & Artificial Intelligence Laboratory
Broad Institute of MIT and Harvard
mod/ENCODE: (aka. everything you wanted to know about gene regulation but were afraid to ask)
The challenge ahead
Anterior-Posterior
Dorsal-Ventral
Annotations & images for all expression patterns
Expression domain primitives reveal underlying logic
Binding sites of every developmental regulator
GAF, check
Su(Hw), check
BEAF-32, variant
Mod(mdg4), novel
CP190, novel
CTCF, check
Sequence motifs for every regulator
Understand regulatory logic specifying development
The components of genomes and gene regulation
Goal: A systems-level understanding of genomes and gene regulation:
• The regulators: TFs, GFs, miRNAs, their specificities
• The regions: enhancers, promoters, insulators
• The targets: individual regulatory motif instances
• The grammars: combinations predictive of tissue-specific activity
The parts list = Building blocks of gene regulation
Our tools: Comparative genomics & large-scale experimental datasets.
• Evolutionary signatures for promoter/enhancer/3’UTR motif annotation
• Chromatin signatures for integrating histone modification datasets
• Sequence signatures associated with TF binding, chromatin, dynamics
• Infer regulatory networks, their temporal and spatial dynamics
Integrate diverse datasets
Outline
1. Annotate regulatory regions
– Promoters, enhancers, insulators
2. Annotate chromatin states
– De novo learning of chromatin mark combinations
3. Predict TF/Chromatin binding
– Sequence -> TFs -> Chromatin -> Expression
4. Infer regulatory networks
– Integrate motifs, expression, chromatin
5. Predictive models of gene expression
– Chromatin/expression time-course
– Embryo expression domains
Annotate Regulatory Regions
Promoters, enhancers, insulators
TFs and Chromatin together define enhancer regions
• Evaluate predictive power of TFs/GFs/Chromatin marks in recovery of known enhancers (REDfly, Furlong)
• Combinations across features show max performance
• New enhancers also supported by patterning, motifs
Rachel Sealfon, Chris Bristow
Enrichment in individual features
All features + conservation
All features
Chromatin marks + Remodeling factors
TFs
TFs + Remodeling
Combinations of features improve performance
Fraction of predictions that are true (precision)
Number of true enhancers recovered (recall)
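The two axes above (precision against number of true enhancers recovered) can be traced by ranking candidate regions by score and walking down the list. A minimal sketch, assuming each candidate has a classifier score and a flag for overlap with a known enhancer; the function name and the toy scores are illustrative, not taken from the analysis:

```python
def precision_recall_curve(scores, labels):
    """Return (true_enhancers_recovered, precision) at each rank cutoff.

    scores: hypothetical classifier scores, one per candidate region.
    labels: True if the region overlaps a known enhancer, else False.
    """
    # Rank candidates from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    curve = []
    true_found = 0
    for n_predicted, (_, is_true) in enumerate(ranked, start=1):
        if is_true:
            true_found += 1
        # Precision = fraction of the top n_predicted calls that are true.
        curve.append((true_found, true_found / n_predicted))
    return curve


# Toy data (made up for illustration).
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [True, False, True, True, False]
curve = precision_recall_curve(scores, labels)
```

Comparing such curves for each feature set (TFs only, chromatin marks only, all features) is one way to see whether combinations improve performance.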
Chromatin marks reveal novel/refined promoters
• Chromatin-based annotation of active promoter regions
• Reveal microRNA precursors, lowly-expressed genes, alternate starts
• Reveal promoter regions even in absence of CAGE/RACE datasets
Combine shape and intensity information from six chromatin marks, CBP, PolII
Datasets: positive, negative
Predictions confirmed w/TSS expression, even when CAGE/RACE data is missing
Chris Bristow
Previously-annotated TSS
Chromatin-based promoter prediction
Transcript support from multiple stages
No CAGE/RACE evidence
Annotate Chromatin States
De novo learning of mark combinations
De novo chromatin states from mark combinations
•Learn de novo significant combinations of chromatin marks
•Reveal functional elements, even without looking at sequence
•Use for genome annotation
•Use for studying regulation dynamics in different cell types
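As a rough illustration of what a "combination of chromatin marks" means, the sketch below counts how often each set of marks co-occurs across genomic bins and keeps the frequent ones. This is a deliberate simplification and an assumption on our part: the slides do not specify the learning method, and a real state model would also account for spatial context. Mark names other than H3K27me3 are illustrative.

```python
from collections import Counter


def frequent_mark_combinations(bins, min_count):
    """Count co-occurring mark sets across bins; keep those seen >= min_count times.

    bins: list of sets, each holding the marks called present in one genomic bin
          (hypothetical binarized input).
    """
    counts = Counter(frozenset(marks) for marks in bins)
    return {combo: n for combo, n in counts.items() if n >= min_count}


# Toy binarized data (made up for illustration).
bins = [
    {"H3K4me3", "H3K9ac"},
    {"H3K4me3", "H3K9ac"},
    {"H3K27me3"},
    {"H3K36me3"},
    {"H3K27me3"},
    set(),
]
combos = frequent_mark_combinations(bins, min_count=2)
```

Each surviving combination is a candidate "state" that can then be compared against annotations without looking at sequence.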
Promoter states
Transcribed states
Active Intergenic
Repressed
Jason Ernst
Each chromatin state associated w/ distinct function
• Reveals several classes of promoters, enhancers
• Distinct marks in transcripts, exons/introns, 5’/3’ UTRs
• Distinguish inactive, repressed, heterochromatin
Tentative annotations
Jason Ernst, Gary Karpen
Frequency of each chromatin mark
20 different chromatin states
Annotation enrichments
Positional enrichments of each chromatin state
Jason Ernst, Gary Karpen
Functional enrichments of different chromatin states
• Developmental patterning regulators enriched in specific states
• Different general factors associated with active/repressed states
• Insulator proteins associated with wide range of chromatin marks
• Replication origins associated with promoter/enhancer regions
• Specific regulatory motifs associated with enhancer/repressed regions
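The enrichments listed above are typically expressed as fold enrichment: the fraction of a state covered by an annotation, divided by the fraction expected from the annotation's genome-wide coverage. A minimal sketch under that assumption, with the genome discretized into numbered bins (the function name and the numbers are hypothetical):

```python
def fold_enrichment(state_bins, annotated_bins, total_bins):
    """Observed / expected overlap between a chromatin state and an annotation.

    state_bins, annotated_bins: iterables of bin indices (hypothetical inputs).
    total_bins: number of bins in the genome.
    """
    state_bins = set(state_bins)
    annotated_bins = set(annotated_bins)
    # Fraction of the state's bins that carry the annotation.
    observed = len(state_bins & annotated_bins) / len(state_bins)
    # Fraction of all bins that carry the annotation.
    expected = len(annotated_bins) / total_bins
    return observed / expected


# Toy example: a state covering 10 bins, an annotation covering 20 of 100 bins,
# and 5 bins shared between them.
enrichment = fold_enrichment(range(0, 10), range(5, 25), total_bins=100)
```

Values above 1 indicate enrichment of the annotation in the state; values below 1 indicate depletion.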
DV regulators AP regulators General TFs Insulators Replication Motifs
Analysis: Jason Ernst, Pouya Kheradpour
Data: David MacAlpine, Kevin White, Gary Karpen
Predictive models of TF/Chromatin
Sequence -> TFs -> Chromatin -> Expression
Transcription Factor binding highly combinatorial
• Extensive cross-enrichment suggests cross-talk between motifs of different TFs
• Enriched and depleted motifs predictive of TF binding
• TF binding prediction increases with motif combinations
• Both synergistic and antagonistic effects
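One simple way to combine enriched and depleted motifs into a single binding score is to add the log fold enrichment of every motif found in a region, so enriched motifs raise the score and depleted ones lower it. The sketch below assumes that additive scheme purely for illustration; it is not stated to be the method behind the slide, and the motif names and fold values are made up.

```python
import math


def binding_score(motifs_present, fold_enrichment):
    """Sum of log2 fold enrichments over the motifs found in a region.

    motifs_present: set of motif names found in the region (hypothetical).
    fold_enrichment: dict mapping motif name -> fold enrichment in bound regions;
                     values > 1 are enriched, values < 1 are depleted.
    """
    return sum(
        math.log2(fold_enrichment[m]) for m in motifs_present if m in fold_enrichment
    )


# Made-up enrichments: two enriched motifs and one depleted motif.
folds = {"motifA": 4.0, "motifB": 2.0, "motifC": 0.25}
score_synergy = binding_score({"motifA", "motifB"}, folds)
score_antagonism = binding_score({"motifA", "motifC"}, folds)
```

Under this scheme a depleted motif can cancel an enriched one, which is one reading of the antagonistic effects noted above.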
Motif enrichment
Transcription factor binding
Pouya Kheradpour, Rachel Sealfon
Fold enrichment (color scale: 2^-4 to 2^4)
Top motif
Known motif
Top 5 motifs
Top/bottom 5 motifs
All above 1.5-fold
All above/below 1.5-fold
All motifs
[Heatmap of numeric values, 20 columns per row; row and column labels not preserved]
Combinations of TFs predictive of chromatin states
Trx in enhancer states
BEAF/Chro in TSS for ubiquitous genes
Polycomb states enriched for enhancers
AP-state 60-fold enriched in enhancers
Ubiquitous genes enriched for multiple states
Strong Su(Hw) in Negative outside promoter states
• Spatial clustering of TF combinations
• Compare to chromatin states (clusters of chromatin marks)
• TF sets and chromatin states highly predictive of each other
Jason Ernst, Chris Bristow
Chromatin strong predictor of expression state, not level
• Gene expression level distribution largely bimodal
• Predict presence/absence: chromatin marks in promoter region are a very strong predictor (AUC>0.98)
• Predict expression magnitude: only ~60% of variation explained by promoter marks; many other levels of regulation
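For reference, the AUC quoted above can be read as the probability that a randomly chosen expressed gene receives a higher score than a randomly chosen silent gene. A minimal sketch of that pairwise definition, with made-up scores (the scoring model itself is not shown and is assumed to exist):

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: hypothetical predictor outputs, one per gene.
    labels: True if the gene is expressed, False otherwise.
    """
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 on a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data: one expressed gene (score 0.4) ranks below a silent gene (score 0.5).
scores = [0.9, 0.4, 0.6, 0.5, 0.1]
labels = [True, True, True, False, False]
value = auc(scores, labels)
```

An AUC of 1.0 means every expressed gene outranks every silent gene; 0.5 is chance level.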
Peter Kharchenko, Peter Park
Inferring regulatory networks
Integrate motifs, expression, chromatin
3. Data integration for improved network prediction
TF Target
Input features used:
• Conserved TF motif in target
• ChIP binding of TF in target
• TF/target co-chromatin marks
• TF/target co-expression
Training set:
• Edges found in REDfly network
Test set:
• Cross-validation
Daniel Marbach, Sushmita Roy, Patrick Meyer, Rogerio Candeias
Integration improves precision and recall
• Linear/logistic regression best, similar to each other; use logistic regression
• Predictive power of individual features:
– Best: Evolutionarily-conserved motifs
– Next: chromatin time-course, ChIP-chip for TFs
– Next: chromatin cell-lines, expression data (RNA-seq and microarrays)
• Conclusion: Experimental datasets together dramatically improve performance
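To make the integration step concrete, the sketch below trains a small logistic regression that maps four per-edge features (conserved motif, ChIP binding, co-chromatin marks, co-expression) to the probability that a TF-target edge is real. It is a toy under stated assumptions: the feature encoding as 0/1 values, the training rows, the learning rate, and the function names are all hypothetical, and plain gradient descent stands in for whatever solver was actually used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this candidate edge.
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x):
    """Probability that the edge described by feature vector x is a true edge."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Columns (hypothetical encoding): motif, ChIP, co-chromatin, co-expression.
rows = [
    [1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 1, 0],   # known edges
    [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1],   # non-edges
]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic(rows, labels)
```

In practice the labels would come from a curated edge set and performance would be measured by cross-validation, as listed above, rather than on the training rows.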
Comparison of integration methods
~10% recovery at ~40% precision
~60% recovery at ~20% precision
Comparison of individual features
Daniel Marbach, Sushmita Roy, Patrick Meyer, Rogerio Candeias
• ChIP-grade quality
– Similar functional enrichment
– High sens. High spec.
• Systems-level
– 81% of Transc. Factors
– 86% of microRNAs
– 8k + 2k targets
– 46k connections
• Lessons learned
– Pre- and post- are correlated (hihi/lolo)
– Regulators are heavily targeted, feedback loop
Initial regulatory network for an animal genome
Pouya Kheradpour, Sushmita Roy, Alex Stark
Predictive models of gene regulation
Chromatin/expression time-course
Embryo expression domains
1. Chromatin time-course reveals stage regulators
Fold enrichment or over expression
• abd-A motif is enriched in new H3K27me3 regions at L2
– Coincides with a drop in the expression of abd-A
– Model: sites gain H3K27me3 as abd-A binding lost
• Additional intriguing stories found, to be explored
H3K27me3
Pouya Kheradpour
2. Predicting changes in time-series expression
• Integrate TF-target motif associations with time-course
• Predict positive/negative regulators at each split
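[Figure: tree of expression time-course splits, each labeled with predicted regulators]
Regulators: Adf1, Trl, Vnd, Tin, Abd-A, Hmx, CG11085, CG34031, En, Mad, Grh, Btd, Abd-B, Ftz, Antp, …
Split labels include: Adf1, Trl, E2F, Dref, Kr, gt, sna, esg, tin, vnd, byn, Inv, Twi, …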
Notice: Adf1 targets appear positively then negatively regulated. Consistent with changes in Adf1 expression (not an input to model)
Adf1 activator is ON (targets induced)
Adf1 activator is OFF (targets not induced)
Jason Ernst
[Figure: Target Prediction Coefficients. Labels: en, bap, tin, Mef2, twi, Snail; weights w0 to w5; Embryo]
Predictive power of inferred network
• Predict target expression as a linear combination of TFs; fit the weights w_i
• Future: can motif grammars predict weights directly?
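The linear model above can be written as target = w0 + w1*TF1 + ... + wk*TFk, with the weights fitted to observed expression. A minimal sketch assuming a least-squares fit by gradient descent; the fitting procedure, the two generic TFs, and the toy values are assumptions for illustration, not the procedure used for the embryo images.

```python
def fit_linear(tf_levels, target, lr=0.1, epochs=5000):
    """Least-squares fit of target ~ w0 + sum_i w_i * TF_i by gradient descent.

    tf_levels: list of tuples, TF expression at each embryo position (hypothetical).
    target: observed target-gene expression at the same positions.
    Returns [w0, w1, ..., wk].
    """
    n_tfs = len(tf_levels[0])
    weights = [0.0] * (n_tfs + 1)  # weights[0] is the intercept w0
    n = len(target)
    for _ in range(epochs):
        grads = [0.0] * (n_tfs + 1)
        for x, y in zip(tf_levels, target):
            residual = weights[0] + sum(w * v for w, v in zip(weights[1:], x)) - y
            grads[0] += residual
            for j, v in enumerate(x, start=1):
                grads[j] += residual * v
        # Gradient of the mean squared error is (2 / n) * sum(residual * feature).
        weights = [w - lr * 2 * g / n for w, g in zip(weights, grads)]
    return weights


# Toy data generated from known weights: one activator (+0.8), one repressor (-0.5).
tf_levels = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
target = [0.1 + 0.8 * a - 0.5 * b for a, b in tf_levels]
fitted = fit_linear(tf_levels, target)
```

A positive fitted weight corresponds to an activating contribution and a negative weight to a repressing one.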
Snail, stages 4 to 6
Charlie Frogner, Tom Morgan, Lorenzo Rosasco
Additional examples: striped, changing coeffs
[Figure: Target Prediction Coefficients. Labels: Adf1, sna, cad, twi, bcd, hb, pan; weights w0 to w6; Embryo]
Hunchback, stages 4 to 6
[Figure: Target Prediction Coefficients. Labels: Trl, sna, hb, Mef2, prd, slp1; weights w0 to w5; Embryo]
slp1, stages 4 to 6
Charlie Frogner, Tom Morgan, Lorenzo Rosasco
Outline
1. Annotate regulatory regions
– Promoters, enhancers, insulators
2. Annotate chromatin states
– De novo learning of chromatin mark combinations
3. Predict TF/Chromatin binding
– Sequence -> TFs -> Chromatin -> Expression
4. Infer regulatory networks
– Integrate motifs, expression, chromatin
5. Predictive models of gene expression
– Chromatin/expression time-course
– Embryo expression domains
The challenge ahead
Anterior-Posterior
Dorsal-Ventral
Annotations & images for all expression patterns
Expression domain primitives reveal underlying logic
Binding sites of every developmental regulator
GAF, check
Su(Hw), check
BEAF-32, variant
Mod(mdg4), novel
CP190, novel
CTCF, check
Sequence motifs for every regulator
Understand regulatory logic specifying development
Drosophila modENCODE Analysis Group
Sue Celniker, Brenton Graveley, Steve Brenner, Michael Brent
Gary Karpen, Sarah Elgin, Mitzi Kuroda, Vince Pirrotta
Peter Park, Peter Kharchenko, Michael Tolstorukov, Eric Bishop
Kevin White, Casey Brown, Nicolas Negre, Nick Bild, Bob Grossman
Eric Lai, Nicolas Robine
David MacAlpine, Matthew Eaton
Steve Henikoff
Peter Bickel, Ben Brown
Lincoln Stein Group
Suzanna Lewis, Gos Micklem, Nicole Washington, EO Stinson, Marc Perry, Peter Ruzanov
AWG
Fly modENCODE
Chris Bristow, Pouya Kheradpour, Rachel Sealfon, Jason Ernst, Mike Lin, Stefan Washietl
Networks group
Rogerio Candeias, Daniel Marbach, Patrick Meyer, Sushmita Roy
Image analysis
Tom Morgan, Charlie Frogner, Lorenzo Rosasco