Drosophila modENCODE Data Integration. Manolis Kellis on behalf of: modENCODE Analysis Working Group (AWG), modENCODE Data Analysis Center (DAC), MIT Computer Science & Artificial Intelligence Laboratory, Broad Institute of MIT and Harvard

Upload: wilfrid-mills

Post on 14-Jan-2016


Page 1: Drosophila modENCODE Data Integration Manolis Kellis on behalf of: modEncode Analysis Working Group (AWG) modEncode Data Analysis Center (DAC) MIT Computer

Drosophila modENCODE Data Integration

Manolis Kellis on behalf of:

modENCODE Analysis Working Group (AWG)

modENCODE Data Analysis Center (DAC)

MIT Computer Science & Artificial Intelligence Laboratory

Broad Institute of MIT and Harvard

Page 2

Organism goes here

mod/ENCODE (a.k.a. everything you wanted to know about gene regulation but were afraid to ask)

Page 3

The challenge ahead

[Figure axes: Anterior-Posterior, Dorsal-Ventral]

Annotations & images for all expression patterns

Expression domain primitives reveal underlying logic

Binding sites of every developmental regulator

GAF, check

Su(Hw), check

BEAF-32, variant

Mod(mdg4), novel

CP190, novel

CTCF, check

Sequence motifs for every regulator

Understand regulatory logic specifying development

Page 4

The components of genomes and gene regulation

Goal: A systems-level understanding of genomes and gene regulation:
• The regulators: TFs, GFs, miRNAs, their specificities
• The regions: enhancers, promoters, insulators
• The targets: individual regulatory motif instances
• The grammars: combinations predictive of tissue-specific activity

The parts list = Building blocks of gene regulation

Our tools: Comparative genomics & large-scale experimental datasets.
• Evolutionary signatures for promoter/enhancer/3' UTR motif annotation
• Chromatin signatures for integrating histone modification datasets
• Sequence signatures associated with TF binding, chromatin, dynamics
• Infer regulatory networks, their temporal and spatial dynamics

Integrate diverse datasets

Page 5

Outline

1. Annotate regulatory regions
– Promoters, enhancers, insulators

2. Annotate chromatin states
– De novo learning of chromatin mark combinations

3. Predict TF/Chromatin binding
– Sequence -> TFs -> Chromatin -> Expression

4. Infer regulatory networks
– Integrate motifs, expression, chromatin

5. Predictive models of gene expression
– Chromatin/expression time-course
– Embryo expression domains

Page 6

Annotate Regulatory Regions

Promoters, enhancers, insulators

Page 7

TFs and Chromatin together define enhancer regions

• Evaluate predictive power of TFs/GFs/Chromatin marks in recovery of known enhancers (REDfly, Furlong)

• Combinations across features show maximum performance

• New enhancers also supported by patterning, motifs

Rachel Sealfon, Chris Bristow

Enrichment in individual features

All features + conservation

All features

Chromatin marks + Remodeling factors

TFs

TFs + Remodeling

Combinations of features improve performance

Fraction of predictions that are true (precision)

Number of true enhancers recovered (recall)
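The two axes above (precision against the number of true enhancers recovered) can be traced out by ranking candidate regions by score and walking down the list. The following is a minimal sketch of that bookkeeping only; the scores and labels are hypothetical toy values, not modENCODE data, and the function name is an assumption made for illustration.

```python
def precision_recall_curve(scores, labels):
    """Return (true enhancers recovered, precision) after each prediction.

    scores: hypothetical per-region prediction scores (higher = more enhancer-like).
    labels: 1 if the region overlaps a known enhancer, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    curve = []
    true_positives = 0
    for n_predicted, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        curve.append((true_positives, true_positives / n_predicted))
    return curve


# Toy example: five candidate regions, three of them known enhancers.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
curve = precision_recall_curve(scores, labels)
for recovered, precision in curve:
    print(recovered, round(precision, 2))
```

Running the same function on scores from different feature sets (TFs alone, chromatin marks alone, all features) would give one curve per set, which is how a comparison like the one on this slide could be drawn.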

Page 8

Chromatin marks reveal novel/refined promoters

• Chromatin-based annotation of active promoter regions
• Reveal microRNA precursors, lowly-expressed genes, alternate starts
• Reveal promoter regions even in absence of CAGE/RACE datasets

Combine shape and intensity information from six chromatin marks, CBP, and PolII

Datasets: positive / negative

Predictions confirmed w/TSS expression, even when CAGE/RACE data is missing

Chris Bristow

Previously-annotated TSS

Chromatin-based promoter prediction

Transcript support from multiple stages

No CAGE/RACE evidence

Page 9

Annotate Chromatin States

De novo learning of mark combinations

Page 10

De novo chromatin states from mark combinations

•Learn de novo significant combinations of chromatin marks

•Reveal functional elements, even without looking at sequence

•Use for genome annotation

•Use for studying regulation dynamics in different cell types

Promoter states

Transcribed states

Active Intergenic

Repressed

Jason Ernst

Page 11

Each chromatin state associated w/ distinct function

• Reveals several classes of promoters, enhancers
• Distinct marks in transcripts, exons/introns, 5'/3' UTRs
• Distinguish inactive, repressed, heterochromatin

Tentative annotations

Jason Ernst, Gary Karpen

Frequency of each chromatin mark

20 different chromatin states

Annotation enrichments
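A "frequency of each chromatin mark" table like the one on this slide can be tabulated once every genomic bin has a state label. The sketch below shows only that summary step, not the de novo learning of the states, which the slides do not spell out. The state names, the mark names, and the binary present/absent encoding are assumptions chosen for illustration.

```python
from collections import defaultdict


def mark_frequency_by_state(states, marks):
    """Fraction of bins in each state where each mark is present.

    states: one state label per genomic bin (hypothetical labels).
    marks:  one dict per bin mapping mark name -> 0/1 (absent/present).
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for state, present in zip(states, marks):
        totals[state] += 1
        for mark, value in present.items():
            counts[state][mark] += value
    return {
        state: {mark: counts[state][mark] / totals[state] for mark in counts[state]}
        for state in totals
    }


# Toy example: four bins, two illustrative states, two illustrative marks.
states = ["promoter", "promoter", "repressed", "repressed"]
marks = [
    {"H3K4me3": 1, "H3K27me3": 0},
    {"H3K4me3": 1, "H3K27me3": 0},
    {"H3K4me3": 0, "H3K27me3": 1},
    {"H3K4me3": 0, "H3K27me3": 0},
]
freq = mark_frequency_by_state(states, marks)
print(freq)
```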

Page 12

Positional enrichments of each chromatin state

Jason Ernst, Gary Karpen

Page 13

Functional enrichments of different chromatin states

• Developmental patterning regulators enriched in specific states

• Different general factors associated with active/repressed states

• Insulator proteins associated with wide range of chromatin marks

• Replication origins associated with promoter/enhancer regions

• Specific regulatory motifs associated with enhancer/repressed regions

DV regulators AP regulators General TFs Insulators Replication Motifs

Analysis: Jason Ernst, Pouya Kheradpour

Data: David MacAlpine, Kevin White, Gary Karpen

Page 14

Predictive models of TF/Chromatin

Sequence -> TFs -> Chromatin -> Expression

Page 15

Transcription Factor binding highly combinatorial

• Extensive cross-enrichment suggests cross-talk between motifs of different TFs

• Enriched and depleted motifs predictive of TF binding

• TF binding prediction increases with motif combinations

• Both synergistic and antagonistic effects

Pouya Kheradpour, Rachel Sealfon

[Figure: motif enrichment vs. transcription factor binding; color scale: fold enrichment, 2^-4 to 2^4]

[Figure: TF binding prediction by motif set: Top motif; Known motif; Top 5 motifs; Top/bottom 5 motifs; All above 1.5-fold; All above/below 1.5-fold; All motifs]
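Fold enrichment, as used in the figure labels above, compares how often a motif occurs in TF-bound regions with how often it occurs in background regions. The sketch below is one simple way to compute it and to apply the 1.5-fold cutoff named on the slide; the counts are invented, and treating "below 1.5-fold" as a ratio under 1/1.5 is an assumption about how depletion was defined.

```python
def fold_enrichment(hits_bound, n_bound, hits_background, n_background):
    """Ratio of motif frequency in bound regions to frequency in background."""
    if n_bound == 0 or n_background == 0 or hits_background == 0:
        raise ValueError("need non-empty region sets and at least one background hit")
    return (hits_bound / n_bound) / (hits_background / n_background)


def classify(fold, threshold=1.5):
    """Label a motif as enriched, depleted, or neither at a fold threshold."""
    if fold >= threshold:
        return "enriched"
    if fold <= 1.0 / threshold:
        return "depleted"
    return "neither"


# Toy counts: motif found in 50 of 100 bound regions, 125 of 1000 background regions.
fold = fold_enrichment(50, 100, 125, 1000)
print(fold, classify(fold))
```

Keeping both enriched and depleted motifs as features matches the slide's point that depleted motifs are also predictive of binding.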

Page 16

[Figure: heatmap of enrichment values, 20 columns; row and column labels not recoverable]

Combinations of TFs predictive of chromatin states

Trx in enhancer states

BEAF/Chro in TSS for ubiquitous genes

Polycomb states enriched for enhancers

AP-state 60-fold enriched in enhancers

Ubiquitous genes enriched for multiple states

Strong Su(Hw) in Negative outside promoter states

• Spatial clustering of TF combinations
• Compare to chromatin states (clusters of chromatin marks)
• TF sets and chromatin states highly predictive of each other

Jason Ernst, Chris Bristow

Page 17

Chromatin strong predictor of expression state, not level

• Gene expression level distribution largely bimodal

• Predict presence/absence: chromatin marks in promoter region are a very strong predictor (AUC>0.98)

• Predict expression magnitude: only ~60% of variation explained by promoter marks; many other levels of regulation

Peter Kharchenko, Peter Park
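The AUC quoted above measures how well a score separates expressed from silent genes: it is the probability that a randomly chosen expressed gene scores higher than a randomly chosen silent one. A minimal sketch of that definition follows, using hypothetical chromatin-based scores; the slides do not say how the reported AUC was computed.

```python
def auc(scores, labels):
    """Area under the ROC curve by direct pairwise comparison.

    scores: hypothetical promoter chromatin scores, one per gene.
    labels: 1 if the gene is expressed, 0 if silent.
    Ties between an expressed and a silent gene count as half a win.
    """
    expressed = [s for s, y in zip(scores, labels) if y == 1]
    silent = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if e > s else 0.5 if e == s else 0.0
        for e in expressed
        for s in silent
    )
    return wins / (len(expressed) * len(silent))


# Toy example: one expressed gene scores below one silent gene.
scores = [0.9, 0.8, 0.3, 0.6, 0.1]
labels = [1, 1, 1, 0, 0]
toy_auc = auc(scores, labels)
print(round(toy_auc, 3))
```

The pairwise form is quadratic in the number of genes, which is fine for illustration; a rank-based formula gives the same value for genome-scale data.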

Page 18

Inferring regulatory networks

Integrate motifs, expression, chromatin

Page 19

3. Data integration for improved network prediction

TF Target

Input features used:
• Conserved TF motif in target
• ChIP binding of TF in target
• TF/target co-chromatin marks
• TF/target co-expression

Training set:
• Edges found in REDfly network

Test set:
• Cross-validation

Daniel Marbach, Sushmita Roy, Patrick Meyer, Rogerio Candeias
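The setup above (four evidence types per TF-target pair, known edges as positives) maps naturally onto a binary classifier. The sketch below fits a plain logistic regression by gradient descent on invented feature vectors, purely to show the shape of the computation. The feature values, learning rate, and epoch count are assumptions, and cross-validation is left out for brevity.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, features):
    """Probability that a TF -> target edge is real, given its features."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def fit_logistic(examples, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss."""
    n_features = len(examples[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(examples, labels):
            error = predict(weights, bias, x) - y
            grad_b += error
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
        bias -= lr * grad_b / len(examples)
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad_w)]
    return weights, bias


# Hypothetical features per candidate edge:
# [conserved motif, ChIP binding, co-chromatin score, co-expression score]
positives = [[1, 1, 0.8, 0.9], [1, 0, 0.7, 0.8], [0, 1, 0.9, 0.6]]
negatives = [[0, 0, 0.1, 0.2], [0, 0, 0.3, 0.1], [1, 0, 0.1, 0.0]]
examples = positives + negatives
labels = [1] * len(positives) + [0] * len(negatives)

weights, bias = fit_logistic(examples, labels)
print([round(w, 2) for w in weights], round(bias, 2))
```

The fitted weights give a rough sense of how much each evidence type contributes, which is the kind of per-feature comparison the next slide reports.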

Page 20

Integration improves precision and recall

• Linear/logistic regression best, similar to each other; use logistic regression
• Predictive power of individual features:
– Best: Evolutionarily-conserved motifs
– Next: chromatin time-course, ChIP-chip for TFs
– Next: chromatin cell-lines, expression data (RNA-seq and microarrays)

• Conclusion: Experimental datasets together dramatically improve performance

Comparison of integration methods

~10% recovery at ~40% precision

~60% recovery at ~20% precision

Comparison of individual features

Daniel Marbach, Sushmita Roy, Patrick Meyer, Rogerio Candeias

Page 21

• ChIP-grade quality
– Similar functional enrichment
– High sens. High spec.

• Systems-level
– 81% of Transc. Factors
– 86% of microRNAs
– 8k + 2k targets
– 46k connections

• Lessons learned
– Pre- and post- are correlated (hihi/lolo)
– Regulators are heavily targeted, feedback loop

Initial regulatory network for an animal genome

Pouya Kheradpour, Sushmita Roy, Alex Stark

Page 22

Predictive models of gene regulation

Chromatin/expression time-course

Embryo expression domains

Page 23

1. Chromatin time-course reveals stage regulators

Fold enrichment or over expression

• abd-A motif is enriched in new H3K27me3 regions at L2
– Coincides with a drop in the expression of abd-A
– Model: sites gain H3K27me3 as abd-A binding lost

• Additional intriguing stories found, to be explored

H3K27me3

Pouya Kheradpour

Page 24

2. Predicting changes in time-series expression

• Integrate TF-target motif associations with time-course
• Predict positive/negative regulators at each split

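[Figure: time-series expression tree with regulator labels at each split]

Adf1, Trl, Vnd, Tin, Abd-A, Hmx, CG11085, CG34031, En, Mad, Grh, Btd, Abd-B, Ftz, Antp, …

Adf1

Trl

Adf1, E2F

Adf1, E2F

gt3Dref

Dref

gt, sna, trl, esg

Dref

trl, adf1, byn, sna, tin, Vnd, Inv, Twi, …

Dref

Kr

Adf1, E2F

tin, vnd

exexenhgt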

Notice: Adf1 targets appear positively then negatively regulated. Consistent with changes in Adf1 expression (not an input to model)

Adf1 activator is ON (targets induced)

Adf1 activator is OFF (targets not induced)

Jason Ernst

Page 25

Target Prediction Coefficients

[Figure: panel labels en, bap, tin, Mef2, twi, Snail; coefficients w0, w1, w2, w3, w4, w5; Embryo]

Predictive power of inferred network

• Predict target expression as a linear combination of TFs, fit w_i

• Future: can motif grammars predict weights directly?

Snail, stages 4 to 6

Charlie Frogner, Tom Morgan, Lorenzo Rosasco
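The model on this slide writes the target's expression at each embryo position as w0 plus a weighted sum of TF expression levels. One simple way to fit such weights is ordinary least squares, sketched below with a small hand-written solver. The slides do not state which fitting method was used, and the two TF profiles and target values here are invented so that the true weights are known.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_linear(tf_levels, target):
    """Least-squares weights [w0, w1, ...] via the normal equations.

    tf_levels: one list of TF expression levels per embryo position.
    target:    target gene expression at the same positions.
    """
    rows = [[1.0] + list(levels) for levels in tf_levels]  # leading 1 gives w0
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


# Toy data: two hypothetical TFs at five positions; target = 0.5 + 2*tf1 - 1*tf2.
tf_levels = [[0, 1], [1, 0], [2, 1], [3, 0], [4, 1]]
target = [0.5 + 2 * a - 1 * b for a, b in tf_levels]
w = fit_linear(tf_levels, target)
print([round(v, 6) for v in w])
```

A positive weight would read as an activating contribution and a negative weight as a repressive one, under the strong assumption that the relationship is linear.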

Page 26

Additional examples: striped, changing coeffs

[Figure: Hunchback, stages 4 to 6. Target Prediction Coefficients; panel labels Adf1, sna, cad, twi, bcd, hb, pan; coefficients w0, w1, w2, w3, w4, w5, w6; Embryo]

[Figure: slp1, stages 4 to 6. Target Prediction Coefficients; panel labels Trl, sna, hb, Mef2, prd, slp1; coefficients w0, w1, w2, w3, w4, w5; Embryo]

Charlie Frogner, Tom Morgan, Lorenzo Rosasco

Page 27

Outline

1. Annotate regulatory regions
– Promoters, enhancers, insulators

2. Annotate chromatin states
– De novo learning of chromatin mark combinations

3. Predict TF/Chromatin binding
– Sequence -> TFs -> Chromatin -> Expression

4. Infer regulatory networks
– Integrate motifs, expression, chromatin

5. Predictive models of gene expression
– Chromatin/expression time-course
– Embryo expression domains

Page 28

The challenge ahead

[Figure axes: Anterior-Posterior, Dorsal-Ventral]

Annotations & images for all expression patterns

Expression domain primitives reveal underlying logic

Binding sites of every developmental regulator

GAF, check

Su(Hw), check

BEAF-32, variant

Mod(mdg4), novel

CP190, novel

CTCF, check

Sequence motifs for every regulator

Understand regulatory logic specifying development

Page 29

Drosophila modENCODE Analysis Group

Sue Celniker, Brenton Graveley, Steve Brenner, Michael Brent

Gary Karpen, Sarah Elgin, Mitzi Kuroda, Vince Pirrotta

Peter Park, Peter Kharchenko, Michael Tolstorukov, Eric Bishop

Kevin White, Casey Brown, Nicolas Negre, Nick Bild, Bob Grossman

Eric Lai, Nicolas Robine

David MacAlpine, Matthew Eaton

Steve Henikoff

Peter Bickel, Ben Brown

Lincoln Stein Group: Suzanna Lewis, Gos Micklem, Nicole Washington, EO Stinson, Marc Perry, Peter Ruzanov

AWG

Fly modENCODE

Chris Bristow, Pouya Kheradpour, Rachel Sealfon, Jason Ernst, Mike Lin, Stefan Washietl

Networks group: Rogerio Candeias, Daniel Marbach, Patrick Meyer, Sushmita Roy

Image analysis: Tom Morgan, Charlie Frogner, Lorenzo Rosasco