
Page 1: Computational models for the analysis of gene expression regulation and its alteration

Computational models for the analysis of gene expression regulation and its alteration

Anthony Mathelier

Centre for Molecular Medicine Norway (NCMM)
Nordic EMBL Partnership for Molecular Medicine

[email protected] @AMathelier

June 14th, 2016

Page 2: Computational models for the analysis of gene expression regulation and its alteration

One genome, multiple cells, transcriptomes, and proteomes

Source: http://stemcells.nih.gov

Sources: D. Melton, D. Pyott, Google figs, wikicommons.

In humans, more than 400 distinct cell types arise from a single totipotent cell. They all share the same genome but express different genes in a time- and space-specific manner.

Page 3: Computational models for the analysis of gene expression regulation and its alteration

Multiple layers of gene expression regulation

Transcriptional regulation: transcription factors, epigenetics, open chromatin, closed chromatin, etc.

Post-transcriptional regulation: miRNAs, mRNA localization, RNA-binding proteins, splicing, etc.

[Figure: schematic of a gene and its TSS, showing enhancers, transcription factors, RNA polymerase, the promoter, the messenger RNA strand (with 5' and 3' ends marked), the RNA silencing complex, and a miRNA. Adapted from Kelvin Song's work on Wikimedia Commons.]

Page 5: Computational models for the analysis of gene expression regulation and its alteration

Outline

1. Genome-scale identification and analyses of transcription factor binding site (TFBS) alterations
   - Identification and analysis of cis-regulatory mutations
   - Cis-regulatory mutations and gene expression alteration

2. The next generation of TFBS prediction
   - Transcription factor flexible models (TFFMs)
   - DNA shape features improve TFBS prediction in vivo

Page 6: Computational models for the analysis of gene expression regulation and its alteration

Genome-scale identification and analyses of TFBS alterations

[Figure: schematic showing transcription factors, the TSS, and a gene. Adapted from Kelvin Song's work on Wikimedia Commons.]

Page 7: Computational models for the analysis of gene expression regulation and its alteration

Whole genome sequencing (WGS) era

Figure from Atif Rahman.

- Previous analyses of patients' genomes focused on the ∼2% of the genome coding for proteins.
- It is becoming affordable to do WGS in the clinic.
- It is time to focus on regulatory mutations that alter the transcriptional regulation of gene expression.

Page 8: Computational models for the analysis of gene expression regulation and its alteration

Cis-regulatory mutations may impact gene expression

Adapted from Arenillas et al., poster at AMIA Conference, 2012.

One needs to accurately locate TFBSs to identify and characterize the regulatory sequences controlling the transcription of specific genes.

Page 10: Computational models for the analysis of gene expression regulation and its alteration

Genome-scale data capturing TFBSs: ChIP-seq

>Seq 1
AGTTCAAAGTTCAAGTTCAAAGTTCAAGTTCAAAGTTCAAGTTCAAAGTTCA
>Seq 2
AGTCCAAAGTTCAAGTCCAAAGTTCAAGTCCAAA
...
>Seq 33554
CTACCGGGGACCGGGGTGGAACCGGGG
>Seq 33555
ACCGGGGACCGGGGACCGGGGACCGGGGGGCAAGGTTCATA

Adapted from A.M. Szalkowski and C.D. Schmid, Briefings in Bioinformatics, 2010.

Page 11: Computational models for the analysis of gene expression regulation and its alteration

Modeling TFBSs

Position frequency matrices (PFMs) reflect the preferred binding motifs associated with TFs.
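As an illustration of what a PFM holds, the following minimal sketch counts nucleotides column by column over a set of aligned binding sites. The function name `build_pfm` and the example sites are hypothetical and are not taken from the talk; the sketch assumes equal-length sites containing only A, C, G, and T.

```python
def build_pfm(sites):
    """Count how often each nucleotide occurs at each position of aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site.upper()):
            if base not in pfm:
                raise ValueError(f"unexpected character {base!r} in site {site!r}")
            pfm[base][pos] += 1
    return pfm


# Hypothetical aligned binding sites (illustrative only).
sites = ["AGTTCA", "AGTCCA", "AGTTCA", "GGTTCA"]
pfm = build_pfm(sites)
for base in "ACGT":
    print(base, pfm[base])
```

Each row of the printed matrix is one nucleotide and each column one motif position, which is the layout commonly used for TF binding profiles.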

Page 12: Computational models for the analysis of gene expression regulation and its alteration

Scoring potential TFBSs
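The figure for this slide is not preserved in the transcript. A commonly used way to score candidate sites, sketched here under stated assumptions, is to convert a PFM into a log-odds position weight matrix (PWM) and sum the weights along each window of a sequence. The pseudocount, the uniform background, the counts, and all function names below are assumptions for illustration, not the settings used in the talk.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Turn per-position counts into log2-odds weights against a uniform background."""
    width = len(pfm["A"])
    pwm = {base: [0.0] * width for base in BASES}
    for pos in range(width):
        total = sum(pfm[base][pos] for base in BASES) + 4 * pseudocount
        for base in BASES:
            frequency = (pfm[base][pos] + pseudocount) / total
            pwm[base][pos] = math.log2(frequency / background)
    return pwm


def score_site(pwm, site):
    """Sum the weights of the observed base at each position of one candidate site."""
    return sum(pwm[base][pos] for pos, base in enumerate(site))


def scan(pwm, sequence):
    """Score every window of motif width along the sequence (forward strand only)."""
    width = len(pwm["A"])
    return [(start, score_site(pwm, sequence[start:start + width]))
            for start in range(len(sequence) - width + 1)]


# Hypothetical counts for a 6-bp motif (illustrative only).
pfm = {
    "A": [3, 0, 0, 0, 0, 4],
    "C": [0, 0, 0, 1, 4, 0],
    "G": [1, 4, 0, 0, 0, 0],
    "T": [0, 0, 4, 3, 0, 0],
}
pwm = pfm_to_pwm(pfm)
hits = scan(pwm, "CCAGTTCATT")
best_start, best_score = max(hits, key=lambda hit: hit[1])

# Comparing a reference site with a single-nucleotide variant of it.
reference = score_site(pwm, "AGTTCA")
mutated = score_site(pwm, "AGATCA")
```

The difference between `reference` and `mutated` is one simple way to gauge whether a mutation is likely to weaken a predicted site; a full analysis would also scan the reverse strand and apply a score threshold.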


Page 13: Computational models for the analysis of gene expression regulation and its alteration

JASPAR

Largest open-access database of manually curated TF binding profiles.

Mathelier et al., NAR, 2016.

Subset         # TF binding profiles
Vertebrates    519
Plants         227
Insects        133
Nematodes      26
Fungi          176
Urochordata    1
Total          1082

Page 14: Computational models for the analysis of gene expression regulation and its alteration

TFBSs as cis-regulatory elements

- 477 ChIP-seq data sets from ENCODE and the literature.
- 103 TFs with a JASPAR TF binding profile.
- 76,160,823 bp in TFBSs (∼2% of the human genome).
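The transcript does not say how the number of base pairs in TFBSs was computed. One plausible way, sketched below as an assumption, is to merge overlapping predicted sites so that each genomic position is counted once. The interval convention (half-open coordinates) and the function name `covered_bases` are illustrative choices.

```python
def covered_bases(intervals):
    """Count bases covered by half-open (chrom, start, end) intervals, overlaps counted once."""
    total = 0
    current = None  # (chrom, start, end) of the merged interval being extended
    for chrom, start, end in sorted(intervals):
        if current and chrom == current[0] and start <= current[2]:
            # Overlapping or adjacent on the same chromosome: extend the merged interval.
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1]
            current = (chrom, start, end)
    if current:
        total += current[2] - current[1]
    return total


# Hypothetical predicted TFBSs (illustrative coordinates only).
tfbs = [
    ("chr1", 100, 110),
    ("chr1", 105, 120),
    ("chr2", 100, 110),
    ("chr1", 200, 210),
]
print(covered_bases(tfbs))
```

Dividing such a count by the genome length gives the fraction of the genome annotated as TFBSs.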

Page 15: Computational models for the analysis of gene expression regulation and its alteration

Somatic mutations in B-cell lymphomas

- WGS (normal and tumour cells):
  - cohort 1: 40 diffuse large B-cell lymphomas
  - cohort 2: 44 B-cell lymphomas of mixed histology

  Morin et al., Blood, 2013; Richter et al., Nat. Genetics, 2012

- RNA-seq for the cancer samples
- 406,611 SNVs and 15,739 indels in cohort 1 samples
- 282,636 SNVs and 8,080 indels in cohort 2 samples

Page 16: Computational models for the analysis of gene expression regulation and its alteration

Promoters are frequent targets of cis-regulatory mutations

[Figure: two genome-wide plots over chromosomes 1 to 22, X, and Y, labelled A (Cohort 1) and B (Cohort 2), each with gene labels. Reported p-values, in slide order: p = 1.16 x 10^-75 and p = 3.28 x 10^-156.]

Genes labelled in the first plot: HIST1H1B, ST6GAL1, TMSB4X, ZFP36L1, NEDD9, BCL7A, RHOH, BIRC3, CIITA, IGLL5, BTG2, SGK1, CD74, BCL2, BCL6.

Genes labelled in the second plot: HIST1H1C, HIST1H1E, TMSB4X, ZFP36L1, BCL2L11, BZRAP1, DNMT1, ZNF860, NCOA3, FOXO1, BACH2, CXCR4, DUSP2, RFTN1, TCL1A, BCL7A, SOCS1, SEPT9, P2RX5, S1PR2, RHOH, EPS15, BIRC3, CIITA, DTX1, IGLL5, BTG2, BTG1, SGK1, CD74, CD83, BCL2, BCL6, PIM1, MYC, B2M, IRF1, IRF4, LTB, ID3.

- 75% of the genes previously described
- 13 new genes of interest frequently targeted in their promoters → 6 of them exclusively mutated in promoters.

Page 17: Computational models for the analysis of gene expression regulation and its alteration

Promoters of apoptotic, B-cell, and cancer pathway genes are frequent targets of cis-regulatory mutations

[Figure: enriched terms from GO Biological Process, OMIM, and KEGG Pathway (FDR < 0.05).]

Terms shown: lymphoma; small cell lung cancer; apoptosis; regulation of B cell proliferation; positive regulation of B cell proliferation; lymphocyte differentiation; negative regulation of lymphocyte apoptotic process; leukocyte activation; negative regulation of B cell apoptotic process; regulation of B cell activation; leukocyte differentiation; positive regulation of lymphocyte proliferation; regulation of cell growth; lymphocyte activation; T cell differentiation; positive regulation of B cell activation; negative regulation of intracellular signal transduction; positive regulation of leukocyte proliferation; regulation of type 2 immune response; regulation of B cell apoptotic process; negative regulation of T cell differentiation; positive regulation of mononuclear cell proliferation.

Page 18: Computational models for the analysis of gene expression regulation and its alteration

Landscape of mutations and altered gene expression

[Figure, panels A to C: schematic of a gene (TSS, TFBS, protein-coding exon, mutation) used as xseq input, and a matrix of genes (rows) by samples (columns, identifiers of the form SA32xxxx) with marginal counts. Mutation categories in the legend: protein coding (PC); PC and TFBS; PC and disrupting TFBS; disrupting TFBS; TFBS.]

Genes shown: MYC, EYS, TP53, PTPRD, SMARCA4, BCL6, RYR2, ITPKB, WWC1, FCGBP, SGK1, TBL1XR1, ID3, CSMD3, SIN3A, VPS13C, UNC5D, NBAS, MTOR, PPP1R16B, USP25, ASCC3, GPHN, DHX35, PEX2, XRCC4, PXDN, JRKL, WHSC1L1, FBXW11, SRP72, CRIM1, FOXO1, DGKD, PHIP, PYGL, USP15, BRD2, C2CD3, LMO4, FMN2, SRFBP1, N4BP2, CCNG1, RHOA, DUSP2, TGFBR2, ARRDC3, CADPS2, PIM1, STIM2, GNA13.

Ding et al., Nature Communications, 2015.

Mathelier et al., Genome Biology, 2015.

Page 20: Computational models for the analysis of gene expression regulation and its alteration

Landscape of mutations and altered gene expression

[Figure: pathways associated with the mutated and dysregulated genes.]

Pathway labels (first group): chronic myeloid leukemia; erbb signaling pathway; acute myeloid leukemia; pancreatic cancer; prostate cancer; endometrial cancer; glioma; ecm receptor interaction; focal adhesion; epithelial cell signaling in helicobacter pylori infection; renal cell carcinoma; small cell lung cancer; oxidative phosphorylation; colorectal cancer.

Pathway labels (second group): RB in Cancer; Integrated Breast Cancer Pathway; Androgen receptor signaling pathway; Focal Adhesion; EGF/EGFR Signaling Pathway; Integrated Pancreatic Cancer Pathway; Signaling Pathways in Glioblastoma; B Cell Receptor Signaling Pathway; Integrin-mediated Cell Adhesion; IL-4 Signaling Pathway; PDGF Pathway; Cardiac Hypertrophic Response; IL-2 Signaling Pathway; MAPK Signaling Pathway; Leptin signaling pathway; AGE/RAGE pathway; Type II interferon signaling; IL-3 Signaling Pathway; Oncostatin M Signaling Pathway; Alpha 6 Beta 4 signaling pathway.

Page 21: Computational models for the analysis of gene expression regulation and its alteration

Summary

- We analyzed ∼700,000 somatic mutations from 84 B-cell lymphoma samples
- We characterized a set of cis-regulatory elements from ChIP-seq
- Cis-regulatory mutations are enriched in promoter regions of genes involved in apoptosis or growth/proliferation
- We combined gene expression and mutation data from the coding and non-coding spaces
- We highlight candidate regulatory-disrupting variations dysregulating the gene expression program in cancer pathways

Page 22: Computational models for the analysis of gene expression regulation and its alteration

The next generation of TFBS prediction

[Figure: schematic showing transcription factors, the TSS, and a gene. Adapted from Kelvin Song's work on Wikimedia Commons.]

Page 23: Computational models for the analysis of gene expression regulation and its alteration

Dinucleotide dependencies

PWMs perform well but cannot model dinucleotide dependencies, which have been shown in:

- crystal structures of TF-DNA complexes (Luscombe et al., 2001)
- biochemical studies of specific proteins (Man and Stormo, 2001; Bulyk et al., 2002; Berger et al., 2006)
- statistical analyses of TRANSFAC and JASPAR TFBSs (Barash, 2003; Tomovic and Oakeley, 2007; Zhou and Liu, 2004)
- quantitative analysis of protein binding microarray data (Zhao et al., 2012)

Results from Zhao, Ruan, Pandey and Stormo, 2012:

- Interactions between neighbouring bases are stronger than others
- Improvements from considering dinucleotide dependencies are usually slight, with some significant exceptions
- Their method is not applicable to ChIP-seq data
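To make the notion of a dinucleotide dependency concrete, the sketch below measures the mutual information between two adjacent columns of aligned sites. This is a generic statistic chosen for illustration; it is not claimed to be the measure used in the cited studies, and the function name and example sites are hypothetical.

```python
import math
from collections import Counter


def adjacent_mutual_information(sites, pos):
    """Mutual information (in bits) between columns pos and pos + 1 of aligned sites."""
    n = len(sites)
    pair_counts = Counter((site[pos], site[pos + 1]) for site in sites)
    left_counts = Counter(site[pos] for site in sites)
    right_counts = Counter(site[pos + 1] for site in sites)
    mi = 0.0
    for (left, right), count in pair_counts.items():
        p_pair = count / n
        p_independent = (left_counts[left] / n) * (right_counts[right] / n)
        mi += p_pair * math.log2(p_pair / p_independent)
    return mi


# Two hypothetical sets of sites with identical per-column frequencies.
dependent = ["AT", "AT", "CG", "CG"]      # the second base is fixed by the first
independent = ["AT", "AG", "CT", "CG"]    # every combination is equally frequent

mi_dependent = adjacent_mutual_information(dependent, 0)
mi_independent = adjacent_mutual_information(independent, 0)
```

Both sets produce the same PFM, so a PWM cannot tell them apart, whereas the mutual information differs; this is the kind of signal a dinucleotide-aware model can exploit.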

Page 24: Computational models for the analysis of gene expression regulation and its alteration

Transcription Factor Flexible Models (TFFMs) for modeling TFBSs

1st-order HMM:

- A state models the background
- Each position in the motif is modeled by a state
- Parametrizable transition probabilities for staying in the background (bg/bg) and switching from background to foreground (bg/fg)
- States emit a nucleotide depending on the previous one

[Figure: HMM diagram with a background state (BG, emissions E0, self-transition bg/bg) linked by a bg/fg transition to a chain of motif states (position 1, position 2, ..., position n, with emissions E1, E2, ..., En); transitions between consecutive motif states have probability 1.]

Mathelier and Wasserman, PLOS Comp. Biol., 2014.
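The idea that each motif state emits a nucleotide depending on the previous one can be sketched with a position-specific first-order Markov chain. The code below is a deliberate simplification and not the TFFM implementation: it omits the background state and any HMM training procedure, estimates probabilities by direct counting from pre-aligned sites, and uses hypothetical names and a hypothetical pseudocount.

```python
import math

BASES = "ACGT"


def train_first_order(sites, pseudocount=0.1):
    """Estimate P(base) at the first position and P(base | previous base) at later ones."""
    width = len(sites[0])
    first = {base: pseudocount for base in BASES}
    conditional = [
        {prev: {base: pseudocount for base in BASES} for prev in BASES}
        for _ in range(width - 1)
    ]
    for site in sites:
        first[site[0]] += 1
        for i in range(1, width):
            conditional[i - 1][site[i - 1]][site[i]] += 1
    # Normalise counts into probabilities.
    first_total = sum(first.values())
    first = {base: count / first_total for base, count in first.items()}
    for table in conditional:
        for prev in BASES:
            row_total = sum(table[prev].values())
            table[prev] = {base: count / row_total
                           for base, count in table[prev].items()}
    return first, conditional


def log_prob(model, site):
    """Natural-log probability of a site under the position-specific first-order chain."""
    first, conditional = model
    total = math.log(first[site[0]])
    for i in range(1, len(site)):
        total += math.log(conditional[i - 1][site[i - 1]][site[i]])
    return total


# Hypothetical aligned sites in which the second base depends on the first.
model = train_first_order(["AT"] * 5 + ["CG"] * 5)
seen = log_prob(model, "AT")
unseen = log_prob(model, "AG")
```

With these sites, `AT` scores far higher than `AG`, even though A at the first position and G at the second are each frequent on their own; a model with independent positions would score the two equally.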

Page 25: Computational models for the analysis of gene expression regulation and its alteration

TFFMs workflow from ChIP-seq data

Sequences:

>HNF4A 1
AGTTCAAAGTTCA
>HNF4A 2
AGTCCAAAGTTCA
...
>HNF4A 73554
CTTGGAACCGGGG
>HNF4A 73555
GGCAAGGTTCATA

[Figure: workflow from ChIP-seq sequences to TFFMs (HMM with a background state BG and motif states position 1 to position n) and then to logos (panels A to C; motif positions 1 to 14; y-axis in bits, 0 to 2).]

Mathelier and Wasserman, PLOS Comp. Biol., 2014.

Page 26: Computational models for the analysis of gene expression regulation and its alteration

TFFMs are informative

TFFMs make it possible to analyze dependencies between neighbouring nucleotides:

[Figure: TFFM logos over motif positions 1 to 14; y-axis in bits (0 to 2).]

Mathelier and Wasserman, PLOS Comp. Biol., 2014.

Page 32: Computational models for the analysis of gene expression regulation and its alteration

TFFMs perform better than weight matrices

[Figure: AUC ratio (y-axis, 0.70 to 1.00) across 96 ChIP-seq data sets, comparing the 1st-order TFFM and the detailed TFFM with the PWM and the DWM. Legend: TFFMs, Similar, WMs.]

Mathelier and Wasserman, PLOS Comp. Biol., 2014.

Page 33: Computational models for the analysis of gene expression regulation and its alteration

Allowing for flexible-length motifs: JunD

[Figure: ROC curves (sensitivity against specificity, both 0 to 100) for the 1st-order HMM, detailed HMM, PFM, DWM, flexible 1st-order HMM, flexible detailed HMM, and GLAM2, with motif logos over positions 1 to 14 annotated 70% and 30%.]

Method                    AUC
1st-order TFFM            65.41
PWM                       64.47
DWM                       64.92
flexible 1st-order HMM    71.57

Mathelier and Wasserman, PLOS Comp. Biol., 2014.
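For readers unfamiliar with the AUC values above, the sketch below computes the area under a ROC curve directly from two lists of scores, using its interpretation as the probability that a randomly chosen positive scores higher than a randomly chosen negative. The evaluation protocol of the study (how foreground and background sequences were chosen) is not given in the transcript; the function name and scores are hypothetical, and the result is a fraction between 0 and 1 rather than a percentage.

```python
def roc_auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for positive in positive_scores:
        for negative in negative_scores:
            if positive > negative:
                wins += 1.0
            elif positive == negative:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical best-hit scores for ChIP-seq peaks and for background sequences.
peak_scores = [7.3, 5.1, 6.8, 2.0]
background_scores = [1.2, 3.4, 5.1, 0.7]
auc = roc_auc(peak_scores, background_scores)
```

The pairwise loop is quadratic in the number of sequences; for large data sets a rank-based computation gives the same value more efficiently.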

Page 34: Computational models for the analysis of gene expression regulation and its alteration

The TFFM framework

The Python framework (http://cisreg.ca/TFFM/doc/) is based on the continuously maintained GHMM library (A. Schliep's group) and allows:

- constructing TFFMs from a set of ChIP-seq sequences
- predicting TFBSs within a set of input sequences using a set of TFFMs
- constructing logos associated with a TFFM

Page 35: Computational models for the analysis of gene expression regulation and its alteration

Transcription Factor Flexible Models

- Pros:
  - Consider dinucleotide composition
  - Are able to model flexible-length TFBSs
  - Better predict TFBSs than PWMs overall
  - Generate scores consistent with DNA-protein binding affinities measured experimentally
  - Able to easily compute probabilities of occupancy
- Cons:
  - Need larger data sets than PWMs to train the underlying HMM
  - Need to define a threshold on the TFFM score when making predictions
  - Sequence-based only

Page 36: Computational models for the analysis of gene expression regulation and its alteration

DNA shape features

The DNAshape tool predicts four DNA shape features in a high-throughput manner.

The DNA shape features considered are:

- Minor Groove Width (MGW)
- Roll
- Propeller Twist (ProT)
- Helix Twist (HelT)

T. Zhou et al., Nucl. Acids Res., 2013.

Page 37: Computational models for the analysis of gene expression regulation and its alteration

Combining DNA sequence and structure to predict TFBSs

Studies have shown the importance of DNA shape for modeling TFBSs, using data from:

- SELEX-seq experiments.
- Protein-binding microarray experiments.
- BunDLE-seq experiments.

N. Abe et al., Cell, 2015. T. Zhou et al., PNAS, 2015. M. Levo et al., Genome Res., 2015.

Aims of our study:

- Construct computational models from large-scale in vivo data (ChIP-seq) by combining DNA sequence and shape features.
- Show TFBS prediction improvements on in vivo data.
- Analyze whether DNA shape-induced improvements are TF family specific.

Page 39: Computational models for the analysis of gene expression regulation and its alteration

Combining DNA sequence and shape features at TFBSs

[Figure: construction of the feature vector for a TFBS hit. Example sequence: A G A A G C C A G A A A A G G C A C C C A (araTha10 Chr1: 27,243,678 - 27,243,702). For a hit of n nucleotides, the PSSM / TFFM hit score is combined with eight per-nucleotide tracks (MGW, 2nd order MGW, ProT, 2nd order ProT, Roll, 2nd order Roll, HelT, 2nd order HelT) into a feature vector of 8n+1 features.]
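The 8n+1 count in the figure follows from one hit score plus eight shape tracks of n values each. The sketch below assembles such a vector in the track order shown on the slide. How the shape values are predicted, computed at second order, or normalised is not described in the transcript, so the inputs are hypothetical placeholders and the function name is illustrative.

```python
SHAPE_FEATURES = ("MGW", "ProT", "Roll", "HelT")


def build_feature_vector(hit_score, first_order, second_order):
    """Concatenate the hit score with per-nucleotide shape tracks: 1 + 4n + 4n = 8n + 1."""
    n = len(first_order[SHAPE_FEATURES[0]])
    vector = [hit_score]
    for name in SHAPE_FEATURES:
        for track in (first_order[name], second_order[name]):
            if len(track) != n:
                raise ValueError(f"track {name} must have {n} values")
            vector.extend(track)
    return vector


# Hypothetical shape values for a 3-nucleotide hit (placeholders, not real predictions).
n = 3
first_order = {name: [0.5] * n for name in SHAPE_FEATURES}
second_order = {name: [0.25] * n for name in SHAPE_FEATURES}
features = build_feature_vector(12.7, first_order, second_order)
print(len(features))
```

A vector of this form can then be passed to any classifier that separates bound from unbound sites.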

Page 40: Computational models for the analysis of gene expression regulation and its alteration

Overview of the ChIP-seq data sets considered

400 ENCODE ChIP-seq experiments are considered; the ChIP'ed TFs come from 24 TF families.

Family                        # exp
BetaBetaAlpha-zinc Finger     146
Helix-Loop-Helix              56
Leucine Zipper                55
ETS                           21
Stat                          13
GATA                          12
Rel                           10
E2F                           10
NFY CCAAT-binding             8
Hormone-nuclear Receptor      8
Forkhead                      8
High Mobility Group (Box)     8
Homeodomain                   7
MADS                          7
RFX                           5
TATA-binding                  5
NRF                           5
IRF                           4
Other                         4
STAT                          4
Arid                          2
HSF                           1
THAP                          1

Page 41: Computational models for the analysis of gene expression regulation and its alteration

Considering DNA shape features improves predictive power

[Figure: panels A and B.]

DNA shape features are most beneficial for the E2F and MADS families.

Page 43: Computational models for the analysis of gene expression regulation and its alteration

DNA shape features at TFBS flanking sequences further improve discriminative power

[Figure: panels A and B.]

Page 44: Computational models for the analysis of gene expression regulation and its alteration

Summary

- Our analyses of ChIP-seq data represent the in vivo counterpart of the published in vitro studies.
- We show that combining DNA sequence and shape features improves the prediction of TFBSs within ChIP-seq peaks.
- We highlight that TFs from the E2F and MADS families benefit most from incorporating DNA shape information to predict TFBSs.
- We further improve predictive power when incorporating DNA shape features at sequences flanking TFBSs.

Page 45: Computational models for the analysis of gene expression regulation and its alteration

Acknowledgements

Wyeth Wasserman, Allen Zhang, David Arenillas, Rebecca Worsley-Hunt, all the Wasserman lab

Sohrab Shah, Calvin Lefebvre, Jiarui Ding

Boris Lenhard, Ge Tan

Albin Sandelin, Xiaobei Zhao, Sandelin lab

Remo Rohs, Beibei Xin, Tsu-Pei Chiu, Lin Yang

François Parcy

Centre for Molecular Medicine and Therapeutics

Page 46: Computational models for the analysis of gene expression regulation and its alteration

Thank you

[Figure: schematic showing transcription factors, the TSS, and a gene. Adapted from Kelvin Song's work on Wikimedia Commons.]