Computational models for the analysis of gene expression regulation and its alteration
Anthony Mathelier
Centre for Molecular Medicine Norway (NCMM)
Nordic EMBL Partnership for Molecular Medicine
[email protected] @AMathelier
2016 June 14th
One genome, multiple cells, transcriptomes, and proteomes
Source: http://stemcells.nih.gov
Sources: D. Melton, D. Pyott, Google figs, wikicommons.
In humans, more than 400 distinct cell types arise from a single totipotent cell. They all share the same genome but express different genes in a time- and space-specific manner.
Multiple layers of gene expression regulation
Transcriptional regulation: transcription factors, epigenetics, open chromatin, closed chromatin, etc.
Post-transcriptional regulation: miRNAs, mRNA localization, RNA-binding proteins, splicing, etc.
[Figure: enhancers, transcription factors, RNA polymerase, promoter, TSS, gene, messenger RNA strand with 5' and 3' ends labelled, miRNA, RNA silencing complex. Adapted from Kelvin Song's work on Wikimedia Commons.]
Outline
1. Genome-scale identification and analyses of transcription factor binding site (TFBS) alterations
   - Identification and analysis of cis-regulatory mutations
   - Cis-regulatory mutations and gene expression alteration
2. The next generation of TFBS prediction
   - Transcription factor flexible models (TFFMs)
   - DNA shape features improve TFBS prediction in vivo
Genome-scale identification and analyses of TFBS alterations
[Figure: transcription factors, TSS, gene. Adapted from Kelvin Song's work on Wikimedia Commons.]
Whole genome sequencing (WGS) era
Figure from Atif Rahman.
- Previous analyses of patients' genomes focused on the ~2% of the genome coding for proteins.
- WGS is becoming affordable in the clinic.
- It is time to focus on regulatory mutations that alter the transcriptional regulation of gene expression.
Cis-regulatory mutations may impact gene expression
Adapted from Arenillas et al., poster at AMIA Conference, 2012.
One needs to accurately locate TFBSs to identify and characterize the regulatory sequences controlling the transcription of specific genes.
Genome-scale data capturing TFBSs: ChIP-seq
>Seq 1
AGTTCAAAGTTCAAGTTCAAAGTTCAAGTTCAAAGTTCAAGTTCAAAGTTCA
>Seq 2
AGTCCAAAGTTCAAGTCCAAAGTTCAAGTCCAAA ...
>Seq 33554
CTACCGGGGACCGGGGTGGAACCGGGG
>Seq 33555
ACCGGGGACCGGGGACCGGGGACCGGGGGGCAAGGTTCATA
Adapted from A.M. Szalkowski and C.D. Schmid, Briefings in Bioinformatics, 2010.
Modeling TFBSs
Position frequency matrices (PFMs) reflect the preferred binding motifs associated with TFs.
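As a minimal illustration of what a PFM contains, the sketch below counts nucleotides column by column over a few aligned binding sites. The four toy 8-mers and the function name `build_pfm` are assumptions made for illustration; they are not JASPAR data or part of any published tool.

```python
from collections import Counter

def build_pfm(sites):
    """Return a position frequency matrix as {nucleotide: [count per column]}."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    pfm = {nt: [0] * length for nt in "ACGT"}
    for column in range(length):
        # Count how often each nucleotide occurs in this alignment column.
        counts = Counter(site[column] for site in sites)
        for nt in "ACGT":
            pfm[nt][column] = counts[nt]
    return pfm

# Toy aligned sites (illustrative only).
toy_sites = ["AAACAGAT", "TATCTGTT", "GAGCGGGT", "CACCCGCT"]
pfm = build_pfm(toy_sites)
for nt in "ACGT":
    print(nt, pfm[nt])
```

Each column of the matrix sums to the number of aligned sites, so fully conserved positions stand out as columns where a single nucleotide holds the whole count.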
Scoring potential TFBSs
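The slide does not spell out the scoring scheme. A common choice, assumed here, is to turn the PFM into a log-odds position weight matrix (PWM) and sum the per-position weights along a candidate site. In this sketch the toy PFM, the pseudocount of 1 split evenly over the four nucleotides, the uniform background of 0.25, and all function names are illustrative assumptions.

```python
import math

def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights against a uniform background."""
    length = len(pfm["A"])
    pwm = {nt: [0.0] * length for nt in "ACGT"}
    for col in range(length):
        total = sum(pfm[nt][col] for nt in "ACGT") + pseudocount
        for nt in "ACGT":
            # The pseudocount is shared equally between the four nucleotides.
            prob = (pfm[nt][col] + pseudocount / 4) / total
            pwm[nt][col] = math.log2(prob / background)
    return pwm

def score_site(pwm, site):
    """Sum the weights of the observed nucleotide at each position."""
    return sum(pwm[nt][col] for col, nt in enumerate(site))

def best_hit(pwm, sequence):
    """Slide the PWM along the sequence; return (best score, start index)."""
    width = len(pwm["A"])
    return max(
        (score_site(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Toy PFM (illustrative counts, 4 aligned sites of length 8).
toy_pfm = {
    "A": [1, 4, 1, 0, 1, 0, 1, 0],
    "C": [1, 0, 1, 4, 1, 0, 1, 0],
    "G": [1, 0, 1, 0, 1, 4, 1, 0],
    "T": [1, 0, 1, 0, 1, 0, 1, 4],
}
pwm = pfm_to_pwm(toy_pfm)
reference = "CACCCGCT"
mutated = "CACGCGCT"  # C>G substitution at a conserved position
print(score_site(pwm, reference), score_site(pwm, mutated))
```

Scoring the reference and the mutated allele with the same matrix gives a simple score difference, which is one way to flag a mutation that may disrupt a predicted TFBS.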
JASPAR
Largest open-access database of manually curated TF binding profiles.
Mathelier et al., NAR, 2016.
Subset         # TF binding profiles
Vertebrates    519
Plants         227
Insects        133
Nematodes      26
Fungi          176
Urochordata    1
Total          1082
TFBSs as cis-regulatory elements
- 477 ChIP-seq data sets from ENCODE and the literature.
- 103 TFs with a JASPAR TF binding profile.
- 76,160,823 bp in TFBSs (~2% of the human genome).
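A base-pair total like the one above is only meaningful if each genomic position is counted once, even when sites from several data sets overlap. The slide does not say how the figure was derived, so the following is a sketch under that assumption: it merges overlapping intervals on one chromosome and sums the covered bases. The coordinates and the function name `covered_bp` are hypothetical.

```python
def covered_bp(intervals):
    """Total bases covered by the union of half-open (start, end) intervals."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total

# Hypothetical TFBS coordinates on a single chromosome.
toy_sites = [(100, 112), (105, 120), (300, 310)]
print(covered_bp(toy_sites))
```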
Somatic mutations in B-cell lymphomas
- WGS (normal and tumour cells):
  - cohort 1: 40 diffuse large B-cell lymphomas
  - cohort 2: 44 B-cell lymphomas of mixed histology
  Morin et al., Blood, 2013; Richter et al., Nat. Genetics, 2012
- RNA-seq for the cancer samples
- 406,611 SNVs and 15,739 indels in cohort 1 samples
- 282,636 SNVs and 8,080 indels in cohort 2 samples
Promoters are frequent targets of cis-regulatory mutations
[Figure: genome-wide plots over chromosomes 1 to 22, X, and Y. Panel A: Cohort 1. Panel B: Cohort 2. p = 1.16 x 10^-75; p = 3.28 x 10^-156.]
Genes labelled in the first panel: HIST1H1B, ST6GAL1, TMSB4X, ZFP36L1, NEDD9, BCL7A, RHOH, BIRC3, CIITA, IGLL5, BTG2, SGK1, CD74, BCL2, BCL6.
Genes labelled in the second panel: HIST1H1C, HIST1H1E, TMSB4X, ZFP36L1, BCL2L11, BZRAP1, DNMT1, ZNF860, NCOA3, FOXO1, BACH2, CXCR4, DUSP2, RFTN1, TCL1A, BCL7A, SOCS1, SEPT9, P2RX5, S1PR2, RHOH, EPS15, BIRC3, CIITA, DTX1, IGLL5, BTG2, BTG1, SGK1, CD74, CD83, BCL2, BCL6, PIM1, MYC, B2M, IRF1, IRF4, LTB, ID3.
- 75% of the genes previously described
- 13 new genes of interest frequently targeted in their promoters, 6 of them exclusively mutated in promoters.
Promoters of apoptotic, B-cell, and cancer pathway genes are frequent targets of cis-regulatory mutations
[Figure: enriched terms (FDR < 0.05) from GO Biological Process, OMIM, and KEGG Pathway.]
Terms shown: lymphoma; small cell lung cancer; apoptosis; regulation of B cell proliferation; positive regulation of B cell proliferation; lymphocyte differentiation; negative regulation of lymphocyte apoptotic process; leukocyte activation; negative regulation of B cell apoptotic process; regulation of B cell activation; leukocyte differentiation; positive regulation of lymphocyte proliferation; regulation of cell growth; lymphocyte activation; T cell differentiation; positive regulation of B cell activation; negative regulation of intracellular signal transduction; positive regulation of leukocyte proliferation; regulation of type 2 immune response; regulation of B cell apoptotic process; negative regulation of T cell differentiation; positive regulation of mononuclear cell proliferation.
Landscape of mutations and altered gene expression
[Figure, panels A to C. Legend: mutation, TFBS, protein-coding exon, TSS. xseq input with mutation categories: protein coding (PC); PC and TFBS; PC and disrupting TFBS; disrupting TFBS; TFBS. Sample labels: 31 IDs of the form SA32xxxx.]
Genes shown: MYC, EYS, TP53, PTPRD, SMARCA4, BCL6, RYR2, ITPKB, WWC1, FCGBP, SGK1, TBL1XR1, ID3, CSMD3, SIN3A, VPS13C, UNC5D, NBAS, MTOR, PPP1R16B, USP25, ASCC3, GPHN, DHX35, PEX2, XRCC4, PXDN, JRKL, WHSC1L1, FBXW11, SRP72, CRIM1, FOXO1, DGKD, PHIP, PYGL, USP15, BRD2, C2CD3, LMO4, FMN2, SRFBP1, N4BP2, CCNG1, RHOA, DUSP2, TGFBR2, ARRDC3, CADPS2, PIM1, STIM2, GNA13.
Ding et al., Nature Communications, 2015.
Mathelier et al., Genome Biology, 2015.
Landscape of mutations and altered gene expression
[Figure: pathway labels.]
- chronic myeloid leukemia; ErbB signaling pathway; acute myeloid leukemia; pancreatic cancer; prostate cancer; endometrial cancer; glioma; ECM receptor interaction; focal adhesion; epithelial cell signaling in Helicobacter pylori infection; renal cell carcinoma; small cell lung cancer; oxidative phosphorylation; colorectal cancer
- RB in Cancer; Integrated Breast Cancer Pathway; Androgen receptor signaling pathway; Focal Adhesion; EGF/EGFR Signaling Pathway; Integrated Pancreatic Cancer Pathway; Signaling Pathways in Glioblastoma; B Cell Receptor Signaling Pathway; Integrin-mediated Cell Adhesion; IL-4 Signaling Pathway; PDGF Pathway; Cardiac Hypertrophic Response; IL-2 Signaling Pathway; MAPK Signaling Pathway; Leptin signaling pathway; AGE/RAGE pathway; Type II interferon signaling; IL-3 Signaling Pathway; Oncostatin M Signaling Pathway; Alpha 6 Beta 4 signaling pathway
Summary
- We analyzed ~700,000 somatic mutations from 84 B-cell lymphoma samples
- We characterized a set of cis-regulatory elements from ChIP-seq
- Cis-regulatory mutations are enriched in promoter regions of genes involved in apoptosis or growth/proliferation
- We combined gene expression and mutation data from the coding and non-coding spaces
- We highlight candidate regulatory-disrupting variations dysregulating the gene expression program in cancer pathways
The next generation of TFBS prediction
[Figure: transcription factors, TSS, gene. Adapted from Kelvin Song's work on Wikimedia Commons.]
Dinucleotide dependencies
PWMs perform well but cannot model the dinucleotide dependencies shown in:
- crystal structures of TF-DNA complexes (Luscombe et al., 2001)
- biochemical studies of specific proteins (Man and Stormo, 2001; Bulyk et al., 2002; Berger et al., 2006)
- statistical analyses of TRANSFAC and JASPAR TFBSs (Barash, 2003; Tomovic and Oakeley, 2007; Zhou and Liu, 2004)
- quantitative analysis of protein binding microarray data (Zhao et al., 2012)
Results from Zhao, Ruan, Pandey and Stormo, 2012:
- Interactions between neighbouring bases are stronger than others
- Improvements from considering dinucleotide dependencies are usually slight, with some significant exceptions
- Their method is not applicable to ChIP-seq data
Transcription Factor Flexible Models (TFFMs) for modeling TFBSs
1st-order HMM:
- A state models the background (bg)
- Each position in the motif is modeled by a state
- Parametrizable transition probabilities for staying in the background (bg/bg) and switching from background to foreground (bg/fg)
- States emit a nucleotide depending on the previous one
[Figure: HMM with a background state BG (emissions E0, transitions bg/bg and bg/fg) followed by states for positions 1 to n (emissions E1 to En), with transitions labelled 1 between consecutive positions. Example sites: AAACAGAT, TATCTGTT, GAGCGGGT, CACCCGCT.]
Mathelier and Wasserman, PLOS Comp. Biol., 2014.
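To make the idea of emitting a nucleotide depending on the previous one concrete, here is a simplified sketch that estimates position-specific conditional probabilities P(nucleotide at position i | nucleotide at position i-1) from aligned sites. It is not the TFFM implementation: it has no background state, no bg/fg transitions, and no HMM training. The pseudocount and the function names are assumptions, and the four 8-mers are the example sites from the slide's figure.

```python
import math

NUCLEOTIDES = "ACGT"

def train_first_order(sites, pseudocount=1.0):
    """Estimate P(nt) at position 0 and P(nt_i | nt_{i-1}) for later positions."""
    length = len(sites[0])
    start_counts = {nt: pseudocount for nt in NUCLEOTIDES}
    for site in sites:
        start_counts[site[0]] += 1
    start_total = sum(start_counts.values())
    start = {nt: count / start_total for nt, count in start_counts.items()}

    transitions = []  # transitions[i - 1][previous][current] for position i
    for i in range(1, length):
        counts = {prev: {nt: pseudocount for nt in NUCLEOTIDES} for prev in NUCLEOTIDES}
        for site in sites:
            counts[site[i - 1]][site[i]] += 1
        probs = {}
        for prev in NUCLEOTIDES:
            row_total = sum(counts[prev].values())
            probs[prev] = {nt: counts[prev][nt] / row_total for nt in NUCLEOTIDES}
        transitions.append(probs)
    return start, transitions

def log_likelihood(model, site):
    """Natural-log probability of a site under the first-order model."""
    start, transitions = model
    total = math.log(start[site[0]])
    for i in range(1, len(site)):
        total += math.log(transitions[i - 1][site[i - 1]][site[i]])
    return total

sites = ["AAACAGAT", "TATCTGTT", "GAGCGGGT", "CACCCGCT"]
model = train_first_order(sites)
print(log_likelihood(model, "AAACAGAT"), log_likelihood(model, "TTTTTTTT"))
```

Unlike a PWM, where each column is scored independently, the probability of a nucleotide here changes with its left neighbour, which is the dependency the first-order model is meant to capture.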
TFFMs workflow from ChIP-seq data
Sequences:
>HNF4A 1
AGTTCAAAGTTCA
>HNF4A 2
AGTCCAAAGTTCA ...
>HNF4A 73554
CTTGGAACCGGGG
>HNF4A 73555
GGCAAGGTTCATA
[Figure: workflow from sequences to TFFMs (HMM diagram with a background state and states for positions 1 to n) to logos (panels A to C; positions 1 to 14, y-axis in bits from 0 to 2; panel C shows positions 10 to 13).]
Mathelier and Wasserman, PLOS Comp. Biol., 2014.
TFFMs are informative
TFFMs allow the analysis of neighbouring dinucleotide dependencies:
[Figure: TFFM logos over positions 1 to 14, y-axis in bits from 0 to 2.]
Mathelier and Wasserman, PLOS Comp. Biol., 2014.
TFFMs perform better than weight matrices
[Figure: AUC ratio (y-axis, 0.70 to 1.00) over 96 ChIP-seq data sets for 1st-order TFFM, detailed TFFM, PWM, and DWM; groups labelled TFFMs, Similar, and WMs.]
Mathelier and Wasserman, PLOS Comp. Biol., 2014.
Allowing for flexible length motifs: JunD
[Figure: sensitivity versus specificity curves (both axes 0 to 100) for 1st-order HMM, detailed HMM, PFM, DWM, flexible 1st-order HMM, flexible detailed HMM, and GLAM2; logo over positions 1 to 14 annotated 70% and 30%.]
Method                    AUC
1st-order TFFM            65.41
PWM                       64.47
DWM                       64.92
flexible 1st-order HMM    71.57
Mathelier and Wasserman, PLOS Comp. Biol., 2014.
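The AUC values in the table summarise sensitivity/specificity curves like the one on this slide. As a reminder of what such a number measures, the sketch below computes a ROC AUC as the probability that a randomly chosen positive (bound) sequence scores higher than a randomly chosen negative (background) one, with ties counted as one half. The scores are invented toy values and the function name is an assumption; this is not the evaluation code behind the reported results.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positive_scores) * len(negative_scores))

# Invented best-hit scores for bound and background sequences.
toy_positive = [0.9, 0.8, 0.4]
toy_negative = [0.7, 0.3, 0.2]
print(roc_auc(toy_positive, toy_negative))
```

On this 0 to 1 scale, 0.5 corresponds to random ranking; the table reports the same kind of quantity, apparently as a percentage.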
The TFFM framework
The implemented Python framework (http://cisreg.ca/TFFM/doc/) is based on the continuously maintained GHMM library (A. Schliep's group) and allows:
- constructing TFFMs starting from a set of ChIP-seq sequences
- predicting TFBSs within a set of input sequences using a set of TFFMs
- constructing logos associated with a TFFM
Transcription Factor Flexible Models
- Pros:
  - Consider dinucleotide composition
  - Are able to model flexible-length TFBSs
  - Better predict TFBSs than PWMs overall
  - Generate scores consistent with experimentally measured DNA-protein binding affinities
  - Can easily compute probabilities of occupancy
- Cons:
  - Need larger data sets than PWMs to train the underlying HMM
  - Need a threshold to be defined on the TFFM score when making predictions
  - Sequence-based only
DNA shape features
The DNAshape tool predicts four DNA shape features in a high-throughput manner.
The DNA shape features considered are:
- Minor Groove Width (MGW)
- Roll
- Propeller Twist (ProT)
- Helix Twist (HelT)
T. Zhou et al., Nucl. Acids Res., 2013.
Combining DNA sequence and structure to predict TFBSs
Studies have shown the importance of DNA shape for modeling TFBSs from:
- SELEX-seq experiments.
- Protein-binding microarray experiments.
- BunDLE-seq experiments.
N. Abe et al., Cell, 2015. T. Zhou et al., PNAS, 2015. M. Levo et al., Genome Res., 2015.
Aims of our study:
- Construct computational models from large-scale in vivo data (ChIP-seq) by combining DNA sequence and shape features.
- Show TFBS prediction improvements on in vivo data.
- Analyze whether DNA shape-induced improvements are TF family specific.
Combining DNA sequence and shape features at TFBSs
[Figure: for a hit of n nucleotides (example: A G A A G C C A G A A A A G G C A C C C A, araTha10 Chr1: 27,243,678 - 27,243,702), the feature vector has 8n + 1 features: the PSSM / TFFM hit score plus, at each position, MGW, 2nd order MGW, ProT, 2nd order ProT, Roll, 2nd order Roll, HelT, and 2nd order HelT. Axis ranges: MGW 2.5 to 6.2 Å; 2nd order MGW 0 to 0.6; ProT -16.5° to 0°; 2nd order ProT 0 to 0.8; Roll -8.6° to 8.6°; 2nd order Roll 0 to 0.4; HelT 31° to 38°; 2nd order HelT 0 to 0.7.]
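The feature vector on this slide has 8n + 1 entries for a hit of n nucleotides: one hit score, plus n values for each of the four shape features and n values for each of their second-order versions. The sketch below shows only that bookkeeping. The numeric values are made up, the helper name `build_feature_vector` is hypothetical, and how the shape and second-order values are obtained (for example from the DNAshape tool) is deliberately left outside the sketch.

```python
SHAPE_FEATURES = ("MGW", "ProT", "Roll", "HelT")

def build_feature_vector(hit_score, first_order, second_order):
    """Concatenate the hit score with per-position shape values (8n + 1 entries)."""
    n = len(first_order[SHAPE_FEATURES[0]])
    vector = [hit_score]
    for name in SHAPE_FEATURES:
        # Same order as the slide: each feature, then its 2nd-order version.
        for values in (first_order[name], second_order[name]):
            if len(values) != n:
                raise ValueError("every feature needs one value per nucleotide")
            vector.extend(values)
    return vector

# Made-up values for a hit of n = 3 nucleotides.
toy_first = {
    "MGW": [4.1, 5.0, 3.8],
    "ProT": [-7.2, -10.4, -3.1],
    "Roll": [1.5, -2.0, 4.3],
    "HelT": [34.0, 36.2, 32.5],
}
toy_second = {
    "MGW": [0.2, 0.5, 0.1],
    "ProT": [0.3, 0.6, 0.2],
    "Roll": [0.1, 0.3, 0.2],
    "HelT": [0.4, 0.5, 0.3],
}
vector = build_feature_vector(0.87, toy_first, toy_second)
print(len(vector))
```

A vector of this form can then be handed to any standard classifier to discriminate bound from unbound sequences; the choice of classifier is not stated on the slide.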
Overview of the ChIP-seq data sets considered
400 ENCODE ChIP-seq experiments are considered; the ChIP'ed TFs come from 24 TF families.
Family                         # exp
BetaBetaAlpha-zinc Finger      146
Helix-Loop-Helix               56
Leucine Zipper                 55
ETS                            21
Stat                           13
GATA                           12
Rel                            10
E2F                            10
NFY CCAAT-binding              8
Hormone-nuclear Receptor       8
Forkhead                       8
High Mobility Group (Box)      8
Homeodomain                    7
MADS                           7
RFX                            5
TATA-binding                   5
NRF                            5
IRF                            4
Other                          4
STAT                           4
Arid                           2
HSF                            1
THAP                           1
Considering DNA shape features improves predictive power
[Figure: panels A to D.]
DNA shape features are most beneficial for the E2F and MADS families.
DNA shape features at TFBS flanking sequences further improve discriminative power
[Figure: panels A and B.]
Summary
- Our analyses of ChIP-seq data represent the in vivo counterpart of the published in vitro studies.
- We show that combining DNA sequence and shape features improves the prediction of TFBSs within ChIP-seq peaks.
- We highlight that TFs from the E2F and MADS families benefit most from incorporating DNA shape information to predict TFBSs.
- We further improve predictive power when incorporating DNA shape features at sequences flanking TFBSs.
Acknowledgements
Wyeth Wasserman, Allen Zhang, David Arenillas, Rebecca Worsley-Hunt, all the Wasserman lab
Sohrab Shah, Calvin Lefebvre, Jiarui Ding
Boris Lenhard, Ge Tan
Albin Sandelin, Xiaobei Zhao, Sandelin lab
Remo Rohs, Beibei Xin, Tsu-Pei Chiu, Lin Yang
François Parcy
Centre for Molecular Medicine and Therapeutics
Thank you