supplementary methods, figures, and tables10.1186/s12859-020-036… · figures, and tables maria...


Supplementary Methods, Figures, and Tables

Maria Osmala and Harri Lähdesmäki


List of Figures

S1 The coverage matrices and the aggregate patterns of the 15 chromatin features at the 1000 promoter samples. The coverage matrices were visualised as heatmaps together with the aggregate patterns illustrated above the heatmaps. The data originated from the K562 cell line, and the feature patterns were extracted in a genomic window of length 4 kb centred at the promoter anchor points (TSS) indicated by the dashed line and the coordinate 0. The resolution (bin size) of the data was 100 bp. The promoters are unoriented, i.e., the direction of transcription from TSS is not utilised to direct the coverage patterns.

S2 The coverage matrices and the aggregate patterns of the 15 chromatin features at the 1000 pure random samples. The coverage matrices were visualised as heatmaps together with the aggregate patterns illustrated above the heatmaps. The data originated from the K562 cell line, and the feature patterns were extracted in a genomic window of length 4 kb centred at the anchor points of the random regions, indicated by the dashed line and the coordinate 0. The resolution (bin size) of the data was 100 bp.

S3 The coverage matrices and the aggregate patterns of the 15 chromatin features at the 1000 random regions with a signal. The coverage matrices were visualised as heatmaps together with the aggregate patterns illustrated above the heatmaps. The data originated from the K562 cell line, and the feature patterns were extracted in a genomic window of length 4 kb centred at the anchor points of the random regions, indicated by the dashed line and the coordinate 0. The resolution (bin size) of the data was 100 bp.

S4 The normalised frequencies of the enhancer lengths. The enhancers were predicted in the GM12878 cell line by PREPRINT and RFECS with the thresholds of 0.5 and 0.75. For each method and threshold, the frequencies were divided by the total number of regions predicted as enhancers. The regions were formed by combining the subsequent enhancer predictions into a single region.

S5 The proportions of the genome-wide enhancer predictions overlapping the varying number of TRF ChIP-seq peaks in the GM12878 cell line. The number of enhancers in each comparison is shown above the figure. In comparison a, the number of enhancers was the number of enhancers predicted by RFECS with the threshold of 0.25, and in comparison b, the number of enhancers was the minimum number of enhancers predicted by PREPRINT methods with their 1% FPR thresholds estimated on the K562 CV data.

S6 The validation rate of the genome-wide enhancer predictions obtained by the different methods and thresholds in the GM12878 cell line. An enhancer prediction was validated if the 2 kb prediction window overlapped with at least 1 bp of at least one TRF ChIP-seq peak.


S7 The unique and overlapping genome-wide enhancer predictions obtained by different methods in the GM12878 cell line. In comparisons a and c, the PREPRINT predictions were obtained by the ML approach, and in comparisons b and d, the predictions were obtained by the Bayesian approach. The overlap between the PREPRINT, RFECS and ChromHMM predictions was quantified as the number of enhancers. In each comparison, the number of enhancers predicted by PREPRINT and RFECS was equal. The numbers were: a 17746, b 17746, c 49699, and d 49699. Inside every region or intersection, the number of enhancers in the given set is indicated together with the percentage of validated enhancers in the set. The areas of the intersection sets are not proportional to the number of overlapping regions due to the asymmetry of overlaps.

S8 The unique and overlapping genome-wide enhancer predictions obtained by PREPRINT and RFECS in the K562 cell line (a and b) and in the GM12878 cell line (c and d). In each comparison, the number of enhancers predicted by PREPRINT and RFECS was equal. The numbers were: a the minimum number of enhancers predicted by PREPRINT with the 1% FPR threshold in the K562 cell line (15531), b the number of enhancers predicted by RFECS with the threshold of 0.25 in the K562 cell line (35089), c the minimum number of enhancers predicted by PREPRINT with the 1% FPR threshold in the GM12878 cell line (22088), and d the number of enhancers predicted by RFECS with the threshold of 0.25 in the GM12878 cell line (49699). Inside every region or intersection, the number of enhancers in the given set is indicated together with the percentage of validated enhancers in the set. The areas in the intersection sets are not proportional to the number of overlapping regions due to the asymmetry of the overlaps.

List of Tables

S1 The classification performance (AUC) of PREPRINT and RFECS in the 5-fold CV data set from the K562 cell line and the test data from the GM12878 cell line. The methods were trained on data containing either the pure random regions or the random regions with a signal. For RFECS, the AUC values were not computed on the K562 CV data. The method with the best performance on data from different cell lines was indicated with the bold font.

S2 The number of genome-wide enhancers predicted by different methods and thresholds.


Supplementary Methods

This section briefly describes the five modules of the PREPRINT procedure illustrated in Figure 7. The input data for PREPRINT were the chromatin feature data obtained, e.g., by the ChIP-seq technique. In the chromatin immunoprecipitation (ChIP) experiment for a population of cells, the protein structures of chromatin were first covalently attached to the DNA, and the DNA was cut into small fragments, for example, by sonication. Sonication could not cut the DNA protected by the protein structures, for example, the DNA wrapped around the nucleosomes with a histone modification H3K4me1 illustrated in Figure 7a. Secondly, the DNA fragments attached to the chromatin feature of interest, e.g., H3K4me1, were enriched in the cell sample with a feature-specific antibody. Lastly, the fragments enriched by ChIP were sequenced to produce short reads (black H3K4me1 reads in Figure 7a). Moreover, a set of control sequence reads was produced; these were reads obtained from the same cell population without the ChIP step to provide an estimate of the background coverage signal (black control reads in Figure 7a).

In the preprocessing module depicted in Figure 7a, the ChIP-seq and control reads were aligned back to the human reference genome. As an example, Figure 7a illustrates reads aligning to a 20 kb genomic region in the first human chromosome. The aligned ChIP-seq and control reads were processed to obtain the genome-wide read coverage signal, indicated in red for the chromatin feature H3K4me1 in Figure 7a. The enrichment of the ChIP-seq reads, and hence the high coverage signal intensity, coincided with the chromosomal location of the chromatin feature. ChIP-seq and related techniques can quantify multiple different chromatin features for the same cell line. In Figure 7a and in the following sections, the number of chromatin features is denoted as K. The number of chromatin features in the ENCODE data for cell lines K562 and GM12878 was 15.

In the second step of the preprocessing module (Figure 7b), the coverage signals were extracted at the training data samples to build the training data coverage matrices. The training data samples included enhancers (Class 1), promoters (Class 0), and random genomic locations (Class 0). For more information on the definition of the training data samples, see the Section Definition of the training data. For example, the training data enhancers were coordinates of single base pairs in the genome. These coordinates are referred to as anchor points, and they are indicated as the dashed line and the coordinate 0 in Figure 7b. Centred at the anchor point, a genomic window of size 2 kb was defined, and the window was divided into 100 bp bins. Adopting the 100 bp bin size means that the resolution of the coverage signal was 100 bp. For every bin along the windows, the coverage signals were extracted, resulting in a matrix of size n × d. Here, n is the number of the training data enhancer samples, and d is the length of the row vector. With the genomic window of 2 kb and the bin size of 100 bp, the length d is 2000 / 100 = 20. The row vector of the matrix is referred to as a feature pattern vector for a given sample, and the column-wise average signal of the matrix is referred to as an aggregate pattern of, for example, the enhancer class. The feature patterns in the matrix were visualised as a heatmap together with the aggregate pattern to reveal the biological properties of enhancers. The visualisations were generated by the functions in the EnrichedHeatmap Bioconductor package [1]. The leftmost heatmap and the aggregate pattern in Figure 7b represent the coverage matrix of the chromatin feature H3K4me1 at the training data enhancers. The heatmaps and aggregate profiles for all 15 chromatin features at the training enhancers are illustrated in Figure 8.
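As a rough illustration of how a coverage matrix and an aggregate pattern could be assembled, the following Python sketch bins a per-base coverage track around each anchor point. The function names, the per-base coverage list, and the choice of summing the signal within each bin are assumptions made for this example; they are not taken from the PREPRINT implementation.

```python
from typing import List, Sequence

BIN_SIZE = 100      # bp, the resolution stated in the text
WINDOW_SIZE = 2000  # bp, the window size stated in the text


def feature_pattern(coverage: Sequence[float], anchor: int,
                    window: int = WINDOW_SIZE,
                    bin_size: int = BIN_SIZE) -> List[float]:
    """Bin the coverage in a window centred at `anchor` (assumption: sum per bin)."""
    start = anchor - window // 2
    if start < 0 or start + window > len(coverage):
        raise ValueError("window extends beyond the coverage track")
    return [sum(coverage[s:s + bin_size])
            for s in range(start, start + window, bin_size)]


def coverage_matrix(coverage: Sequence[float],
                    anchors: Sequence[int]) -> List[List[float]]:
    """Stack one feature pattern per anchor point into an n x d matrix."""
    return [feature_pattern(coverage, a) for a in anchors]


def aggregate_pattern(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise average of the coverage matrix."""
    n = len(matrix)
    return [sum(column) / n for column in zip(*matrix)]


# Toy usage: a flat track with two anchors gives a 2 x 20 matrix.
track = [1.0] * 5000
matrix = coverage_matrix(track, anchors=[1000, 3000])
pattern = aggregate_pattern(matrix)
```

With a 2 kb window and 100 bp bins, each row has d = 20 entries, matching the dimensions given above.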

The PREPRINT classifier is based on supervised learning. The classifier should learn to distinguish the enhancers (Class 1) from the non-enhancer regions (Class 0). As examples of the non-enhancer class, promoters and random genomic regions were utilised. For details on defining the training data promoters and random genomic locations, see the Section Definition of the training data. The coverage matrices of H3K4me1 at the training data promoters and random regions are again illustrated as heatmaps together with the aggregate patterns in Figure 7b. The heatmaps and aggregate patterns for all 15 chromatin features at the training promoters and random regions are illustrated in Supplementary Figures S1, S2, and S3, Additional File 1.

According to the aggregate pattern of the chromatin feature MNase-seq at the enhancers presented in Figure 8, there were two well-positioned nucleosomes flanking the anchor points of the enhancers, and the nucleosomes were more mobile when moving further from the anchor points. Co-localising with the two well-positioned nucleosomes, several histone modifications, such as H3K4me1, formed bimodal peaks. In addition to enhancers, these characteristic aggregate patterns of different chromatin features can also be defined for the promoters and random regions (see Supplementary Figures S1, S2, and S3, Additional File 1). The aggregate patterns of the chromatin feature H3K4me1 for the enhancers, promoters and random regions are demonstrated in the statistical modelling module of PREPRINT (Figure 7c). The aggregate patterns are indicated as $x^{\mathrm{enh}}$, $x^{\mathrm{prom}}$, and $x^{\mathrm{rand}}$. Given the characteristic enhancer, promoter, or random aggregate pattern, scaled by the scaling factor $\alpha_i^{\mathrm{enh}}$, $\alpha_i^{\mathrm{prom}}$, or $\alpha_i^{\mathrm{rand}}$, respectively, the elements of the individual samples were assumed to follow independent Poisson distributions (see the likelihood function in Figure 7c). In Figure 7c, on the left side, a training data enhancer ($y_i^{\mathrm{enh}}$) and a non-enhancer ($y_i^{\mathrm{rand}}$) are exemplified. The aggregate patterns multiplied by the scaling factor ($\alpha_i^{1} x^{\mathrm{enh}}$) are depicted as black over the purple training data enhancer or the non-enhancer sample. The fit of the sample to the aggregate patterns is an essential concept of the PREPRINT procedure. To quantify the fit between the aggregate pattern and the individual sample, two probabilistic distance measures were defined: PREPRINT ML and PREPRINT Bayesian.
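To make the notion of fit concrete, the sketch below evaluates the log-likelihood of one sample under independent Poisson distributions whose rates are a scaled aggregate pattern. It assumes that the binned coverage values are non-negative counts and that every scaled rate is positive; the function name and the toy numbers are hypothetical and only illustrate the model described above.

```python
import math
from typing import Sequence


def poisson_loglik(y: Sequence[float], x: Sequence[float], alpha: float) -> float:
    """Log-likelihood of sample y given rates alpha * x, one Poisson per bin."""
    total = 0.0
    for y_j, x_j in zip(y, x):
        rate = alpha * x_j
        if rate <= 0:
            raise ValueError("Poisson rates must be positive")
        # log of rate**y_j * exp(-rate) / y_j!
        total += y_j * math.log(rate) - rate - math.lgamma(y_j + 1)
    return total


# Toy usage: a sample shaped like the pattern fits better than a flat one.
pattern = [1.0, 4.0, 1.0]
shaped = [2, 8, 2]
flat = [4, 4, 4]
fit_shaped = poisson_loglik(shaped, pattern, alpha=2.0)
fit_flat = poisson_loglik(flat, pattern, alpha=2.0)
```

In this toy case `fit_shaped` is larger than `fit_flat`, which is the behaviour a pattern-based score should have.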

Computing the probabilistic distance measures or scores and building the final training data matrix were the steps of the third module of the PREPRINT procedure (Figure 7d). The distance measures were computed as follows. Firstly, the maximum likelihood (ML) estimates of the scaling parameters $\alpha_i$ were obtained for the training data enhancer samples. These ML estimates were employed to compute the probabilistic distance measures in the PREPRINT ML approach. Secondly, the ML estimates were utilised to fit Gamma distributions to obtain the (empirical) priors for the scaling parameters. The priors were employed to compute the probabilistic distance measure in the PREPRINT Bayesian approach. Finally, in the ML approach, the fit of the sample to the aggregate pattern was computed as a likelihood utilising the ML estimates, whereas in the Bayesian approach, the fit was computed as the posterior predictive values obtained by integrating the likelihood over the Gamma prior distribution. For each training data sample, the fit was computed between the sample and the enhancer, promoter, and random aggregate pattern. Therefore, the length of the final probabilistic score vector was 3. The probabilistic scores of PREPRINT ML or PREPRINT Bayesian were computed for all training data samples and K different chromatin features to obtain a final training data matrix of size 4n × 3K (Figure 7e). Performing the statistical modelling of the chromatin features (Figure 7d) and computing the probabilistic scores (Figure 7e) can be recognised as a means to reduce the dimensionality of the original, considerably high-dimensional ChIP-seq data of multiple chromatin features.
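Both quantities have simple closed forms under the Poisson model, which the following sketch illustrates. For independent Poisson bins with rates $\alpha x_j$, the ML estimate of the scaling factor is the ratio of the summed sample to the summed pattern, and integrating the likelihood over a Gamma prior is analytically tractable. The Gamma shape and rate parameterisation, the log scale, the dictionary keys, and all function names are assumptions of this example, not details confirmed by the text.

```python
import math
from typing import Dict, List, Sequence, Tuple


def ml_scale(y: Sequence[float], x: Sequence[float]) -> float:
    """ML estimate of alpha for y_j ~ Poisson(alpha * x_j): sum(y) / sum(x)."""
    return sum(y) / sum(x)


def log_marginal(y: Sequence[float], x: Sequence[float],
                 shape: float, rate: float) -> float:
    """Log of the Poisson likelihood integrated over alpha ~ Gamma(shape, rate)."""
    s, big_x = sum(y), sum(x)
    # Terms that do not involve alpha: prod_j x_j**y_j / y_j!
    const = sum(y_j * math.log(x_j) - math.lgamma(y_j + 1)
                for y_j, x_j in zip(y, x))
    return (const + shape * math.log(rate) - math.lgamma(shape)
            + math.lgamma(shape + s) - (shape + s) * math.log(rate + big_x))


def bayesian_scores(y: Sequence[float],
                    patterns: Dict[str, Sequence[float]],
                    priors: Dict[str, Tuple[float, float]]) -> List[float]:
    """One score per aggregate pattern, giving a vector of length 3."""
    return [log_marginal(y, patterns[c], *priors[c])
            for c in ("enh", "prom", "rand")]


# Toy usage with hypothetical patterns and priors.
patterns = {"enh": [1.0, 4.0, 1.0], "prom": [3.0, 3.0, 3.0], "rand": [1.0, 1.0, 1.0]}
priors = {"enh": (2.0, 1.0), "prom": (2.0, 1.0), "rand": (2.0, 1.0)}
scores = bayesian_scores([2, 8, 2], patterns, priors)
```

Concatenating such length-3 vectors over the K chromatin features would give the 3K columns of the final training data matrix.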

The fourth module in the PREPRINT procedure was to employ the final training data matrix to train a support vector machine (SVM) classifier with a Gaussian kernel (Figure 7f). Finally, in the fifth module, the PREPRINT classifier was applied to predict enhancers along the whole human genome. The coverage signals at subsequent 2 kb sliding windows shifted by 100 bp along the human genome were extracted to compute the probabilistic scores (Figure 7g). The probabilistic scores were classified by PREPRINT, resulting in genome-wide prediction scores (Figure 7h). The genomic regions with a prediction score above a chosen threshold were predicted as enhancers. To pinpoint the exact location of the enhancer, a single window with the maximal prediction score within the larger prediction region was selected. The exact locations of enhancers are depicted as blue bars in Figure 7h. In the Methods section, the input data, the PREPRINT modules, and the data processing and analysis steps within each module are described in more detail.
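The last step, from window-wise prediction scores to enhancer locations, can be sketched as follows. The example assumes that the scores are given in genomic order, one per 2 kb window shifted by 100 bp, and that consecutive windows above the threshold form one prediction region; the function name and the returned tuple layout are hypothetical.

```python
from typing import List, Sequence, Tuple

STEP = 100     # bp shift between subsequent windows, as in the text
WINDOW = 2000  # bp window size, as in the text


def call_enhancers(scores: Sequence[float], threshold: float,
                   step: int = STEP,
                   window: int = WINDOW) -> List[Tuple[int, int, int]]:
    """Return (region_start, region_end, best_window_start) for each region."""
    regions: List[Tuple[int, int, int]] = []
    run: List[int] = []  # indices of consecutive windows above the threshold

    def close_run() -> None:
        best = max(run, key=lambda i: scores[i])  # window with the maximal score
        regions.append((run[0] * step, run[-1] * step + window, best * step))

    for i, score in enumerate(scores):
        if score > threshold:
            run.append(i)
        elif run:
            close_run()
            run = []
    if run:
        close_run()
    return regions


# Toy usage: two regions, each reported with its best-scoring window.
calls = call_enhancers([0.1, 0.6, 0.9, 0.7, 0.2, 0.8], threshold=0.5)
```

Here the first three windows above the threshold are merged into one region, and the window starting at 200 bp is reported as the enhancer location because it has the highest score.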

References

[1] Gu Z, Eils R, Schlesner M, Ishaque N. EnrichedHeatmap: an R/Bioconductor package for comprehensive visualization of genomic signal associations. BMC Genomics. 2018;19(1):234.


[Figure S1: heatmap and aggregate pattern panels for CTCF, H2AZ, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, RNAPOL2, DNase-seq, and MNase-seq; horizontal axis from -2 kb to +2 kb around the anchor point (0).]

Figure S1: The coverage matrices and the aggregate patterns of the 15 chromatin features at the 1000 promoter samples. The coverage matrices were visualised as heatmaps together with the aggregate patterns illustrated above the heatmaps. The data originated from the K562 cell line, and the feature patterns were extracted in a genomic window of length 4 kb centred at the promoter anchor points (TSS) indicated by the dashed line and the coordinate 0. The resolution (bin size) of the data was 100 bp. The promoters are unoriented, i.e., the direction of transcription from TSS is not utilised to direct the coverage patterns.


[Figure S2: heatmap and aggregate pattern panels for CTCF, H2AZ, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, RNAPOL2, DNase-seq, and MNase-seq; horizontal axis from -2 kb to +2 kb around the anchor point (0).]

Figure S2: The coverage matrices and the aggregate patterns of the 15 chromatin features at the 1000 pure random samples. The coverage matrices were visualised as heatmaps together with the aggregate patterns illustrated above the heatmaps. The data originated from the K562 cell line, and the feature patterns were extracted in a genomic window of length 4 kb centred at the anchor points of the random regions, indicated by the dashed line and the coordinate 0. The resolution (bin size) of the data was 100 bp.


[Figure S3: heatmap and aggregate pattern panels for CTCF, H2AZ, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1, RNAPOL2, DNase-seq, and MNase-seq; horizontal axis from -2 kb to +2 kb around the anchor point (0).]

Figure S3: The coverage matrices and the aggregate patterns of the 15 chromatin features at the 1000 random regions with a signal. The coverage matrices were visualised as heatmaps together with the aggregate patterns illustrated above the heatmaps. The data originated from the K562 cell line, and the feature patterns were extracted in a genomic window of length 4 kb centred at the anchor points of the random regions, indicated by the dashed line and the coordinate 0. The resolution (bin size) of the data was 100 bp.


Table S1: The classification performance (AUC) of PREPRINT and RFECS in the 5-fold CV data set from the K562 cell line and the test data from the GM12878 cell line. The methods were trained on data containing either the pure random regions or the random regions with a signal. For RFECS, the AUC values were not computed on the K562 CV data. The method with the best performance on data from different cell lines was indicated with the bold font.

| Method | Cell line | AUC, pure random regions | AUC, random regions with a signal |
|---|---|---|---|
| PREPRINT Bayesian | K562 | 0.994 | 0.989 |
| PREPRINT ML | K562 | 0.992 | 0.990 |
| PREPRINT Bayesian | GM12878 | 0.991 | 0.972 |
| PREPRINT ML | GM12878 | 0.990 | 0.972 |
| RFECS | GM12878 | 0.988 | 0.950 |

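Table S2: The number of genome-wide enhancers predicted by different methods and thresholds.

Column groups: (A) the threshold of 0.5; (B) the best operating point, or the threshold of 0.25 for RFECS; (C) the FPR 1% threshold. "all" and "without TSS" give the number of predicted enhancers.

| Method | Cell line | Random regions | A: all | A: without TSS | B: threshold | B: all | B: without TSS | C: threshold | C: all | C: without TSS |
|---|---|---|---|---|---|---|---|---|---|---|
| PREPRINT ML | K562 | pure | 94957 | 79109 | 0.57686 | 86783 | 72241 | 0.67285 | 76660 | 63839 |
| PREPRINT Bayesian | K562 | pure | 75410 | 60338 | 0.55273 | 70795 | 56675 | 0.69234 | 59087 | 47291 |
| RFECS | K562 | pure | 31331 | 25155 | 0.25 | 65515 | 51743 | | | |
| PREPRINT ML | K562 | with a signal | 51030 | 42657 | 0.54916 | 44750 | 37188 | 0.7141 | 28947 | 23691 |
| PREPRINT Bayesian | K562 | with a signal | 42844 | 35622 | 0.51043 | 41677 | 34634 | 0.78752 | 20184 | 16467 |
| RFECS | K562 | with a signal | 18960 | 15790 | 0.25 | 44536 | 35102 | | | |
| PREPRINT ML | GM12878 | pure | 102871 | 89101 | 0.57686 | 93094 | 80818 | 0.67285 | 81113 | 70671 |
| PREPRINT Bayesian | GM12878 | pure | 77304 | 66978 | 0.55273 | 72817 | 63200 | 0.69234 | 60830 | 53251 |
| RFECS | GM12878 | pure | 34619 | 30407 | 0.25 | 99013 | 86056 | | | |
| PREPRINT ML | GM12878 | with a signal | 49679 | 44651 | 0.54916 | 43975 | 39577 | 0.7141 | 28632 | 25915 |
| PREPRINT Bayesian | GM12878 | with a signal | 44684 | 40127 | 0.51043 | 43682 | 39219 | 0.78752 | 20605 | 18702 |
| RFECS | GM12878 | with a signal | 20071 | 17758 | 0.25 | 58600 | 49739 | | | |

ChromHMM predictions:

| Method | Cell line | all | without TSS |
|---|---|---|---|
| ChromHMM Weak Enhancer | K562 | 180471 | 176912 |
| ChromHMM Strong Enhancer | K562 | 69019 | 66888 |
| ChromHMM Weak Enhancer | GM12878 | 178474 | 175487 |
| ChromHMM Strong Enhancer | GM12878 | 64052 | 62599 |

The thresholds estimated for the GM12878 cell line:

| Method | Cell line | Random regions | B: threshold | B: all | B: without TSS | C: threshold | C: all | C: without TSS |
|---|---|---|---|---|---|---|---|---|
| PREPRINT ML | GM12878 | pure | 0.1753 | 169042 | 147212 | 0.83915 | 59170 | 52063 |
| PREPRINT Bayesian | GM12878 | pure | 0.16546 | 122545 | 105304 | 0.8303 | 47060 | 41642 |
| PREPRINT ML | GM12878 | with a signal | 0.23674 | 103059 | 93116 | 0.82949 | 18906 | 17190 |
| PREPRINT Bayesian | GM12878 | with a signal | 0.25712 | 84806 | 76480 | 0.83575 | 17059 | 15527 |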


[Figure S4: "Normalised frequency of the lengths of enhancers predicted by different methods". Horizontal axis: the length of the window with subsequent enhancer predictions; vertical axis: normalised frequency; methods: ChromHMM Weak Enhancer, ChromHMM Strong Enhancer, ML, Bayesian, RFECS; thresholds: 0.5 and 0.75.]

Figure S4: The normalised frequencies of the enhancer lengths. The enhancers were predicted in the GM12878 cell line by PREPRINT and RFECS with the thresholds of 0.5 and 0.75. For each method and threshold, the frequencies were divided by the total number of regions predicted as enhancers. The regions were formed by combining the subsequent enhancer predictions into a single region.


[Figure S5: "Proportions of enhancer predictions overlapping a varying number of TRF binding sites". Panel a: RFECS threshold 0.25, 49699 enhancers; panel b: FPR 1% threshold, 22088 enhancers. Horizontal axis: number of overlapping TRFs; vertical axis: frequency; methods: Bayesian, ML, RFECS.]

Figure S5: The proportions of the genome-wide enhancer predictions overlapping the varying number of TRF ChIP-seq peaks in the GM12878 cell line. The number of enhancers in each comparison is shown above the figure. In comparison a, the number of enhancers was the number of enhancers predicted by RFECS with the threshold of 0.25, and in comparison b, the number of enhancers was the minimum number of enhancers predicted by PREPRINT methods with their 1% FPR thresholds estimated on the K562 CV data.


[Figure S6: "Validation rate of enhancers obtained by different methods and thresholds". Horizontal axis: number of enhancers, 17746 (RFECS 0.5), 29189 (ML 0.5), 38048 (Bayesian 0.5), 49699 (RFECS 0.25), 175487/62599 (ChromHMM); vertical axis: validation rate; methods: ML, Bayesian, RFECS, random, Weak Enhancer, Strong Enhancer.]

Figure S6: The validation rate of the genome-wide enhancer predictions obtained by the different methods and thresholds in the GM12878 cell line. An enhancer prediction was validated if the 2 kb prediction window overlapped with at least 1 bp of at least one TRF ChIP-seq peak.
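The validation criterion in this caption amounts to an interval overlap test, sketched below. The example assumes half-open (start, end) coordinates on a single chromosome, and the function names are hypothetical; a real analysis would also need to match chromosomes and would typically use an interval index rather than this quadratic scan.

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int]  # half-open (start, end) in bp, single chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least 1 bp."""
    return a[0] < b[1] and b[0] < a[1]


def validation_rate(windows: Sequence[Interval],
                    peaks: Sequence[Interval]) -> float:
    """Fraction of prediction windows overlapping at least one peak."""
    if not windows:
        raise ValueError("no prediction windows given")
    validated = sum(any(overlaps(w, p) for p in peaks) for w in windows)
    return validated / len(windows)


# Toy usage: the first window shares 1 bp with a peak, the second only touches one.
rate = validation_rate(windows=[(0, 2000), (5000, 7000)],
                       peaks=[(1999, 2100), (7000, 7100)])
```

In the toy input, only the first window is validated, so the rate is one half.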


Figure S7: The unique and overlapping genome-wide enhancer predictions obtained by different methods in the GM12878 cell line. In comparisons a and c, the PREPRINT predictions were obtained by the ML approach, and in comparisons b and d, the predictions were obtained by the Bayesian approach. The overlap between the PREPRINT, RFECS and ChromHMM predictions was quantified as the number of enhancers. In each comparison, the number of enhancers predicted by PREPRINT and RFECS was equal. The numbers were: a 17746, b 17746, c 49699, and d 49699. Inside every region or intersection, the number of enhancers in the given set is indicated together with the percentage of validated enhancers in the set. The areas of the intersection sets are not proportional to the number of overlapping regions due to the asymmetry of overlaps.


Figure S8: The unique and overlapping genome-wide enhancer predictions obtained by PREPRINT and RFECS in the K562 cell line (a and b) and in the GM12878 cell line (c and d). In each comparison, the number of enhancers predicted by PREPRINT and RFECS was equal. The numbers were: a the minimum number of enhancers predicted by PREPRINT with the 1% FPR threshold in the K562 cell line (15531), b the number of enhancers predicted by RFECS with the threshold of 0.25 in the K562 cell line (35089), c the minimum number of enhancers predicted by PREPRINT with the 1% FPR threshold in the GM12878 cell line (22088), and d the number of enhancers predicted by RFECS with the threshold of 0.25 in the GM12878 cell line (49699). Inside every region or intersection, the number of enhancers in the given set is indicated together with the percentage of validated enhancers in the set. The areas in the intersection sets are not proportional to the number of overlapping regions due to the asymmetry of the overlaps.
