a survey of the approaches for identifying differential ... · pdf filea survey of the...

25
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/315612843 A survey of the approaches for identifying differential methylation using bisulfite sequencing data Article in Briefings in Bioinformatics · March 2017 DOI: 10.1093/bib/bbx013 CITATIONS 0 READS 11 4 authors: Some of the authors of this publication are also working on these related projects: disease subtype discovery based on integration of multiple types of omics data View project Adib Shafi Wayne State University 2 PUBLICATIONS 0 CITATIONS SEE PROFILE Cristina Mitrea Wayne State University 9 PUBLICATIONS 57 CITATIONS SEE PROFILE Tin Nguyen Wayne State University 16 PUBLICATIONS 18 CITATIONS SEE PROFILE Sorin Draghici Wayne State University 225 PUBLICATIONS 7,279 CITATIONS SEE PROFILE All content following this page was uploaded by Adib Shafi on 24 April 2017. The user has requested enhancement of the downloaded file.

Upload: tranthuan

Post on 01-Feb-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

Seediscussions,stats,andauthorprofilesforthispublicationat:https://www.researchgate.net/publication/315612843

Asurveyoftheapproachesforidentifyingdifferentialmethylationusingbisulfitesequencingdata

ArticleinBriefingsinBioinformatics·March2017

DOI:10.1093/bib/bbx013

CITATIONS

0

READS

11

4authors:

Someoftheauthorsofthispublicationarealsoworkingontheserelatedprojects:

diseasesubtypediscoverybasedonintegrationofmultipletypesofomicsdataViewproject

AdibShafi

WayneStateUniversity

2PUBLICATIONS0CITATIONS

SEEPROFILE

CristinaMitrea

WayneStateUniversity

9PUBLICATIONS57CITATIONS

SEEPROFILE

TinNguyen

WayneStateUniversity

16PUBLICATIONS18CITATIONS

SEEPROFILE

SorinDraghici

WayneStateUniversity

225PUBLICATIONS7,279CITATIONS

SEEPROFILE

AllcontentfollowingthispagewasuploadedbyAdibShafion24April2017.

Theuserhasrequestedenhancementofthedownloadedfile.

Page 2: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

A survey of the approaches for identifying differential methylation

using bisulfite sequencing data

Adib Shafi, Cristina Mitrea, Tin Nguyen, and Sorin Draghici

Abstract

[PLEASE READ THE PUBLISHED VERSION FROM THIS LINK. ]DNA methylation is an important epigenetic mechanism that plays a crucial role in cellular regulatory

systems. Recent advancements in sequencing technologies now enable us to generate high throughput methy-lation data and to measure methylation up to single base resolution. This wealth of data does not comewithout challenges, and one of the key challenges in DNA methylation studies is to identify the significantdifferences in the methylation levels of the base pairs across distinct biological conditions. Several compu-tational methods have been developed to identify differential methylation (DM) using bisulfite sequencingdata; however, there is no clear consensus among existing approaches. A comprehensive survey of theseapproaches would be of great benefit to potential users and researchers to get a complete picture of theavailable resources. In this paper, we present a detailed survey of 22 such approaches focusing on theirunderlying statistical models, primary features, key advantages and major limitations. Importantly, theintrinsic drawbacks of the approaches pointed out in this survey could potentially be addressed by futureresearch.

Keywords

DNA methylation; Epigenetic modification; Differentially methylated cytosines (DMCs); Differentially methy-lated regions (DMRs); Bisulfite sequencing

Introduction

Epigenetics is the field of study that provides information on how, where and when genes are switchedon and off inside a living cell. DNA methylation is an intensively studied and well understood epigeneticmechanism that plays a vital role in many processes [1]. Due to its role in regulating gene expression,DNA methylation is an important part of cellular processes such as cell development and differentiation.Furthermore, patterns of hypermethylation have been identified in human cancers, which can provide novelinsights into the development and progression of such complex diseases [2]. Specifically, in cancer, one of thecauses of silenced tumor suppressor genes is hypermethylation.

The most studied form of DNA methylation, known as 5-methylcytosine (5-mc), involves the addition ofa methyl group to the 5-carbon of the cytosine (C) base of a DNA strand. Although approximately only 5%of the cytosine bases in the human genome are methylated, cytosine (C) followed by a guanine (G), whichis known as a CpG site, is methylated 70 − 80% of the time [3, 4]. Methylation can also occur in non-CpGcontext, such as CHG and CHH sites (where H = C, T, or A), especially in plants and stem cells [5, 6]. Recentstudies have also shown that the Ten-Eleven translocation (TET) proteins are involved in oxidizing 5-mcinto 5-hydroxymethylcytosine (5-hmC), 5-formylcytosine (5-fC) and 5-carboxylcytosine (5-caC). However,the abundance level of these methylation variants (5-hmC, 5-fC, 5-caC) is very low compared to that of 5-mc [7]. Therefore, our survey focuses on 5-mc methylation in CpG context, considering most of the methodshave been developed for analyzing this type of epigenetic modification. When a CpG site is methylated in thepromoter regions, it typically represses the transcriptional activity of that region by restricting the binding

1

Page 3: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

of specific transcription factors (TFs). Alternatively, when a CpG site is unmethylated in promoter regions,it allows for the binding of those TFs [8, 9, 10]. Given its regulatory role in cellular activities, identifyingchanges in DNA methylation across multiple biological conditions is of great interest.

The availability of the reference genome and the advanced sequencing technologies have led to methodsthat provide high-resolution methylation profiles on a genome scale. Based on the resolution at which themethylation levels are measured, current sequencing-based technologies can be divided in two categories: (i)enrichment based approaches, and (ii) bisulfite sequencing based approaches [11, 12]. The former allows usto measure the methylation levels at 100 − 200 base resolution while the latter allows us to measure themethylation levels at single base resolution. One of the challenges in measuring genome-level methylationis the amount of biological material needed, which has only recently reached levels feasible for clinicalsamples [13]. Other challenges are related to processing data from new technologies and integrating themwith different types of data in a meaningful way to provide biological insights (e.g., methylation and geneexpression). In this review, we focus on bisulfite sequencing based approaches.

Within the past few years, many tools have been developed for differential methylation analysis usingbisulfite sequencing data (Figure 1), but only a few attempts have been made to provide a review of these ap-proaches. Robinson et al. [14] provides a mini review of the approaches that identify differential methylation,briefly discussing their methodologies and current challenges. This review not only includes the approachesthat use bisulfite sequencing data but also the approaches that use DNA methylation arrays (Illumina’s 27kor 450k) and enrichment assays (MeDIP-seq). Yet the number of approaches based on bisulfite sequencingdata and the number of features considered for each approach is low. Klein et al. [15] evaluates nine ap-proaches that can possibly be used for differential methylation analysis. However, the methods are limitedto the scope of analyzing differential methylation in predefined regions using only RRBS data. Among thenine approaches, only four of them are originally designed for analyzing differential methylation. The otherfive approaches are general approaches that can be applied for RNA-Seq and gene expression data. Yu andSun. [16] evaluates only five approaches developed for the purpose of identifying DMRs. Sun et al. [17] brieflysummarizes the commonly used platforms for methylation profiling, data preprocessing techniques, and sta-tistical approaches for differential methylation analysis. This review provides a well organized conceptualoverview of approaches that identify differential methylation using bisulfite sequencing data. However, thissurvey only includes seven such approaches. In summary, all previous attempts of reviewing the approachesthat identify differential methylation using bisulfite sequencing data are limited in at least one of the fol-lowing aspects: i) the total number of approaches covered in the survey (fewer than 10 methods reviewed),ii) the applicability (e.g. only methods dealing with RRBS data), or iii) a small number of biological fea-tures considered. In order to address these issues, a comprehensive survey of the approaches that identifydifferential methylation using bisulfite sequencing data is greatly needed.

In this paper, we review 22 different approaches for differential methylation analysis, including approachesfor identifying DMCs, DMRs (both predefined and de novo regions), and methylation patterns using bisulfitesequencing data (WGBS and RRBS). We classify these approaches into seven different categories based onthe primary concepts and key techniques used to identify DM. In addition, we provide a short overviewof several general hypothesis based tests, which can also be applied for differential methylation analysis.In the following sections, first, we will provide a brief overview of bisulfite sequencing technology and theworkflow of analyzing bisulfite sequencing data. Next, we will provide a systematic review of the approacheshighlighting their pros and cons, discussing their key characteristics.

Bisulfite sequencing

The gold standard for measuring cytosine methylation is bisulfite sequencing, which has the advantage ofmeasuring methylation at a single base resolution. In this technique, DNA is treated with sodium bisulfite,which deaminates unmethylated cytosines (C) to uracils (U) leaving the methylated cytosines unchanged.Uracils are read as thymines (T) during the sequencing step. Methylation level at each CpG site is estimatedby simply counting the ratio of C/(C+T). Thus, this process allows sequence specific discrimination betweenmethylated and unmethylated CpG sites [40].

Several technologies have been developed for measuring DNA methylation based on bisulfite sequencingconversion. The most comprehensive protocol among them is whole genome bisulfite sequencing (WGBS),

2

Page 4: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

2011

QD

MR

[18]

CpG

MP

s[1

9]

BSm

oot

h[2

0],m

ethylK

it[2

1]

2012

eDM

R[2

2],C

OH

CA

P[2

3]

BiS

eq[2

4]

2013

Com

Met

[25]

DSS

[26]

,M

OA

BS

[27]

DM

AP

[28]

met

hylS

ig[2

9]

RA

DM

eth

[30]

2014

DSS-s

ingl

e[3

1],sw

DM

R[3

2]

2015

MA

CA

U[3

3]

met

ilen

e[3

4],SM

AR

T[3

5]

DSS-g

ener

al[3

6]

HM

M-F

isher

[37]

,H

MM

-DM

[38]

Get

isD

MR

[39]

2016

Figure 1: Timeline of the approaches that identify differential methylation using bisulfite sequencing data.

which provides genome-wide DNA profiling. However, the application of this protocol on the whole genomeis very expensive when it comes to studying organisms with large genomes. More cost-effective protocols,such as reduced representation bisulfite sequencing (RRBS) and enhanced RRBS (ERRBS), have allowed formethylation analysis with reduced sequencing requirements through a more targeted approach for CpG richgenomic regions that meet specific length requirements [41]. These techniques therefore are more affordablefor studies with multiple replicates.

The overall workflow for bisulfite sequencing data analysis is displayed in Figure 2. The overall pipelineconsists of 6 major elements: i) the input including methylation data (in FASTA/FASTQ format) andthe reference genome, ii) data processing and quality control, iii) alignment of short reads to the referencegenome, iv) post-alignment analysis, v) differential methylation analysis, and vi) the output including DMCs,DMRs, and methylation patterns. The details of each element will be described in the following sections.

Pre-Analysis

Data pre-processing

Bisulfite sequencing data consist of short read sequences in the FASTA/FASTQ file format. Data processingstarts with performing quality control operations on the raw sequencing reads, including quality trimming,and adapter trimming. Quality trimming reduces methylation call errors by trimming the bases that havepoor quality scores, whereas adapter trimming removes the known adapters from short reads in order toincrease mapping efficiency. Existing tools for quality control include FASTX-Toolkit [42], PRINSEQ [43],SolexaQA [44], Cutadapt [45], Trimmomatic [46], and Trim Galore! [47]. Both the input and output of thesetools are files in the FASTA/FASTQ format.

Read mapping

After quality control, bisulfite sequencing reads can be aligned to the reference genome in order to estimatethe methylation levels. Simply aligning these reads by using standard aligners results in poor mappingefficiency since the bisulfite treatment introduces additional discrepancies between the sequencing reads andthe reference genome by converting the unmethylated cytosines to thymines. Therefore, new strategieswere proposed for bisulfite sequencing read alignment. Existing bisulfite sequencing alignment approachescan be divided in two categories: three-letter aligners and wildcard aligners. Three-letter aligners, such asBismark [48], BS Seeker [49], MethylCoder [50], BRAT [51], and GNUMAP-bs [52], convert all Cs into Ts inthe forward strand and all Gs into As in the reverse strand of the reference genome. Equivalently convertedreads are then aligned to these pre-converted forms of the reference genomes using standard genome alignerssuch as Bowtie [53], and Bowtie2 [54]. In contrast, wildcard aligners, such as BSMAP [55], RRBSMAP [56],GSNAP [57], and RMAP [58], replace the Cs of the reference genome with the wildcard letter Y that matchesboth Cs and Ts in the sequencing reads. The alignment results are usually stored in SAM/BAM file format.

3

Page 5: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

Figure 2: The workflow of analyzing DNA methylation using bisulfite sequencing data.

Post-Alignment analysis

After mapping the reads, an optional post-alignment step can be performed to extract meaningful biologicalinformation from the alignment results prior to differential methylation analysis. Several post-alignmentanalysis tools have been developed, including BiQ Analyzer [59], QUMA [60], BRAT [51], MethyQA [61],BSPAT [62], and MethGo [63]. Most of these tools provide summary statistics, quality assessment, andvisualization of the methylation data. Some of these tools include extra features such as read mapping (e.g.BSPAT and BRAT), identifying DNA methylation co-occurrence pattern (e.g. BSPAT), single nucleotidepolymorphisms and copy number variation calling (e.g. MethGo), and detecting allele specific methylationpatterns (e.g. BSPAT).

Differential methylation analysis

After obtaining the methylation information of the CpG sites, typically the next downstream analysis isto perform differential methylation analysis, which is usually done in the form of identifying differentiallymethylated cytosines (DMCs) or differentially methylated regions (DMRs). Identification of DMCs involves

4

Page 6: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

comparing the methylation level at each CpG site across the phenotypes (two or more) and applying statisticaltests for hypothesis testing. Identification of DMRs is usually a two-step process: i) the identification ofDMCs, and ii) grouping the neighboring DMCs as contiguous DMRs by certain distance criteria. However,some approaches can directly identify DMRs. DMCs/DMRs occasionally can be linked to transcriptionalrepression of the associated genes; therefore, they provide crucial biological insights that may lead to thedevelopment of potential drug candidates [1].

In order to identify putative potential DMCs/DMRs from bisulfite sequencing data, some characteristicsneed to be considered. One such characteristic is the spatial correlation between the methylation levels of theneighboring CpG sites, which plays an important role in getting an accurate estimation of the methylationlevels [3, 64]. Incorporating spatial correlation in DM analysis can reduce the required sequencing depth andcan estimate the methylation status of the missing CpG sites [20]. Sequencing depth is another importantcharacteristic that is directly related to the certainty of the methylation scores of CpG sites. Consideringsequencing depth while identifying DMRs is crucial since it can take into account the sampling variabilitythat occurs during sequencing. Another such characteristic is biological variation among replicates, whichis crucial in identifying the regions that consistently differ between groups of samples [65, 66]. Ignoringbiological variation while detecting DMRs might lead to a high number of false positives in the results [14,20, 26]. This is due to the fact that the methylation levels of the CpG sites are heterogeneous not only whenthe cell types are different but also when the cells are of the same type [67, 68, 69, 70].

Classical hypothesis testing methods, such as Fisher’s exact test (FET), Chi-square (χ2) test, regressionapproaches, t-test, moderated t-test, Goeman’s global test, ANOVA, etc. can be used to identify differentialmethylation using bisulfite sequencing data [3, 19, 23, 71, 72]. These approaches can be divided into twocategories based on the data type they use: count based hypothesis tests and ratio based hypothesis tests.

Count based hypothesis tests: Input of these hypothesis testing methods are count values, whichcan be either the number of reads or the number of CpG sites in a predefined genomic region. Fisher’s exacttest (FET) is a classical statistical test used to determine if there are nonrandom associations between twocategorical variables. In the context of methylation analysis, we can use the data to build a contingencytable, where the two rows represent the two methylation states, and the two columns represent a pair ofsamples. When applying FET for two groups of samples, the counts for a methylation status within eachgroup are aggregated into a single number [21]. Chi-square test (χ2) is another classical method to testthe relationship between two categorical variables (methylated vs. unmethylated). In contrast with FET,it allows for testing across multiple samples. As pointed out by Sun et al. [17] and Hurlbert et al. [73],there are several issues related to the aggregation of read counts into a single number while applying tests ofindependence (FET and Chi-square test). First, the read counts are not independent; they represent differentsets of inter-dependent or correlated observations. Thus aggregating the counts violates the fundamentalassumption underlying the test for independence. Second, due to uneven coverage of each individual site, theresults are biased towards the samples with higher coverage. Third, by aggregating (summing) the counts,some of the biological variation (e.g., sample size, intra-group variance) is not taken into account by thehypothesis testing. Therefore, using FET and Chi-square test to compare two groups of samples could leadto a high number of false positives [14, 20, 26].

Regression approaches (e.g. Poission, quasi-Poisson, negative binomial regression) are primarily used fordetecting differentially expressed genes using RNA-Seq data, but they can also be applied in the contextof differential methylation analysis [15]. For example, the read counts can be modeled using a Poissondistribution and a modified Wald test can be used to detect differential methylation as the difference betweentwo Poisson means [74, 75].

Ratio based hypothesis tests: These hypothesis tests use methylation percentage (methylation ratio)instead of count values. For a particular CpG site, methylation percentage is calculated by taking the ratiobetween the methylated read counts and the total read counts of that site. To compare the methylationdifference level between two groups (phenotypes) of samples, classical tests such as t-test [76, 77], moderatedt-test (limma) [78] or Goeman’s global test [79] can be used. While t-test is a classical approach to comparethe means, limma and Goeman’s test are empirical Bayesian approaches that were primarily designed todetect differentially expressed genes using microarray data. When analyzing methylation levels across mul-tiple groups of samples, ANOVA [80] can be used instead of multiple pair-wise comparisons. Compared tocount based hypothesis tests, the ratio based tests take into account the biological variation across multiplereplicates. However, since they only take into account the ratio of the reads (methylated reads vs. all reads),

5

Page 7: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

they ignore the sequencing depth within the CpG sites.Although classical hypothesis testing methods are somewhat useful, straightforward and easy to use, they

are not efficient in more sophisticated methylation analysis, such as identifying de novo regions, consideringspatial correlation among the methylation levels of the CpG sites, estimating methylation levels of missingCpG sites, etc. Over the past few years, several approaches have been developed to address these challenges,which are discussed and summarized in the following subsections:

Logistic regression based approaches

Approaches in this category model the read counts of the CpG sites by using logistic regression in orderto identify differential methylation. One of the popular approaches in this category is methylKit [21],which uses logistic regression to model the methylation proportion at a given base or region when biologicalreplicates are available. In the absence of biological replicates, methylKit uses Fisher‘s exact test to identifydifferential methylation. P-values are corrected using the false discovery rate (FDR) approach or the slidinglinear model approach [81]. MethylKit is commonly used to identify DMCs from predefined regions (RRBSdata). However, it can also be used to identify DMRs from WGBS data based on user defined tilingwindows. Major contribution of methylKit is that it can take into account the sequencing coverage. It canincorporate additional covariates into the model and work with CHG or CHH methylation. It also providesfunctionalities such as sample-wise methylation summary, sample clustering, annotation and visualization ofdifferential methylation, etc.

Another method named eDMR [22] was proposed as an extension of methylKit. eDMR models thedistances between the neighboring CpG sites using a bimodal normal distribution and estimates DMRboundaries using a weighted cost function. After estimating the regional boundaries, DMRs are filteredbased on the mean methylation difference, the number of DMCs, and the number of CpG sites. Significanceof the DMRs are calculated by combining the p-values of the DMCs using Stouffer-Liptak method [82]. Thep-values for DMRs are then corrected for multiple comparisons using the FDR method. eDMR provides listof DMRs and their annotation as output.

Approaches in this category take sequencing coverage into account. They can incorporate additionalcovariates into the model as well. However, they do not consider the biological variation among the replicates.Although eDMR estimates the significance of the identified regions based on spatial auto correlation, it doesnot consider the spatial correlation among the CpG sites when estimating the methylation levels.

Smoothing based approaches

Approaches in this category assume that methylation levels of the CpG sites vary smoothly across the genome.They perform “smoothing” across the samples or predefined regions, which is a technique to estimate themethylation levels of the CpG sites by borrowing information from their neighbors. Group differences acrossdifferent conditions are computed based on the estimated methylation values of the CpG sites. Finally,different statistical tests are used to identify the differentially methylated sites or regions.

One of the most commonly used smoothing based approaches is BSmooth [20], which relies on smoothingacross the genome within each sample. It looks for group differences via CpG-wise t-tests in order to identifyDMRs between two groups. The BSmooth algorithm begins with aligning the sequencing reads to thereference genome. Two alternative pipelines are available for the users to align the reads. The first pipeline,which supports gaped alignment and the alignment of the paired-end bisulfite-treated reads, is based onin silico bisulfite conversion that uses the Bowtie-2 aligner to align the reads [54]. The second pipeline isbased on a newly developed aligner named Merman, which supports the alignment of the colorspace bisulfitereads. After aligning the reads, sample specific quality assessment metrics are compiled. Local likelihoodsmoothing is applied within a smoothing window across the samples to estimate the methylation levels ofthe CpG sites. A signal-to-noise statistic similar to t-test is employed to identify the DMCs. Finally, DMRsare defined by merging the consecutive DMCs based on some defined criteria, such as a cutoff value of thet-statistic, maximum distance between the CpG sites, minimum number of CpG sites, etc.

BSmooth was the first approach primarily developed for DMR identification that takes into account thebiological variation among replicates. BSmooth reduces the required sequencing coverage by applying thelocal likelihood smoothing approach across the samples. It can identify de novo regions from WGBS datasets.

6

Page 8: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

On the other hand, BSmooth lacks suitable error measurement criteria within the identified DMRs. As aresult, there is no way to check whether the identified CpG sites inside the predicted DMRs are true DMCsor selected erroneously. BSmooth predicts methylation values of the CpG sites based on the last observedslope. Hence, for the genomic regions that are not covered by the reads, previously observed methylationlevel will continue, resulting in a biased estimation of the methylation level (i.e., extrapolated methylationvalues of 0 and 1) [24]. BSmooth is not applicable to those datasets that do not have biological replicates.In addition, BSmooth is limited to comparisons between two groups of conditions.

Another approach in this category, BiSeq, performs the smoothing of methylation data across definedcandidate regions instead of across the samples (like BSmooth) [24]. The pipeline begins with definingCpG clusters within the genome based on a minimum number of frequently covered CpG sites (CpG sitesthat are covered by the majority of samples) and a proximity distance threshold defined by the user. Asmoothing function is modeled for each defined cluster. While modeling the smoothing function, the coverageinformation for each CpG site is taken into account in order to make sure that the CpG site with high coveragehas a greater impact on the estimated methylation level than the CpG site with low coverage. Group effectsof the CpG sites are modeled using beta regression with probit link function. DMCs are identified using Waldtest procedure. Next, a hierarchical testing procedure is applied to identify significant clusters containingat least one DMC. While testing the target regions, weighted FDR (WFDR) is applied to take into accountthe size of individual clusters [83]. A location-wise FDR approach is applied to trim the CpG sites that arenot differentially methylated within the selected significant clusters.

One of the major contribution of BiSeq approach is that it provides region-wise error control measurementto test the target regions. This approach is also capable of adding additional covariates to the regressionmodel. In contrast, one of the limitations of the BiSeq approach is that it is only suitable for analyzingexperiments that have predefined regions such as RRBS datasets.

In general, smoothing based approaches have the advantage of considering the spatial correlation betweenthe methylation levels of the CpG sites. By performing smoothing, the required sequencing coverage and thevariance of the methylation levels can be reduced [20]. Furthermore, they can estimate the methylation levelsof missing CpG sites. On the other hand, smoothing based approaches can not detect the low CpG densityregions where methylation has sharp changes such as transcription factor binding sites (TFBs). TFBs areusually very small (i.e., < 50 bp) which might consist of a single CpG that is differentially methylated [84].Thus, biological events involving a single CpG site might not be detected by the smoothing approaches. Inaddition, these approaches are not appropriate for biological systems whose true methylation levels of theCpG sites are not spatially correlated.

Beta-binomial based approaches

Approaches in this category characterize the methylation read counts as a beta-binomial distribution. Inabsence of any biological or technical variation, methylation proportion of a particular CpG site followsa binomial distribution since sequencing reads over a CpG site can be either methylated or unmethylated.Whenever biological and technical variation are present in the data, methylation proportions of the CpG sitesare assumed to follow a beta distribution. Therefore, in the presence of biological replicates, an appropriatestatistical model for methylation analysis is the beta-binomial model as it can take into account both samplingand biological variability.

Over the past few years, several beta-binomial based approaches have been developed in order to identifydifferential methylation, such as DSS [26], MOABS [27], RADMeth [30], methylSig [29], DSS-single [31],MACAU [33], DSS-general [36] and GetisDMR [39]. These approaches differ from each other in the waythey estimate regression parameters, calculate p-values, estimate DMR boundaries, etc.

DSS is one of the approaches in this category that relies on a beta-binomial hierarchical model to identifydifferential methylation using bisulfite sequencing data. In this model, the prior distribution is constructedfrom the whole genome, which is either methylated or unmethylated. True methylation proportions of theCpG sites among the replicates are then modeled using the beta distribution parameterized by group meanand a dispersion parameter. The biological variability is captured by the beta distribution, whereas thesampling variability is captured by the binomial distribution. Variation across the methylation proportion ofthe CpG sites relative to the group mean is captured by the dispersion parameter, which is estimated by anempirical Bayes approach. When the sample size is small, a shrinkage approach is employed to estimate the

7

Page 9: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

dispersion parameter in order to improve the overall performance. Differentially methylated CpG sites aredetermined by using p-values from the Wald test, which is performed by comparing the mean methylationlevels between two groups. Lastly, candidate DMRs are defined by applying user specified thresholds onDMR characteristics among which are: p-value, minimum length, minimum number of CpG sites, etc.

The key contribution of the DSS approach is the shrinkage procedure that improves the dispersionparameter estimation. For this reason, this approach is particularly useful when the sample size is small.By applying the Wald test procedure, this approach takes into consideration the biological variation andsequencing coverage.

A more recent method, named DSS-single, is an improved version of the DSS approach which cantake into account the spatial correlation among the CpG sites across the genome. In addition, DSS-singleconsiders the within-group variation without biological replicates by using the neighboring CpG sites as‘pseudo-replicates’. Similar to DSS, DSS-single captures the technical variability using binomial distributionand the biological variability using beta distribution. The beta distribution is parameterized with the groupmean and dispersion parameter. DSS-single estimates the group mean using a smoothing function and thedispersion parameter using an empirical Bayes procedure. Hypothesis testing is performed using the Waldtest to identify the DMCs. Later, user defined thresholds are applied in order to define the DMR boundariesand select candidate DMRs.

An even more recent variation of DSS approach, named DSS-general, identifies differentially methylatedloci from bisulfite sequencing data under general experiment design. DSS-general identifies differentiallymethylated loci by modeling the methylation count data for each locus using the beta-binomial regressionwith the “arcsine” link function. The “arcsine” link function is applied to perform a data transformationthat decreases the dependency of the data variance on the mean and prepares it for the next step. Due tothis data transformation, the regression coefficient and the variance matrix can be estimated by applying thegeneralized least square (GLS) method, as opposed to the beta-binomial generalized linear model or logisticregression which are limited when values are separable (e.g. values for unmethylated sites are close to 0,values for methylated sites are close to 1). Finally, Wald test is used to perform hypothesis testing.

The key advantage of DSS-general approach is that it is applicable to bisulfite sequencing data withmultiple groups or covariates. In addition, it uses “arcsine” link function, which is more efficient than otherwidely used “logit” and “probit” functions since it estimates the regression parameters in one iteration.

MOABS is another approach that relies on beta-binomial assumption to identify differential methyla-tion. Similar to DSS, the prior distribution is constructed from the whole genome, resulting in a bimodaldistribution. The posterior distribution follows a Beta distribution, which is estimated using an empiri-cal Bayes approach. When biological replicates are available, the posterior distribution is generated usingthe maximum likelihood approach. The significance of the differential methylation between two samples isrepresented by a single metric named credible methylation difference (CDIF), which incorporates both thebiological and statistical significance of the differential methylation. MOABS can also work with CHG orCHH methylation.

RADMeth is another analysis pipeline that relies on the beta-binomial assumption. RADMeth em-ploys a beta-binomial regression approach using “logit” link function to model the methylation levels of theCpG sites across the samples. Regression parameters are estimated using a standard maximum likelihoodapproach. In the beta-binomial regression model, RADMeth incorporates the experimental factors usinga model matrix. The differential methylation of a particular site is determined by comparing two fittedregression models (i.e. reduced model without factors and full model with factors) using the log-likelihoodratio. Subsequently, p-values of the neighboring CpG sites are combined using the weighted Z-test (i.e.Stouffer-Liptak test [85]) in order to obtain the differentially methylated regions. The key contribution ofthis approach is the ability to analyze WGBS data in multiple factor experiments.

MethylSig is another analysis pipeline that utilizes beta-binomial model across the samples to identifyeither DMCs or DMRs. The pipeline begins with taking the number of Cs and Ts as input. The approachuses the beta-binomial model to estimate the methylation levels at each CpG site or region, which involvesthe two following steps: i) estimate the dispersion parameter for each CpG site or region, which accountsfor biological variation among the samples within a group; and ii) calculate the group methylation level ateach CpG site or region using the estimated dispersion parameters. In each step, local information can beincorporated from nearby CpG sites or regions to increase statistical power. The significance level of themethylation difference is calculated using the likelihood ratio test. Similar to DSS, MethylSig is useful when

8

Page 10: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

the sample size is small. MethylSig uses local information and employs a maximum likelihood estimator tocompute both the methylation level and the variance.

MACAU is based on binomial mixed model (BMM) that takes into account the population structuresfrom a dataset. This model is a generalized beta-binomial model consisting of an extra term to model thepopulation structure. In absence of that extra term, this model can be reduced to a beta-binomial model.In this approach, the prior distribution is constructed from a binomial mixed model whereas the posteriordistribution is constructed from a log-normal distribution. Model parameters are estimated by using aMarkov chain Monte Carlo (MCMC) algorithm based approach. Hypothesis testing is performed by usingWald test. Finally, DMRs are constructed by merging the DMCs using empirical thresholds.

One advantage of this approach is that it can add a predictor variable of interest in the model to checkthe association with any genetic background. In addition to considering biological variability among thereplicates and the sampling variability among the sequencing reads, this method also takes into considerationthe population variability. Furthermore, it can be applied to both WGBS and RRBS dataset.

GetisDMR, a recent beta-binomial based approach, identifies variable size DMRs directly from WGBSdata by using a local Getis-Ord statistic, which is commonly used to identify statistically significant spatialclusters (hotspots). By incorporating this statistic into differential methylation analysis, GetisDMR accountsfor spatial correlation among the methylation levels of the CpG sites, along with the biological and samplingvariability. When biological replicates are available, beta-binomial regression with logistic link function isused to model the methylation level of each CpG site. Model parameters are estimated by using the maximumlikelihood function. Hypothesis testing is performed by using the likelihood ratio test. In the absence ofbiological replicates, methylation levels are modeled by using binomial distribution and hypothesis testingis performed by using Fisher‘s exact test. P-values from the hypothesis testing are further used to calculatez-scores. Finally, a local Getis-Ord statistic is used based on the z-scores to identify DMRs using theinformation from the neighboring CpG sites. The Getis-Ord statistic uses the distribution of the data (i.e.z-scores) to compute a score of the non-random association between a data point and it’s neighbors, where apositive score shows a positive association and a negative score shows a negative association. This statisticis then used to identify data regions with points that exhibit non random associations (i.e., DMRs).

One of the primary strengths of GetisDMR is that it can detect DMRs with variable length, instead ofdepending on user specified threshold parameters. It can take into account the spatial correlation betweenthe neighboring CpG sites. Additionally, it can incorporate additional confounding factors into the model.Furthermore, it can work with multiple groups, with or without biological replicates. One drawback of thisapproach is that it can not work with enriched regions, such as RRBS data.

Beta-binomial based approaches are useful since they take into account both sampling variability amongthe read counts and biological variability among the replicates. Furthermore, these approaches are able toidentify differential methylation at single base resolution from low CpG-density regions (e.g., TFBs). Onthe other hand, most of the beta-binomial based approaches (except DSS-single, MACAU and GetisDMR)do not take into account the spatial correlation between the methylation levels of the CpG sites.

Hidden Markov Model (HMM) based approaches

Approaches in this category employ Hidden Markov Model (HMM) to identify differentially methylated pat-terns from bisulfite sequencing data. These approaches model the methylation levels of the CpG sites asmethylation states (i.e. hypermethylation, hypomethylation and no change) instead of continuous methyla-tion values. Transition probabilities among the methylation states represent the distance distribution amongthe DMCs, whereas emission probabilities represent the likelihood of differential methylation for the CpGsites. High transition probabilities and low transition probabilities are used to model the neighboring CpGsites that have high similarities and low similarities within their methylation levels, respectively. Parametersare estimated usually by using established learning algorithms whereas potential DMRs are identified usingdifferent statistical approaches.

One of the approaches in this category named ComMet [22], included in the Bisulfighter methylationanalysis suite [25, 86], combines all the samples within a group into one sample and identifies the DMRs bycomparing a pair of two samples. This method captures the probability distribution of distances betweenthe neighboring DMCs and adjusts the DMC chaining criteria automatically for each dataset. Transitionprobabilities are estimated using an expectation maximization algorithm whereas emission probabilities are

9

Page 11: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

estimated from a beta-binomial mixture model. Parameters of the beta-binomial model are estimated byincorporating an unsupervised learning algorithm. DMRs are identified by using a dynamic programmingalgorithm

One of the advantages of ComMet is that it does not require biological replicates to identify DMRs. Ittakes into account the sequencing coverage and the spatial distribution of the neighboring CpG sites. Onthe other hand, one of the limitations of this approach is that it does not take into account the biologicalvariation across replicates, which might lead to higher number of false positives in the results [14, 20, 26].

Another approach in this category is HMM-Fisher [37], which estimates the methylation status ofthe CpG sites for each sample instead of combining all the samples. Similar to ComMet, HMM-Fishermodels both the similarity and dissimilarity of the methylation levels of the neighboring CpG sites usingtransition probability. HMM-Fisher estimates the transition probabilities using a Dirichlet distributionwhereas emission probabilities are computed using a truncated normal distribution. After estimating themethylation levels of all the CpG sites for each sample, differentially methylated CpG sites are identifiedusing Fisher’s exact test. Identified DMCs are further grouped into DMRs if the distance between the CpGsites is less than 100 bases. Non-consecutive CpG sites are reported as DMCs in the output.

One of the major contribution of HMM-Fisher is that it can identify DMRs of variable size, instead ofdepending on user defined boundary thresholds. It takes the biological variation among the replicates intoaccount and can provide both DMCs and DMRs as output. It can also be used to identify sample-wisemethylation patterns.

HMM-DM [38] is another approach that employs HMM to identify differential methylation. HMM-DMdirectly estimates the differential methylation states of the CpG sites for each sample across the groups.In this approach, the transition probability of each CpG site only depends on the methylation state of theimmediate previous CpG site. Like HMM-Fisher and ComMet, the transition probabilities are estimatedfrom a Dirichlet distribution. In contrast, emission probabilities are estimated from a beta distribution.Differential methylation states for the CpG sites are estimated using the Markov Chain Monte Carlo (MCMC)method. Finally, consecutive CpG sites with same methylation status are grouped together based on userdefined thresholds to form DMRs. Similar to HMM-Fisher, HMM-DM can identify variable size DMRs fromWGBS and RRBS data. It also takes into account the biological variation among the replicates.

In general, one of the key advantages of HMM based approaches is that they can identify DMRs withvariable size in contrast to the approaches that use a fixed window size. They consider the spatial correlationof the CpG sites by borrowing methylation information from their neighbor sites. These approaches can alsoidentify independent DMCs or short DMRs; therefore, they can identify sharp methylation changes amongthe CpG sites. In addition, all the three approaches discussed above are applicable to both WGBS andRRBS datasets.

Entropy based approaches

Entropy based approaches identify the methylation difference across multiple samples using Shannon en-tropy [87], which is a quantitative measure of the variation or change in a series of events. Approaches inthis category are capable of providing sample specific methylation information.

QDMR [18] was the first approach that utilizes Shannon entropy [87] for the purpose of identifyingDMRs from bisulfite sequencing data. It quantitatively identifies DMRs from predefined regions based onthe average methylation levels of the CpG sites of the regions. The probability that a sample is methylatedat a specific location is calculated by taking the ratio of the methylation level of that sample and the totalmethylation level across all samples. The original entropy formula can be used to measure the methylationdifference across samples, where lower entropy represents higher methylation difference. However, this wayof calculating entropy is biased towards hypermethylation in minor samples. Therefore, QDMR introducesa one-step Tukey biweight weighted mean to make their approach less sensitive to such outliers. Finally,a region is differentially methylated if the weighted entropy for that region is smaller than a certain cutoffwhich is determined by using a probability model. QDMR takes into account the biological variability acrossthe samples. In addition to the list of DMRs, QDMR provides quantification, visualization and annotationof the DMRs for each sample. One of the limitations of this approach is that it can identify DMRs onlyfrom predefined regions (RRBS); therefore, it is unable to identify de novo regions.

An improved approach in this category, CpG MPs [19], have been proposed from the same research

10

Page 12: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

group which can identify methylation patterns across paired or multiple samples using WGBS data. Thisapproach identifies de novo methylated and unmethylated regions using hotspot extension algorithm basedon the methylation status of the neighboring CpG sites. It combines a combinatorial algorithm with Shannonentropy to identify the regions that are differentially methylated (i.e. DMRs) or conservatively unmethylated(i.e. CUMRs).

The overall workflow of CpG MPs is divided into four modules. The first module normalizes the sequenc-ing reads of the CpG sites into methylation levels. The second module categorizes the methylation statesof the CpG sites based on their normalized methylation levels into four categories such as unmethylatedCpGs, partially unmethylated CpGs, methylated CpGs and partially methylated CpGs. CpGs are thenscanned from 5′ to 3′ end in order to extract a certain number of methylated (unmethylated) CpGs to createmethylated (unmethylated) hotspots. Next, the hotspots are extended both upstream and downstream toincorporate partially methylated or partially unmethylated CpGs into their corresponding hotspots. Neigh-boring regions with the same patterns are then combined based on a given threshold. Also, the mean valueand the standard deviation of the methylation levels of the CpG sites within each region are computed. Thethird module identifies conservatively unmethylated regions (CUMRs), conservatively methylated regions(CMRs) and differentially methylated regions (DMRs) by using a combinatorial algorithm with Shannonentropy. At first, the identified methylated and unmethylated regions are mapped to the reference genomeand then overlapping regions (ORs) are recorded in the reference genome. Next, the hotspot extension tech-nique is used to merge the neighboring ORs with the same methylation patterns across multiple samples. Amodified Shannon entropy based method is used to identify the regions that are significant across multiplesamples. The fourth module analyzes sequencing features and visualizes the identified regions.

One key advantage of CpG MPs is that it determines the DMR boundaries by applying combinatorialalgorithm instead of depending on empirical thresholds to identify DMRs; hence, it can detect variable lengthboundaries. It can also be used to identify methylation patterns for each sample. In addition, CpG MPsconsiders biological variation among the replicates. However, CpG MPs does not include any error controlmeasurement among the identified regions.

A more recent approach, SMART [35], extends the weighted entropy concept introduced by QDMR todetermine cell type-specific methylation patterns from a large number of DNA methylomes. The input ofSMART is the sample-wise methylation status of the CpG sites. SMART first quantifies the methylationspecificity across the samples using Shannon entropy with a one-step Tukey biweight weighted mean. Next,it incorporates methylation similarities between neighboring CpG sites by estimating the methylation levelof the sites based on Euclidean distance. These similarity metrics and methylation specificity states are thenused to segment the genome into groups of CpG sites. Finally, a group of CpG sites is called hypermethylated(hypomethylated) if the methylation levels of that group is significantly higher (lower) than the averagemethylation levels of all samples determined by one sample t-test.

Major contribution of SMART is that it can identify cell type-specific methylation marks (i.e. HyperMarkand HypoMark) from a large sample cohort. Instead of depending on user defined thresholds, it determinesDMR boundaries of variable sizes by quantifying the methylation levels of the CpG sites. It also providesfunctional annotation of the identified methylation marks. It considers the biological variation among thereplicates and spatial correlation among the methylation levels of the CpG sites across the genome. Inaddition, it can be applied to both WGBS and RRBS data.

One of the key benefits of the entropy based approaches is that they can directly identify DMRs withoutidentifying DMCs. As a result, entropy based approaches that can detect de novo regions (i.e., CpG MPsand SMART) do not depend on empirical boundary estimations. Furthermore, these approaches take intoaccount the biological variation within replicates.

Mixed statistical tests based appraoches

Approaches in this category rely on established statistical tests, such as Fisher’s exact test, t-test, ANOVA,etc. to identify DMCs/DMRs. These statistical tests are applied to CpG sites across the samples or withinpredefined genomic regions (i.e., fixed/variable size windows).

One of the approaches in this category, COHCAP [23], identifies differentially methylated CpG islandsfrom two or more groups using predefined regions. It also provides integration with gene expression dataand visualization of the results. The pipeline starts with taking aligned read counts (e.g., output of Bismark

11

Page 13: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

aligner [48]) as input. CpG sites are marked as methylated or unmethylated based on a user defined threshold.P-values of the CpG sites are first calculated by using different statistical approaches (i.e., Fisher’s exact test,ANOVA and t-test) based on the chosen experimental design. Later the p-values are corrected using the FDRapproach. CpG sites are filtered based on p-value of the CpG site, average methylation proportion across allthe samples, and FDR value. CpG islands with a minimum number of filtered CpG sites are considered ascandidate DMRs. In the “average by CpG site” pipeline, p-values of the CpG sites within candidate DMRsare calculated by the previously selected statistical method. In the “average by CpG island” pipeline, betavalues of the filtered CpG sites within each candidate DMR are averaged, and then a p-value is calculatedbased on the averaged beta value. The major contribution of COHCAP is that it provides integration ofgene expression data with differential methylation analysis. In addition, it takes into account the biologicalvariation among the replicates.

DMAP [28], another approach in this category, is a fragment-based approach primarily designed forthe RRBS protocol to identify differentially methylated fragments (DMFs). Nonetheless, this approach canalso detect DMRs from WGBS data. In addition to the identification of DMRs/DMFs, DMAP providesinformation about nearby genes and CpG sites.

The input of DMAP is methylated read counts in Bismark aligner [48] format. In order to identifycandidate genomic regions from WGBS data, DMAP defines fixed-size windows (i.e. default 1000bp). ForRRBS data, it defines fragments of variable sizes (40-220bp). Next, a p-value is calculated for each region orfragment based on the methylated CpG counts using a chosen statistical test (Chi-square test, Fisher’s exacttest and ANOVA). Fisher’s exact test is recommended for pairwise comparison, Chi-Squared test is recom-mended for testing variability across multiple samples, and ANOVA is recommended for comparing groupsof samples. Candidate regions are selected as DMRs (for WGBS data) and DMFs (for RRBS data) based ona user defined p-value threshold. Options to correct for multiple comparisons are also provided. The outputis a list of candidate regions/fragments with their p-values and information regarding the statistical test thatwas applied. Furthermore, DMAP provides gene annotation features of the identified regions/fragments.Major contribution of this approach is that it can detect variable-size fragments (DMFs) from predefinedregions.

swDMR [32], another approach in this category, integrates multiple commonly used statistical ap-proaches in order to identify DMRs from WGBS data. The pipeline begins with taking the methylated readcounts of each CpG site (preferably from the Bismark aligner [48]) as input which are later converted tomethylation ratios. Next, it divides the genome into multiple overlapping fragments or windows of equallength based on user defined thresholds. A statistical approach is chosen from a list of commonly used ap-proaches (i.e., Fisher’s exact test, t-test, Chi-square, Wilcoxon, ANOVA and Kruskal Wallis test) to performhypothesis testing within each window across two or more samples. For two samples, methylation levelsof the CpG sites are compared using t-test, Wilcoxon, Chi-square or Fisher’s exact test. For more thantwo samples, methylation levels are compared using either ANOVA or Kruskal Wallis test. Therefore, foreach window swDMR provides a p-value generated using the selected statistical test. The resulting p-valuesare corrected for multiple comparisons using the FDR approach. The regions with corrected p-values lowerthan a predefined threshold are selected as potential DMRs. Using an extension function, two potentialDMRs are merged if the distance between them is less than a predefined threshold. The merged DMRs aretested with the previously selected statistical test, and p-values are corrected with respect to the new DMRboundaries. Finally, the merged DMRs with the corrected p-values less than the user defined threshold areselected as candidate DMRs. swDMR approach can be used without biological replicates and can work withCHG or CHH methylation. It also provides functionalities such as DMR cluster analysis, visualization andannotation of DMRs, etc.

The key advantage of the approaches in this category is that they provide flexibility in selecting differentstatistical tests, and methods for multiple test correction. In contrast, these approaches do not take intoaccount the spatial correlation between the methylation levels of the neighboring CpG sites. In addition,these approaches either work on predefined regions or divide the genome into windows of fixed/variable size.Hence, they miss the low CpG density regions where methylation has sharp changes such as transcriptionfactor binding sites (TFBs) that can contain a single differentially methylated CpG site [84]. Importantly,they depend on user defined thresholds to estimate the DMR boundaries.

12

Page 14: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

Binary segmentation based approaches

Approaches in this category use binary segmentation algorithm to recursively divide the genome in order toidentify candidate regions from bisulfite sequencing data. The only approach in this category, metilene [34],utilizes a circular binary segmentation algorithm to identify DMRs. It can be used to analyze both WGBSand RRBS experiments across multiple samples with or without replicates.

The pipeline starts with a pre-segmentation step that divides the genome into primary regions based onthe available methylation information. The pre-segmented regions are then iteratively segmented using acircular binary segmentation algorithm in order to identify a window with the maximum mean differencesignal (MDS). The segmentation is terminated when a segment has less number of CpGs than a predefinedthreshold, or it does not show any improvement in the two-dimensional Kolmogorov-Smirnov test results.The identified window is marked as a potential DMR. The output of metilene is a list of DMRs with theirp-values, adjusted p-values and the p-value from a Mann-Whitney U test.

Metilene can detect de novo regions of various lengths without relying on user defined boundary thresh-olds. It takes into account the variation among biological replicates. In addition, it can predict methylationlevels of the missing CpG sites using beta distribution. One of the limitations of metilene is that the resultgreatly depends on the minimum segment size parameter, which can lead to false negatives (if it is too high)or false positives (if it is too low). In addition, it does not consider the spatial correlation of the methylationlevels of the CpG sites across biological replicates.

Figure 3: Pros and cons of the seven categories used to identify differential methylation discussed in thissurvey.

13

Page 15: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

Figure 4: The overall workflow of 22 approaches developed for differential methylation analysis. t-test*denotes a signal-to-noise statistic similar to the t-test. Predefined criteria represents user defined thresholdssuch as p-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimumnumber of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. HMM denotes Hidden MarkovModel, MCMC denotes Markov Chain Monte Carlo, and CDIF denotes credible methylation difference.

Discussion

In this survey, we briefly summarize 22 approaches that identify differential methylation (DM) using bisul-fite sequencing data focusing on their important features, such as concept used, protocol used, biologicalvariability, spatial distribution, additional covariates, error correction, sequencing coverage, identifying denovo regions, etc. The approaches are categorized into seven different categories based on their primaryconcepts or techniques used to identify DM. Some of the approaches involve multiple concepts to identifyDM; hence, they could be assigned to multiple categories. On such cases, we categorize the approach basedon the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3.The important features of the approaches covered in this survey are summarized in Table 1. Moreover, theworkflow of the approaches, including the information about genome segmentation, difference quantificationand DMR calling, are described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be donebased on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data,and both count and ratio data). On that case, the methods will be distributed among the categories asfollows: i) Count data: MethylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADmeth, MethylSig,MACAU, GetisDMR, ComMet; ii) Ratio data: BSmooth, BiSeq, qDMR, CpG MPs, SMART, HMM-Fisher,

14

Page 16: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

Figure 5: A higher level classification of the approaches discussed in this survey based on the data type usedwhen modeling the methylation levels of the CpG sites.

HMM-DM, COHCAP, metilene; iii) Both count and ratio data: DMAP, swDMR. A graphical representationof this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the numberof groups allowed (one group of samples, two groups without replicates, and two groups with replicates),based on the protocol used (WGBS, RRBS, and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider since it can reduce the numberof false positives in the results [14, 20, 26]. If an approach takes into account each biological replicate withina group separately when modeling the methylation levels of the CpG sites, then biological variability isconsidered. On the other hand, biological variability is lost if an approach combines the read counts of theCpG sites across the replicates. Although classical hypothesis testing methods (e.g., t-test and ANOVA) takebiological variation into account, BSmooth was the first approach primarily developed for DMR identificationthat takes into account the biological variation among replicates. Within the surveyed approaches, smoothingbased approaches, beta-binomial based approaches, entropy based approaches, etc. (see Table 1 for full list)take the biological variation among the replicates into account.

Spatial correlation is another factor to consider, which provides better estimation of the methylation lev-els of the CpG sites by borrowing information from the neighboring CpG sites. A popular way of consideringspatial correlation is to perform “smoothing” operation prior to the detection of differential methylation. Inthis survey, smoothing based approaches (BSmooth, and BiSeq) and few beta-binomial based approaches(DSS-single, MACAU, and GetisDMR) fall into this category. By performing smoothing when identify-ing DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpGsites [20]. Additionally, smoothing procedure helps to identify relatively longer DMRs. However, this pro-cedure is only applicable for the genome whose methylation profile is known to be smooth. Also smoothingis not suitable for the dataset whose CpG sites are sparse (commonly seen in RRBS protocol) due to ex-trapolated methylation values of 0 and 1. Besides smoothing, there are other techniques as well that cantake spatial correlation into account. For instance, eDMR uses auto correlation of the methylation data,

15

Page 17: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

HMM based approaches (ComMet, HMM-Fisher, and HMM-DM) use hidden Markov model, CpG MPs useshotspot extension algorithm, and SMART uses euclidean distance based on methylation similarity to takeinto account spatial correlation of the CpG sites.

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation.Count based hypothesis tests (e.g., Fisher’s exact test, Chi-square test) take into account sequencing coverageby simply pooling the read counts; however, these tests require grouping of read counts, and this is biasedtowards the samples with higher sequencing coverage. For other differential methylation analysis approaches,consideration of coverage information is not merely dependent on the hypothesis tests but dependent onwhether coverage information is incorporated when modeling the methylation levels of the CpG sites. Forexample, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG sites andthen applies FET on the count of methylation status of CpG sites to identify DMR. Therefore, HMM-Fisherdoes not take into account read coverage despite using FET as the hypothesis test. Among the surveyedapproaches, BiSeq, ComMet, DMAP, swDMR, logistic regression based, and beta-binomial based approachesare able to take the coverage information into account. Some approaches also include additional filters toremove low coverage CpG sites before estimating methylation.

Identifying de novo regions is another very important feature of the approaches that identify DM. Ap-proaches that identify de novo regions use various techniques such as merging DMCs using empirical thresh-olds, entropy based algorithms, binary segmentation, etc. to estimate DMR boundaries (see Figure 4). Whileempirical thresholds allow for more flexibility to the users, proper tuning of these parameters is necessaryin order to get robust results. Some of the approaches, in addition to the list of DMRs, provide informationsuch as the list of DMCs, genetic annotations, visualization of the DMRs, etc.

Error control is an important factor to consider as it reduces the number of false positives in the results.Approaches control errors by correcting p-values for each CpG site across the genome, correcting p-valuesfor each region, correcting the p-values within the identified regions, etc. In our survey, we also provideinformation whether an approach allows additional covariates or not.

Identification of the fittest approach, among all that are available, is a challenging task in differentialmethylation analysis. If biological replicates are available, beta-binomial approaches are suitable since theytake both coverage information and biological variability among the replicates into account. In addition, theycan identify low CpG density regions where methylation has sharp changes (e.g., transcription factor bindingsites). Within the beta-binomial based approaches DSS-single, MACAU, and GetisDMR accounts spatialcorrelation into account. Therefore, these three approaches are more appropriate if the methylation levelsof the CpG sites are known to be spatially correlated and biological replicates are available. Smoothingbased approaches, entropy based approaches, HMM-Fisher, HMM-DM, and metilene can also be appliedwhen biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to bespatially correlated, approaches that take spatial distribution into consideration, such as smoothing basedapproaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG MPs, and SMART should beused.

When sample size is small in the dataset, DSS, MethylSig, and HMM-Fisher are appropriate. While DSSuses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSiguses local information and employs a maximum likelihood estimator to compute both the methylation leveland the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting Fisher’s exacttest if the distance between them is less than 100 bases. If multiple experimental factors are available in thedataset, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMRare more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG MPsor HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify differentialmethylation patterns (hypermethylaion and hypomethylation) across two groups of samples, HMM-Fisherand HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. Forinstance, if the data protocol is RRBS, and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general, or COHCAP can be applied. In order to work with CHG or CHH methylation, methylKit, eDMR,MOABS, DSS, RADMeth and swDMR are recommended since they are not limited to CpG methylation.

Comparison of some of the approaches can be found from two existing review papers, Klein et al. [15]and Yu and Sun. [16]. Klein et al. compared four tools that are originally developed for DM analysis:

16

Page 18: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

Tab

le1:

Su

mm

ary

ofth

eim

por

tant

char

acte

rist

ics

of

the

22

surv

eyed

ap

pro

ach

es.

For

colu

mn

s5–9,

3m

ean

sth

at

the

met

hod

con

sid

ers

the

char

acte

rist

ican

d7

mea

ns

that

the

met

hod

does

not

con

sid

erth

ech

ara

cter

isti

c.F

or

the

8th

colu

mn

,3

*m

ean

sth

at

the

met

hod

con

sid

ers

sequ

enci

ng

cove

rage

wh

enco

unt

bas

edhyp

oth

esis

test

sar

ep

erfo

rmed

.F

or

the

9th

colu

mn

,id

enti

fyde

novo

regio

ns,

3m

ean

sth

at

the

met

hod

can

iden

tify

de

novo

regi

ons

and

7m

ean

sth

atth

em

ethod

can

not

iden

tify

de

novo

regio

ns.

For

colu

mn

s5–9,�

mea

ns

the

chara

cter

isti

cis

not

ap

pli

cab

le.

Tot

alci

tati

ons

and

Cit

atio

ns

per

year

rep

rese

nt

the

nu

mb

erof

cita

tion

san

dth

eav

erage

nu

mb

erof

cita

tion

sp

erye

ar,

resp

ecti

vel

y,as

show

non

goog

lesc

hol

aras

of10

/24/

2016

.D

Md

enot

esd

iffer

enti

al

met

hyla

tion

.D

MR

s,D

MF

s,D

MC

san

dD

ML

sd

enote

diff

eren

tially

met

hyla

ted

regio

ns,

diff

eren

tial

lym

ethyla

ted

frag

men

ts,

diff

eren

tial

lym

ethyla

ted

cyto

sin

esan

dd

iffer

enti

all

ym

ethyla

ted

loci

,re

spec

tive

ly.

Method

&R

ef.

Concept

used

Protocol

Prim

ary

purp

ose

Bio

log.

var.

Spatia

ldis

tr.

Addl.

cov.

Error

corr.

Seq.

cov.

Identif

ydenovo

regio

ns

Total

cit

.C

it./

year

1m

ethylK

it[2

1]

Logis

tic

regre

ssio

nB

oth

Identi

fyD

MC

sand

annota

te7

73

33

3175

43.7

5

2eD

MR

[22]

Logis

tic

regre

ssio

nB

oth

Identi

fyD

MC

sand

DM

Rs

73

33

33

28

8

3B

Sm

ooth

[20]

Sm

ooth

ing

WG

BS

Identi

fyD

MR

sw

ith

replicate

s3

37

77

3156

39

4B

iSeq

[24]

Sm

ooth

ing

RR

BS

Identi

fyD

MR

sw

ith

FD

Rcorr

ec-

tion

33

33

37

62

18

6D

SS

[26]

Beta

-bin

om

ial

Both

Identi

fyD

ML

sfo

rsm

all

sam

ple

s3

77

73

343

16.1

5M

OA

BS

[27]

Beta

-bin

om

ial

Both

Identi

fyD

MC

sw

ith

replicate

s3

77

73

349

18.4

7R

AD

Meth

[30]

Beta

-bin

om

ial

WG

BS

Identi

fyD

ML

sand

DM

Rs

37

37

33

31

13.3

8m

ethylS

ig[2

9]

Beta

-bin

om

ial

Both

Identi

fyD

MC

sand

DM

Rs

37

77

33

42

17.4

9D

SS-s

ingle

[31]

Beta

-bin

om

ial

Both

Identi

fyD

MR

sw

ithout

replicate

s3

37

33

315

12

10

MA

CA

U[3

3]

Beta

-bin

om

ial

Both

Identi

fyD

Musi

ng

popula

tion

stru

ctu

re3

33

33

38

8

11

DSS-g

eneral

[36]

Beta

-bin

om

ial

RR

BS

Identi

fyD

ML

s3

73

33

73

3

12

Getis

DM

R[3

9]

Beta

-bin

om

ial

WG

BS

Identi

fyD

MR

sdir

ectl

y3

33

33

30

0

13

Com

Met

[25]

HM

MB

oth

Identi

fyD

MR

s7

37

73

324

8.7

14

HM

M-F

isher

[37]

HM

MB

oth

Identi

fyD

Mpatt

ern

s3

37

77

34

4

15

HM

M-D

M[3

8]

HM

MB

oth

Identi

fyD

MR

s3

37

77

34

4

16

QD

MR

[18]

Shannon

entr

opy

RR

BS

Identi

fyD

MR

s3

7�

�7

761

10.7

17

CpG

MP

s[1

9]

Shannon

entr

opy

WG

BS

Identi

fyD

Mpatt

ern

s3

3�

�7

330

7.2

18

SM

AR

T[3

5]

Shannon

entr

opy

WG

BS

Identi

fycell

typ

e-s

pecifi

cm

eth

yla

-ti

on

mark

s3

3�

�7

39

9

19

CO

HC

AP

[23]

Mix

ed

stati

stic

sR

RB

SId

enti

fyD

MC

sand

consi

stent

CpG

isla

nds

37

73

77

27

7.7

20

DM

AP

[28]

Mix

ed

stati

stic

sB

oth

Identi

fyD

MR

sand

DM

Fs

37

73

3*

331

12.4

21

sw

DM

R[3

2]

Mix

ed

stati

stic

sW

GB

SId

enti

fyD

MR

sw

ithout

replicate

s7

77

33

*3

43.2

22

metil

ene

[34]

Bin

ary

segm

enta

tion

Both

Identi

fyD

MR

sin

larg

egro

ups

of

sam

ple

s3

77

37

30

0

17

Page 19: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

Tab

le2:

Com

par

ison

ofth

eav

aila

ble

imp

lem

enta

tion

sof

the

22

surv

eyed

ap

pro

ach

es.∗ R

AD

Met

his

now

part

of

the

Met

hP

ipe

tool

rele

ase

don

09/0

6/20

13w

ith

the

late

stu

pd

ate

on10

/21/

2016.∗∗

CC

AN

S:

Cre

ati

ve

Com

mon

sA

ttri

bu

tion

-Non

Com

mer

cial-

Sh

are

Ali

ke3.0

Un

port

edL

icen

se.

∗∗∗ C

ust

omli

cen

sest

atin

gth

atth

eso

ftw

are

isfr

eeof

charg

eto

rese

arc

her

sw

ork

ing

at

aca

dem

ic,

non

-pro

fit

org

an

izati

on

son

non

-com

mer

cial

pro

ject

s.∗∗∗∗

PS

FL

:P

yth

onS

oftw

are

Fou

nd

atio

nL

icen

se.

CL

I:co

mm

an

dli

ne

inte

rface

.

Method

(T

ool)

&T

ool

ref.

Pla

tfo

rm

Avail

abil

ity

Lic

ense

Output

Publi

shed

date

Up

dated

date

1m

ethylK

it[2

1]

Bic

onducto

rR

pack

age

standalo

ne

Art

isti

cv2

DM

Cs/

DM

Rs

list

(table

)D

MC

s/D

MR

sp

er

chrr

om

oso

me

(gra

ph)

11/09/2011

10/22/2016

2eD

MR

[21,

22]

Rpack

age

standalo

ne

Art

isti

c/G

PL

DM

Rs

list

(table

)01/04/2013

04/04/2014

3B

Sm

ooth

(bsseq)

[20]

Bic

onducto

rR

pack

age

standalo

ne

Art

isti

cv2

DM

Rs

list

(table

)D

MR

/lo

cus

meth

yla

tion

level

(gra

ph)

07/20/2012

10/14/2016

4B

iSeq

[88]

Bic

onducto

rR

pack

age

standalo

ne

LG

PL

v3

DM

Rs

list

(table

)D

MR

mean

meth

yla

tion

(gra

ph)

04/02/2013

10/17/2016

6D

SS

[89,

26,

31,

36]

Bic

onducto

rR

pack

age

standalo

ne

GN

UG

PL

DM

Cs/

DM

Rs

list

(table

)D

MR

%m

eth

yla

tion

/lo

cus

(gra

ph)

06/04/2012

10/17/2016

5M

OA

BS

[27]

C+

+pack

age

&P

erl

scri

pt

standalo

ne

GN

UG

PL

v3

DM

Rs

list

(table

)06/12/2013

05/30/2015

7R

AD

Meth

[30]

C+

+pack

age

standalo

ne

GN

UG

PL

v3

DM

Cs/

DM

Rs

list

(table

)03/27/2014

05/01/2014∗

8m

ethylS

ig[2

9]

Rpack

age

standalo

ne

GN

UG

PL

v3

DM

Cs/

DM

Rs

list

(table

)C

pG

site

sm

eth

yla

tion

rate

(gra

ph)

06/17/2014

06/10/2016

9D

SS-s

ingle

(D

SS)

[89,

26,

31,

36]

Bic

onducto

rR

pack

age

standalo

ne

GN

UG

PL

DM

Cs/

DM

Rs

list

(table

)D

MR

%m

eth

yla

tion

/lo

cus

(gra

ph)

04/16/2015

10/17/2016

10

MA

CA

U[3

3]

C+

+pack

age

&R

scri

pt

standalo

ne

GN

UG

PL

DM

Cs

list

(table

)06/05/2015

12/09/2015

11

DSS-g

eneral

(D

SS)

[89,

26,

31,

36]

Bic

onducto

rR

pack

age

standalo

ne

GN

UG

PL

DM

Cs/

DM

Rs

list

(table

)D

MR

%m

eth

yla

tion

/lo

cus

(gra

ph)

04/29/2015

10/17/2016

12

Getis

DM

R[3

9]

C+

+pack

age

&R

scri

pts

standalo

ne

GN

UG

PL

DM

Rs

list

(table

)04/28/2016

09/28/2016

13

Com

Met

(B

isulfi

ghter)

[25]

C+

+pack

age

&P

yth

on

standalo

ne

CC

AN

S∗∗

DM

Rs

list

(table

)12/12/2014

09/29/2015

14

HM

M-F

isher

[37]

Rsc

ripts

standalo

ne

none

DM

Rs

list

(table

)D

MR

/lo

cus

meth

yla

tion

level

(gra

ph)

04/25/2014

02/29/2016

15

HM

M-D

M[3

8]

Rsc

ripts

standalo

ne

none

DM

Rs

list

(table

)D

MR

/lo

cus

meth

yla

tion

level

(gra

ph)

03/27/2014

03/24/2016

16

QD

MR

[18]

Java

pack

age

standalo

ne

web,

CL

Icust

om

∗∗∗

DM

Rs

list

(table

)D

MR

inU

CSC

Genom

eB

row

ser

(gra

ph)

05/10/2010

10/17/2012

17

CpG

MP

s[1

9]

Java

pack

age

&P

erl

scri

pt

standalo

ne

web,

CL

Inone

DM

Rs

list

(table

)06/20/2011

09/01/2015

18

SM

AR

T[3

5]

(SM

AR

T-B

S-S

eq)

Pyth

on

pack

age

standalo

ne

PSF

L∗∗∗∗

DM

Rs

list

(table

)05/17/2015

05/17/2015

19

CO

HC

AP

[23]

Bic

onducto

rR

pack

age

standalo

ne

GN

UG

PL

v3

DM

Cs

&D

MC

pG

isla

nds

list

(table

)D

MC

pG

isla

nds

meth

yla

tion

avg.

(gra

ph)

01/09/2014

10/17/2016

20

DM

AP

[28]

(m

eth

progs

dis

t)

Cpack

age

standalo

ne

none

DM

Rs

list

(table

)05/14/2013

08/28/2016

21

sw

DM

R[3

2]

Perl

&R

scri

pts

standalo

ne

GN

UG

PL

v3

DM

Rs

list

(table

)D

MR

meth

yla

tion

level

(gra

ph)

01/06/2013

06/15/2014

22

metil

ene

[34]

Cpack

age

standalo

ne

GN

UG

PL

v2

DM

Rs

list

(table

)05/08/2015

04/29/2016

18

Page 20: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

BiSeq [88], COHCAP [23], methylKit [21], and RADMeth [30]. This review evaluates the trade-off betweenthe sensitivity and specificity for individual methods using the receiver operator characteristic (ROC) basedon the regional p-values of the identified regions. The performance of each method is then assessed bycomputing and comparing the area under the ROC curve. According to this review, BiSeq and RADMethoutperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher, and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivityand specificity than the other three methods. In order to assess the performance of all of the availableapproaches, a benchmark analysis is needed. Due to the complex nature of the methylation data, the lack ofa gold standard for performance evaluation, and the lack of standardized format of the input data, buildinga benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope ofthis survey.

In addition to the conceptual overview, we also summarized the implementations of the approaches inTable 2. The summary includes platform information, license information, output format, published date,last update date, etc. While this is a condensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the input and output formats. Such details as well asa simulated, noise free dataset with known results are further requirements towards creating a comprehensivebenchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to beinfluenced by environmental factors, and they are known to regulate gene expression. Reliable identifica-tion of differential methylation using bisulfite sequencing data is an important problem in the analysis ofepigenetic data and statistical methods have employed to address the problem. In this study, we survey 22methods that identify differential methylation from bisulfite sequencing data. All the approaches surveyed inthis paper were developed within the last five years, which shows great interest for progress in this area. Ourmain objective in this survey is to provide the community a comprehensive view of the existing approachesthat identify differential methylation from bisulfite sequencing data. To do that, we classify the approachesinto seven categories based on their primary concepts and features. We summarize the distinguishing char-acteristics, benefits and limitations of each approach and category. This survey is intended to help potentialusers to choose the best differential methylation analysis method based on their requirements. It will helpthe researchers to design experiments to generate data that is better suited for the community. In addition,this survey will guide the developers to develop new efficient statistical models that identify differentialmethylation by considering key characteristics described here.

Key points

• Identification of the fittest approach, among all that are available, is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Due to the high computation cost, only a few web-based implementations of the approaches are cur-rently available.

19

Page 21: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

Short description of the authors

Adib Shafi is a PhD candidate in the Department of Computer Science at Wayne State University,USA. His research interests include biological pathway analysis, finding mechanism using multi-omicsdata and variant analysis.Cristina Mitrea is a PhD student in the Department of Computer Science at Wayne State University,USA. Her work is focused on research in data mining techniques applied to bioinformatics and com-putational biology. Other interests include network discovery and meta-analysis applied to pathwayanalysis.Tin Nguyen received his PhD from the Computer Science Department at Wayne State University.His research interests include computational and statistical methods for analyzing high-throughputdata. His current foci are meta-analysis and multi-omics data integration.Sorin Draghici currently holds the Robert J. Sokol, MD Endowed Chair in Systems Biology, as wellas appointments as Full Professor with the Department of Computer Science and the Department ofObstetrics and Gynecology, Wayne State University. He is also the head of the Intelligent Systemsand Bioinformatics Laboratory (ISBL) in the Department of Computer Science. His work is focusedon research in artificial intelligence, machine learning and data mining techniques applied to bioinfor-matics and computational biology. He has published two best-selling books on data analysis of highthroughput genomics data, 8 book chapters and over 160 peer-reviewed journal and conference papers.

Funding

This work was partially supported by the following grants: National Institutes of Health [RO1 DK089167,STTR R42GM087013]; National Science Foundation [DBI-0965741]; and by the Robert J. Sokol M.D. En-dowment in Systems Biology (to SD). Any opinions, findings, and conclusions or recommendations expressedin this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

[1] Aimee M Deaton and Adrian Bird. “CpG islands and the regulation of transcription”. In: Genes &Development 25.10 (2011), pp. 1010–1022.

[2] Manel Esteller. “Cancer epigenomics: DNA methylomes and histone-modification maps”. In: NatureReviews Genetics 8.4 (2007), pp. 286–298.

[3] Ryan Lister, Mattia Pelizzola, Robert H Dowen, et al. “Human DNA methylomes at base resolutionshow widespread epigenomic differences”. In: Nature 462.7271 (2009), pp. 315–322.

[4] Felix Krueger, Benjamin Kreck, Andre Franke, et al. “DNA methylome analysis using short bisulfitesequencing data”. In: Nature methods 9.2 (2012), pp. 145–151.

[5] Suhua Feng, Steven E Jacobsen, and Wolf Reik. “Epigenetic reprogramming in plant and animaldevelopment”. In: Science 330.6004 (2010), pp. 622–627.

[6] Anders M Lindroth, Xiaofeng Cao, James P Jackson, et al. “Requirement of CHROMOMETHYLASE3for maintenance of CpXpG methylation”. In: Science 292.5524 (2001), pp. 2077–2080.

[7] Achim Breiling and Frank Lyko. “Epigenetic regulatory functions of DNA modifications: 5-methylcytosineand beyond”. In: Epigenetics & chromatin 8.1 (2015), p. 1.

[8] Brian Hendrich and Adrian Bird. “Identification and characterization of a family of mammalian methyl-CpG binding proteins”. In: Molecular and Cellular Biology 18.11 (1998), pp. 6538–6547.

[9] Adrian P Bird and Alan P Wolffe. “Methylation-induced repression—belts, braces, and chromatin”.In: Cell 99.5 (1999), pp. 451–454.

[10] Peter A Jones. “Functions of DNA methylation: islands, start sites, gene bodies and beyond”. In:Nature Reviews Genetics 13.7 (2012), pp. 484–492.

20

Page 22: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

[11] R Alan Harris, Ting Wang, Cristian Coarfa, et al. “Comparison of sequencing-based methods to profileDNA methylation and identification of monoallelic epigenetic modifications”. In: Nature Biotechnology28.10 (2010), pp. 1097–1105.

[12] Oluwatosin Taiwo, Gareth A Wilson, Tiffany Morris, et al. “Methylome analysis using MeDIP-seq withlow DNA concentrations”. In: Nature Protocols 7.4 (2012), pp. 617–636.

[13] Hongcang Gu, Christoph Bock, Tarjei S Mikkelsen, et al. “Genome-scale DNA methylation mappingof clinical samples at single-nucleotide resolution”. In: Nature Methods 7.2 (2010), pp. 133–136.

[14] Mark D Robinson, Abdullah Kahraman, Charity W Law, et al. “Statistical methods for detectingdifferentially methylated loci and regions”. In: Frontiers in Genetics 5 (2014), p. 324.

[15] Hans-Ulrich Klein and Katja Hebestreit. “An evaluation of methods to test predefined genomic re-gions for differential methylation in bisulfite sequencing data”. In: Briefings in Bioinformatics (2015),bbv095.

[16] Xiaoqing Yu and Shuying Sun. “Comparing five statistical methods of differential methylation identifi-cation using bisulfite sequencing data”. In: Statistical Applications in Genetics and Molecular Biology15.2 (2016), pp. 173–191.

[17] Zhifu Sun, Julie Cunningham, Susan Slager, et al. “Base resolution methylome profiling: considerationsin platform selection, data preprocessing and analysis”. In: 7.5 (2015), pp. 813–828.

[18] Yan Zhang, Hongbo Liu, Jie Lv, et al. “QDMR: a quantitative method for identification of differentiallymethylated regions by entropy”. In: Nucleic Acids Research 39.9 (2011), e58.

[19] Jianzhong Su, Haidan Yan, Yanjun Wei, et al. “CpG MPs: identification of CpG methylation patternsof genomic regions from high-throughput bisulfite sequencing data”. In: Nucleic Acids Research 41.1(2013), e4.

[20] Kasper D Hansen, Benjamin Langmead, and Rafael A Irizarry. “BSmooth: from whole genome bisulfitesequencing reads to differentially methylated regions”. In: Genome Biology 13.10 (2012), R83.

[21] Altuna Akalin, Matthias Kormaksson, Sheng Li, et al. “methylKit: a comprehensive R package for theanalysis of genome-wide DNA methylation profiles”. In: Genome Biology 13.10 (2012), R87.

[22] Sheng Li, Francine E Garrett-Bakelman, Altuna Akalin, et al. “An optimized algorithm for detectingand annotating regional differential methylation”. In: BMC Bioinformatics 14.Suppl 5 (2013), S10.

[23] Charles D Warden, Heehyoung Lee, Joshua D Tompkins, et al. “COHCAP: an integrative genomicpipeline for single-nucleotide resolution DNA methylation analysis”. In: Nucleic Acids Research 41.11(2013), e117.

[24] Katja Hebestreit, Martin Dugas, and Hans-Ulrich Klein. “Detection of significantly differentially methy-lated regions in targeted bisulfite sequencing data”. In: Bioinformatics 29.13 (2013), pp. 1647–1653.

[25] Yutaka Saito, Junko Tsuji, and Toutai Mituyama. “Bisulfighter: accurate detection of methylatedcytosines and differentially methylated regions”. In: Nucleic Acids Research (2014), e45.

[26] Hao Feng, Karen N Conneely, and Hao Wu. “A Bayesian hierarchical model to detect differentiallymethylated loci from single nucleotide resolution sequencing data”. In: Nucleic Acids Research 42.8(2014), e69.

[27] Deqiang Sun, Yuanxin Xi, Benjamin Rodriguez, et al. “MOABS: model based analysis of bisulfitesequencing data”. In: Genome Biology 15.2 (2014), R38.

[28] Peter A Stockwell, Aniruddha Chatterjee, Euan J Rodger, et al. “DMAP: differential methylationanalysis package for RRBS and WGBS data”. In: Bioinformatics 30.13 (2014), pp. 1814–1822.

[29] Yongseok Park, Maria E Figueroa, Laura S Rozek, et al. “MethylSig: a whole genome DNA methylationanalysis pipeline”. In: Bioinformatics (2014), btu339.

[30] Egor Dolzhenko and Andrew D Smith. “Using beta-binomial regression for high-precision differentialmethylation analysis in multifactor whole-genome bisulfite sequencing experiments”. In: BMC Bioin-formatics 15.1 (2014), p. 215.

21

Page 23: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

[31] Hao Wu, Tianlei Xu, Hao Feng, et al. “Detection of differentially methylated regions from whole-genomebisulfite sequencing data without replicates”. In: Nucleic Acids Research 43.21 (2015), e141.

[32] Zhen Wang, Xianfeng Li, Yi Jiang, et al. “swDMR: a sliding window approach to identify differentiallymethylated regions based on whole genome bisulfite sequencing”. In: PloS one 10.7 (2015), e0132866.

[33] Amanda J Lea, Jenny Tung, and Xiang Zhou. “A flexible, efficient binomial mixed model for identifyingdifferential DNA methylation in bisulfite sequencing data”. In: PLoS Genetics 11.11 (2015), e1005650.

[34] Frank Juhling, Helene Kretzmer, Stephan H Bernhart, et al. “metilene: Fast and sensitive calling ofdifferentially methylated regions from bisulfite sequencing data”. In: Genome Research 26.2 (2016),pp. 256–262.

[35] Hongbo Liu, Xiaojuan Liu, Shumei Zhang, et al. “Systematic identification and annotation of humanmethylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specifichypomethylation in the regulation of cell identity genes”. In: Nucleic Acids Research 44.1 (2016),pp. 75–94.

[36] Yongseok Park and Hao Wu. “Differential methylation analysis for BS-seq data under general experi-mental design”. In: Bioinformatics 32.10 (2016), pp. 1446–1453.

[37] Shuying Sun and Xiaoqing Yu. “HMM-Fisher: identifying differential methylation using a hiddenMarkov model and Fisher’s exact test”. In: Statistical applications in genetics and molecular biology15.1 (2016), pp. 55–67.

[38] Xiaoqing Yu and Shuying Sun. “HMM-DM: identifying differentially methylated regions using a hiddenMarkov model”. In: Statistical applications in genetics and molecular biology 15.1 (2016), pp. 69–81.

[39] Yalu Wen, Fushun Chen, Qingzheng Zhang, et al. “Detection of differentially methylated regions inwhole genome bisulfite sequencing data using local Getis-Ord statistics”. In: Bioinformatics (2016),btw497.

[40] Susan J Clark, Aaron Statham, Clare Stirzaker, et al. “DNA methylation: bisulphite modification andanalysis”. In: Nature Protocols 1.5 (2006), pp. 2353–2364.

[41] Alexander Meissner, Andreas Gnirke, George W Bell, et al. “Reduced representation bisulfite sequenc-ing for comparative high-resolution DNA methylation analysis”. In: Nucleic Acids Research 33.18(2005), pp. 5868–5877.

[42] FASTX-Toolkit: FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit/. 2010.

[43] Robert Schmieder and Robert Edwards. “Quality control and preprocessing of metagenomic datasets”.In: Bioinformatics 27.6 (2011), pp. 863–864.

[44] Murray P Cox, Daniel A Peterson, and Patrick J Biggs. “SolexaQA: At-a-glance quality assessment ofIllumina second-generation sequencing data”. In: BMC Bioinformatics 11.1 (2010), p. 1.

[45] Marcel Martin. “Cutadapt removes adapter sequences from high-throughput sequencing reads”. In:EMBnet. journal 17.1 (2011), pp–10.

[46] Anthony M Bolger, Marc Lohse, and Bjoern Usadel. “Trimmomatic: a flexible trimmer for Illuminasequence data”. In: Bioinformatics 30.15 (2014), pp. 2114–2120.

[47] Trim Galore! http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/.

[48] Felix Krueger and Simon R Andrews. “Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications”. In: Bioinformatics 27.11 (2011), pp. 1571–1572.

[49] Pao-Yang Chen, Shawn J Cokus, and Matteo Pellegrini. “BS Seeker: precise mapping for bisulfitesequencing”. In: BMC Bioinformatics 11.1 (2010), p. 203.

[50] Brent Pedersen, Tzung-Fu Hsieh, Christian Ibarra, et al. “MethylCoder: software pipeline for bisulfite-treated sequences”. In: Bioinformatics 27.17 (2011), pp. 2435–2436.

[51] Elena Y Harris, Nadia Ponts, Aleksandr Levchuk, et al. “BRAT: bisulfite-treated reads analysis tool”.In: Bioinformatics 26.4 (2010), pp. 572–573.

22

Page 24: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

[52] Changjin Hong, Nathan L Clement, Spencer Clement, et al. “Probabilistic alignment leads to improvedaccuracy and read coverage for bisulfite sequencing data”. In: BMC Bioinformatics 14.1 (2013), p. 1.

[53] Ben Langmead, Cole Trapnell, Mihai Pop, et al. “Ultrafast and memory-efficient alignment of shortDNA sequences to the human genome”. In: Genome Biololgy 10.3 (2009), R25.

[54] Ben Langmead and Steven L Salzberg. “Fast gapped-read alignment with Bowtie 2”. In: Nature Meth-ods 9.4 (2012), pp. 357–359.

[55] Yuanxin Xi and Wei Li. “BSMAP: whole genome bisulfite sequence MAPping program”. In: BMCBioinformatics 10.1 (2009), p. 232.

[56] Yuanxin Xi, Christoph Bock, Fabian Muller, et al. “RRBSMAP: a fast, accurate and user-friendlyalignment tool for reduced representation bisulfite sequencing”. In: Bioinformatics 28.3 (2012), pp. 430–432.

[57] Thomas D Wu and Serban Nacu. “Fast and SNP-tolerant detection of complex variants and splicingin short reads”. In: Bioinformatics 26.7 (2010), pp. 873–881.

[58] Andrew D Smith, Wen-Yu Chung, Emily Hodges, et al. “Updates to the RMAP short-read mappingsoftware”. In: Bioinformatics 25.21 (2009), pp. 2841–2842.

[59] Christoph Bock, Sabine Reither, Thomas Mikeska, et al. “BiQ Analyzer: visualization and quality con-trol for DNA methylation data from bisulfite sequencing”. In: Bioinformatics 21.21 (2005), pp. 4067–4068.

[60] Yuichi Kumaki, Masaaki Oda, and Masaki Okano. “QUMA: quantification tool for methylation anal-ysis”. In: Nucleic Acids Research 36.suppl 2 (2008), W170–W175.

[61] Shuying Sun, Aaron Noviski, and Xiaoqing Yu. “MethyQA: a pipeline for bisulfite-treated methylationsequencing quality assessment”. In: BMC Bioinformatics 14.1 (2013), p. 1.

[62] Ke Hu, Angela H Ting, and Jing Li. “BSPAT: a fast online tool for DNA methylation co-occurrencepattern analysis based on high-throughput bisulfite sequencing data”. In: BMC Bioinformatics 16.1(2015), p. 1.

[63] Wen-Wei Liao, Ming-Ren Yen, Evaline Ju, et al. “MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data”. In: BMC Genomics 16.12 (2015), p. 1.

[64] Florian Eckhardt, Joern Lewin, Rene Cortese, et al. “DNA methylation profiling of human chromo-somes 6, 20 and 22”. In: Nature Genetics 38.12 (2006), pp. 1378–1385.

[65] Andrew E Jaffe, Andrew P Feinberg, Rafael A Irizarry, et al. “Significance analysis and statisticaldissection of variably methylated regions”. In: Biostatistics 13.1 (2012), pp. 166–178.

[66] Andrew P Feinberg and Rafael A Irizarry. “Stochastic epigenetic variation as a driving force of devel-opment, evolutionary adaptation, and disease”. In: Proceedings of the National Academy of Sciences107.suppl 1 (2010), pp. 1757–1764.

[67] Elizabeth E Cameron, Stephen B Baylin, and James G Herman. “p15INK4B CpG island methylationin primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptionalsilencing”. In: Blood 94.7 (1999), pp. 2445–2451.

[68] Sebastien A Smallwood, Heather J Lee, Christof Angermueller, et al. “Single-cell genome-wide bisulfitesequencing for assessing epigenetic heterogeneity”. In: Nature Methods 11.8 (2014), pp. 817–820.

[69] Katherine E Varley, David G Mutch, Tina B Edmonston, et al. “Intra-tumor heterogeneity of MLH1promoter methylation revealed by deep single molecule bisulfite sequencing”. In: Nucleic Acids Research37.14 (2009), pp. 4603–4612.

[70] Zakary S Singer, John Yong, Julia Tischler, et al. “Dynamic heterogeneity and DNA methylation inembryonic stem cells”. In: Molecular Cell 55.2 (2014), pp. 319–331.

[71] Marina Bibikova, Eugene Chudin, Bonnie Wu, et al. “Human embryonic stem cells have a uniqueepigenetic signature”. In: Genome research 16.9 (2006), pp. 1075–1083.

23

Page 25: A survey of the approaches for identifying differential ... · PDF fileA survey of the approaches for identifying di erential methylation using bisul te sequencing data Adib Sha ,

[72] Hyang-Min Byun, Kimberly D Siegmund, Fei Pan, et al. “Epigenetic profiling of somatic tissues fromhuman autopsy specimens identifies tissue-and individual-specific DNA methylation patterns”. In:Human molecular genetics 18.24 (2009), pp. 4808–4817.

[73] Stuart H Hurlbert. “Pseudoreplication and the design of ecological field experiments”. In: Ecologicalmonographs 54.2 (1984), pp. 187–211.

[74] Charlotte Soneson and Mauro Delorenzi. “A comparison of methods for differential expression analysisof RNA-seq data”. In: BMC Bioinformatics 14.1 (2013), p. 1.

[75] Hon Keung Tony Ng and Man-Lai Tang. “Testing the equality of two Poisson means using the rateratio”. In: Statistics in medicine 24.6 (2005), pp. 955–965.

[76] William S Gosset. “The Probable Error of a Mean”. In: Biometrika 6 (1908), pp. 1–25.

[77] E. S Pearson and H. O Hartley. “Biometrika tables for statisticians (vol. 2)”. In: Biometrika Trust(1976), p. 385.

[78] Gordon K. Smyth. “Linear models and empirical Bayes methods for assessing differential expressionin microarray experiments”. In: Statistical applications in genetics and molecular biology 3.1 (2004),Article3.

[79] Jelle J Goeman, Sara A Van De Geer, Floor De Kort, et al. “A global test for groups of genes: testingassociation with a clinical outcome”. In: Bioinformatics 20.1 (2004), pp. 93–99.

[80] Andrew Gelman. “Analysis of variance—why it is more important than ever”. In: The Annals ofStatistics 33.1 (2005), pp. 1–53.

[81] Hong-Qiang Wang, Lindsey K Tuominen, and Chung-Jui Tsai. “SLIM: a sliding linear model forestimating the proportion of true null hypotheses in datasets with dependence structures”. In: Bioin-formatics 27.2 (2011), pp. 225–231.

[82] Brent S Pedersen, David A Schwartz, Ivana V Yang, et al. “Comb-p: software for combining, analyzing,grouping and correcting spatially correlated P-values”. In: Bioinformatics 28.22 (2012), pp. 2986–2988.

[83] Yoav Benjamini and Yosef Hochberg. “Multiple hypotheses testing with weights”. In: ScandinavianJournal of Statistics 24.3 (1997), pp. 407–418.

[84] Ho Sung Rhee and B Franklin Pugh. “Comprehensive genome-wide protein-DNA interactions detectedat single-nucleotide resolution”. In: Cell 147.6 (2011), pp. 1408–1419.

[85] Dmitri V Zaykin. “Optimally weighted Z-test is a powerful method for combining probabilities inmeta-analysis”. In: Journal of Evolutionary Biology 24.8 (2011), pp. 1836–1841.

[86] Yutaka Saito and Toutai Mituyama. “Detection of differentially methylated regions from bisulfite-seqdata by hidden Markov models incorporating genome-wide methylation level distributions”. In: BMCGenomics 16.12 (2015), p. 1.

[87] Claude Elwood Shannon. “A mathematical theory of communication”. In: ACM SIGMOBILE MobileComputing and Communications Review 5.1 (2001), pp. 3–55.

[88] Katja Hebestreit and Hans-Ulrich Klein. BiSeq: Processing and analyzing bisulfite sequencing data. Rpackage version 1.14.0. 2015.

[89] Hao Wu, Chi Wang, and Zhijin Wu. “A new shrinkage estimator for dispersion improves differentialexpression detection in RNA-seq data”. In: Biostatistics 14.2 (2013), pp. 232–243.

24

View publication statsView publication stats