comparison and quantitative verification of mapping algorithms … · 2017-04-14 · comparison and...

Comparison and quantitative verification of mappingalgorithms for whole-genome bisulfite sequencingGovindarajan Kunde-Ramamoorthy1 Cristian Coarfa2 Eleonora Laritsky1

Noah J Kessler3 R Alan Harris4 Mingchu Xu4 Rui Chen4 Lanlan Shen1

Aleksandar Milosavljevic4 and Robert A Waterland14

1Department of Pediatrics Baylor College of Medicine USDAARS Childrenrsquos Nutrition Research CenterHouston TX 77030 USA 2Department of Molecular amp Cell Biology Baylor College of Medicine HoustonTX 77030 USA 3Department of Biology and Biochemistry University of Houston Houston TX 77204 USA and4Department of Molecular amp Human Genetics Baylor College of Medicine Houston TX 77030 USA

Received June 3 2013 Revised November 4 2013 Accepted November 29 2013

ABSTRACT

Coupling bisulfite conversion with next-generationsequencing (Bisulfite-seq) enables genome-widemeasurement of DNA methylation but posesunique challenges for mapping However despite aproliferation of Bisulfite-seq mapping tools nosystematic comparison of their genomic coverageand quantitative accuracy has been reportedWe sequenced bisulfite-converted DNA from twotissues from each of two healthy human adultsand systematically compared five widely usedBisulfite-seq mapping algorithms BismarkBSMAP Pash BatMeth and BS Seeker Weevaluated their computational speed and genomiccoverage and verified their percentage methylationestimates With the exception of BatMeth allmappers covered gt70 of CpG sites genome-wide and yielded highly concordant estimates ofpercentage methylation (r2

095) Fourfold variationin mapping time was found between BSMAP(fastest) and Pash (slowest) In each library 8ndash12of genomic regions covered by Bismark and Pashwere not covered by BSMAP An experiment usingsimulated reads confirmed that Pash has an excep-tional ability to uniquely map reads in genomicregions of structural variation Independent verifica-tion by bisulfite pyrosequencing generally confirmedthe percentage methylation estimates by themappers Of these algorithms Bismark providesan attractive combination of processing speedgenomic coverage and quantitative accuracywhereas Pash offers considerably higher genomiccoverage

INTRODUCTION

DNA methylation which occurs predominantly at cyto-sines within CpG dinucleotides in the mammaliangenome is an epigenetic mark fundamental to develop-mental processes including genomic imprinting silencingof transposable elements and differentiation (1) Couplingbisulfite modification (2) with next-generation sequencing(Bisulfite-seq) provides information about cytosine methy-lation genome-wide at single-base resolution (3ndash5)Bisulfite modification deaminates unmethylated cytosines(ie most cytosines) to uracil and these are subsequentlyconverted to thymine during polymerase chain reactionamplification The consequent reduced sequence complex-ity makes it challenging to map Bisulfite-seq reads to thereference genome using standard short read alignmenttools (6) Additionally the advent of Bisulfite-seq forcesthe question of what is the optimal resolution at which tostudy the methylome Regional methylation changes en-compassing several CpG sites may be more biologicallymeaningful than those occurring only at individualCpGs further it may often be impractical to performanalysis and validation at the level of individual CpG sitesSeveral approaches have been developed to map

Bisulfite-seq reads (6) including lsquowild cardrsquo and lsquothreeletterrsquo aligning There are two variations of the wild cardapproach the first allows either Cs or Ts in reads to mapto Cs in the reference genome (7) The second enumeratesall C to T combinations for each k seed-length and thenaligns by hashing and extension (89) In the three-letterapproach all Cs in both the reference genome and readsare converted to Ts and mapping is performed using aseed and extend approach (10ndash12) Both strategies can useeither gapped or ungapped alignment depending on theunderlying short read alignment tool The gapped align-ment method handles substitutions and small indelsefficiently (13)

To whom correspondence should be addressed Tel +1 713 798 0304 Fax +1 713 798 7101 Email waterlandbcmedu

Published online 3 January 2014 Nucleic Acids Research 2014 Vol 42 No 6 e43doi101093nargkt1325

The Author(s) 2014 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (httpcreativecommonsorglicensesby-nc30) which permits non-commercial re-use distribution and reproduction in any medium provided the original work is properly cited For commercialre-use please contact journalspermissionsoupcom

Bisulfite-seq mapping algorithms are mainly used toestimate percentage methylation at specific CpG sites(methylation calls) but also provide the ability to callsingle nucleotide and small indel variants (13) and copynumber and structural variants (14) The analyses in thisarticle focus exclusively on issues relevant to methylationcalls clearly an algorithmrsquos ability to map reads invarious sequence contexts and make accurate methylationcalls may have profound implications for the interpret-ation of Bisulfite-seq experiments Descriptions of newBisulfite-seq mapping tools typically compare globalmetrics such as proportion of uniquely mapped readsglobal percentage methylation computational require-ments and running time Benchmarking studies havebeen performed using real data downloaded from publicdatabases [mainly human (3) or plant data (15)] simulateddata (16) or combinations of both (10121718) Howevernone of these previous studies performed independentquantitative verification of methylation calls The onlyprevious comparison of Bisulfite-seq mapping algorithmsusing independently generated sequencing data involved asingle reduced representation bisulfite sequencing data setobtained from one human sample (19) That studyevaluated the mapping efficiency of Bismark BSMAPand RMAPBS (7) as a function of read length andadaptor sequences They also compared the totalnumber of methylated CpG sites genome-wide but didnot compare overlap of CpG sites covered by the threemappers or assess accuracy of methylation callsHence although proper analysis and interpretation of

the increasing number of expensive Bisulfite-seq datasets critically depends on their performance one mayconclude that widely used mapping methods remainpoorly characterized To address this need we selectedthree mapping algorithms for detailed comparisonBismark (11) BSMAP (9) and Pash (8) which useBowtie (20) SOAP (21) and in-house aligners respect-ively Bismark and BSMAP are the most widely usedthree-letter and wild card mapping algorithms respect-ively (6) Pash performs a heuristic alignment of k-mermatches To complement this main comparison we alsoevaluated the performance of two additional mapping al-gorithms BS Seeker (10) and BatMeth (17) We generatedfour human methylomes (representing two tissues fromeach of two individuals) and mapped the Bisulfite-seqreads independently using the five algorithms Despitegenerally excellent concordance of the mapping resultsour analysis (and subsequent independent verification byquantitative bisulfite pyrosequencing) highlights import-ant differences among the mapping algorithms

MATERIALS AND METHODS

Sample collection Bisulfite-seq library preparationand sequencing

Two tissue samples peripheral blood lymphocyte (PBL)and hair follicle (HF) from two healthy male adults (C01and C02) were collected in accordance with institutionalIRB regulations HFs (30ndash50) were obtained by pluckingscalp or facial hair from the same individuals PBLs were

isolated by Ficoll gradient centrifugation Tissueswere stored at 80C until isolation of genomic DNAby proteinase-k digestion and phenolndashchloroformextraction (22)

Illumina libraries were generated according to themanufacturerrsquos sample preparation protocol for genomicDNA Approximately 1 mg of genomic DNA was frag-mented to 200ndash500 bp and end-repaired The 50-ends ofDNA fragments were phosphorylated and a singleadenine base was added to the 30-end Illumina adaptorswere ligated to the genomic DNA

Bisulfite modification was performed using the EZDNA Methylation-Direct kit (Zymo Research) accordingto the manufacturerrsquos instructions The bisulfite-modifiedDNA was amplified by using adaptor-specific primersand fragments of 200ndash500 bp were isolated by bead puri-fication The quantity and size distribution of sequencinglibraries were determined using the Pico Green fluores-cence assay and the Agilent 2100 Bioanalyzer respect-ively The DNA was sequenced on the Illumina HiSeq2000 as 100-bp paired-end reads following the manufac-turerrsquos protocols

Reads quality control and mapping

For each library the standard Illumina pipeline was usedto perform base calling and the results were generated infastq format Quality control of reads was accessed byrunning the FastQC program FastQC (httpwwwbioinformaticsbabrahamacukprojectsfastqc) helps to de-termine the best quality score and the right read length fortrimming as the signals decline while Illumina cyclesprogress Because the majority of the bases had qualityscores 28 we decided to use a quality score filteringvalue 28 and read length 50 bp The Cutadapt (23)program was used to trim adaptor sequences andperform quality score and read length filtering Thisresulted in gt85 of the reads for mapping and analysis

QC-passed reads were mapped to the University ofCalifornia Santa Cruz (UCSC httpgenomeucscedu)hg19 genome build using Pash 30 Bismark 074(Bowtie 1 mode) BSMAP 26 BatMeth 104 and BSSeeker2 For all the mapping algorithms we useddefault parameters as recommended by the authorsexcept the mismatch parameter that was set to 7 as thelibraries had longer read length The uniquely mappedreads from each mapper were further processed usingtheir respective post-processing scripts provided toestimate the percentage coverage and percentage methyla-tion for CpG sites genome-wide All the mapping was per-formed using a compute cluster with 36 nodes each nodecontaining 8 Intel Xeon E5540 CPUs and 24-GB RAMThe command lines used for all the mappers are providedin Supplementary Methods

Data analysis

Identification of genome-wide CpG site coverage andpercentage methylationTo identify the total number of CpG sites and theircoordinates genome-wide (UCSC hg19) an in-housePerl program was used which identified 282 million

e43 Nucleic Acids Research 2014 Vol 42 No 6 PAGE 2 OF 10

CpG sites Each CpG site was counted toward coverage ifthe read depth was 10 as the overall coverage of eachlibrary was 26 The total coverage and overlap of CpGsites between different mappers were calculated usingan in-house Perl program The percentage methylationscatter plots and Pearson correlations (r2) werecomputed using the R package Processed data withtotal number of reads and methylated reads for individualCpG sites are available in GEO (GSE44806)

Identification of 200-bp bins with two CpG sites coverageand percentage methylationTo logically identify an appropriate lsquobinrsquo size to interrogateDNAmethylation we divided the genome into five differentbin sizes from 100 to 500bp with an increment of 100bpand computed the number of CpG sites in each bin and alsothe percentage of CpG site coverage genome-wide Thisresulted in 62 million bins (200-bp bins with at least twoCpG sites) which covered 85 of the total CpG sites andeliminated 60 of the reads as these bins were present onlyin 40 of the genome All bins containing lt4 CpG siteswere considered covered if at least 2 sites were covered by10 reads and those containing 4 CpG sites were con-sidered covered if at least half of them were covered by 10reads The overlap of bins not covered by BSMAP orBismark across four libraries (4-way Venn diagram) wasgenerated using the lsquoVennDiagramrsquo package (24) availablein R To test the significance of correlation between regionscovered by all mappers and not covered by others theabsolute residuals of percentage methylation of individualswere compared using two-tailed t tests in R

Characterization of genomic regionsTo characterize the genomic features of regions covered byall mappers and not covered by others we used DGV StructVar Segmental Dups and RepeatMasker tracks from theUCSC Genome Browser annotation database The per-centage overlap and the extent of overlap of bins withvarious genome features were computed using a combin-ation of BEDtools (25) and in-house Perl scripts The per-centage nucleotide compositions for each library werecomputed using the Bioperl library (httpwwwbioperlorgwikiMain_Page) in-house Perl scripts and R softwareTo test the significance of percentage nucleotide compositionbetween regions covered by all mappers and othercategories for each library we computed the average per-centage nucleotide composition and performed two-tailedpaired t-tests (n=4 libraries) with equal variance in R

Bisulfite-seq read simulation and analysisWe simulated 10 million reads from hg19 using thesoftware RMAP-bs (17) Reads were generated withlength=100 bp and allowing maximum three mismatcheswith bisulfite conversion efficiency of 99 The 10 millionreads were mapped to the hg19 build using BismarkBSMAP and Pash mapping algorithms allowing sevenmismatches

Quantitative verification of DNA methylationTo verify the accuracy of percentage methylation esti-mates of bins (200 bp) not covered by BSMAP we

designed 18 pyrosequencing assays (SupplementaryTable S8) and performed site-specific analysis of CpGmethylation Bisulfite modification and pyrosequencingof the regions were performed as previously described(22) All pyrosequencing assays were first validated forquantitative accuracy by running methylation standardscomposed of known mixtures of completely methylatedand unmethylated human genomic DNA (26)

RESULTS

CpG site level coverage and concordance of the mappers

Bisulfite-seq data sets were generated for PBL and HFDNA from each of two healthy males We chose thesetwo tissues because they represent two different germlayer lineages (mesoderm and ectoderm respectively)We generated an average of 400 million 100-bp paired-end reads for each library achieving 26 averagecoverage per library 95 of the reads were retainedafter adaptor trimming quality score filtering (28) andread length filtering (50 bp) (27) Filtered reads weremapped to the human reference genome UCSC hg19build We mapped the reads as single-end reads forgreatest generalizability We focused our main analysison a comparison of Bismark BSMAP and Pash Foreach library Bismark BSMAP and Pash uniquelymapped 77ndash82 78ndash83 and 81ndash87 of the reads re-spectively (Supplementary Table S1) Analysis of onelibrary (Supplementary Table S2) indicated that gt96of reads were mapped by at least one mapper A bench-marking analysis of mapping and post-processing1 million reads on a single 8-core processor (Table 1)indicated that Bismark and BSMAP are substantiallyfaster than Pash All the algorithms include post-processing scripts to calculate coverage and percentagemethylation at the CpG site levelIn each of the four libraries each of these mapping

algorithms covered gt70 of the 282 million CpG sitesgenome-wide with a read depth of 10 (Figure 1A andSupplementary Figure S1) More than 67 of all CpGsites genome-wide were covered by all three mappers ineach library (Figure 1A and Supplementary Figure S1)Overall the three mappers exhibited excellent concord-ance in CpG site-specific methylation calls (r2 095)(Figure 1BndashD) Contrary to the conjecture of Chatterjee

Table 1 Comparison of mapping and post-processing times (in

seconds) for 1 million reads

Algorithm Mapping Post-processing Total

Bismark 1514 81 1595BSMAP 800 1081 1881Pash 3486 1504 4990BS Seeker 1324 3867 5191BatMeth 904 70 974

Post-processing times include time required to estimate percentagemethylation at each methylated cytosine but do not include time toload the reference genome into memory (required for BSMAP and Pashonly)

PAGE 3 OF 10 Nucleic Acids Research 2014 Vol 42 No 6 e43

et al (19) that lsquowild cardrsquo mappers may be biased towardhighly methylated reads average genome-wide DNAmethylation estimates did not differ appreciably amongthe mappers (eg C01-HF methylation was 783787 and 769 by Bismark BSMAP and Pashrespectively)We also mapped the reads from the four Bisulfite-seq

libraries using BatMeth and BS Seeker AlthoughBatMeth was fast (Table 1) it uniquely mapped only50 of the reads in each library (SupplementaryTable S1) which is comparable with the developersrsquoresults mapping Bisulfite-seq reads from an H1 cell line(17) Owing to its low mapping efficiency we excludedBatMeth from further consideration BS Seeker whichyielded average mapping speed but a long post-processingtime (Table 1) uniquely mapped 80 of the readsfrom each library (Supplementary Table S1) similar tothe other three mapping algorithms Compared withBismark BSMAP and Pash BS Seeker mapping resultswere most concordant with those of Bismark Of all thereads in multiple libraries mapped by BS Seeker gt98were mapped to the same position by Bismark(Supplementary Table S3) Further BS Seeker percentagemethylation calls at individual CpG sites genome-widewere highly correlated with those of Bismark (r2=094)(Supplementary Figure S2) Given that BS Seekerrsquosmapping results are highly concordant with those of

Bismark for clarity and simplicity we will focus subse-quent analyses on Bismark BSMAP and Pash

Coverage of 200-bp bins containing at least two CpG sites

With the goal of drawing the most biologically meaningfulcomparisons among the mapping results and to provide abasis for verification of DNA methylation estimates wesought to identify a logical and appropriate approach toefficiently collapse the site-specific data into genomicregions while still maintaining sufficient resolution to dis-criminate between genomic regions with different methy-lation patterns Focusing on 200-bp lsquobinsrsquo containingat least two CpG sites covers 85 of CpG sites in thehuman genome (Supplementary Figure S3A) whileeliminating from consideration the 60 of reads thatmap to CpG-poor regions (Supplementary Figure S3B)(with commensurate reduction in downstream computa-tional requirements) In addition to efficiently covering themajority of CpG sites in the genome 200-bp bins are bio-logically attractive as they approximate nucleosomalresolution In total we identified 62 million 200-bp binscontaining at least two CpG sites in the hg19 UCSCgenome build (henceforth referred to as lsquobinsrsquo) All binscontaining lt4 CpG sites were considered covered if atleast two sites were covered by at least 10 reads andthose containing gt4 CpG sites were considered coveredif at least half of the CpG sites were covered by at least

Figure 1 All three mappers provide excellent coverage and highly concordant estimates of CpG methylation genome-wide (A) Percentage of CpGsites covered by Pash Bismark and BSMAP and the overlaps among them Each mapping algorithm covers gt80 of the CpG sites and 78 arecovered by all the three mapping algorithms lsquoNot covered by BSMAPrsquo for example indicates the percentage of CpG sites that are covered by Pashand Bismark but not by BSMAP Correlations of CpG site-specific percentage methylation calls among the different mapping algorithms are high(B) Bismark versus Pash (r2=095) (C) BSMAP versus Pash (r2=096) and (D) BSMAP versus Bismark (r2=097) Red yellow and green indicatehigh moderate and low densities respectively All data are for C01-HF library only as an example


10 reads In each of the four libraries gt78 of bins werecovered by all three mapping algorithms (Figure 2A) Inall 8ndash12 of bins were not covered by BSMAP 5ndash6were not covered by Bismark and lt1 of the bins werenot covered by Pash Hence although BSMAP mappedmore reads than Bismark (Supplementary Table S1) moreof the reads mapped by Bismark were within genomicregions containing CpG sites Although failing to cover5ndash10 of bins may not seem like a huge loss we askedto what extent such losses would be compounded whenperforming comparisons across different libraries Amongthe four libraries in our experiment gt18 of the binswere not covered by BSMAP and 9 of bins were notcovered by Bismark in at least one library (Figure 2B andC) Such losses will of course increase with the number ofsamples under comparison Hence the choice of mappingalgorithm can have a substantial impact on the results of acomparative methylome analysis

Choice of the mapping algorithm affects ability to detectinterindividual and tissue-specific methylation differences

Bins that are covered by all mappers (Figure 3A) andthose not covered by Bismark (Figure 3B) exhibited ahigh correlation (r2gt 09) of average percentage methyla-tion between individuals whereas bins not coveredby BSMAP (Figure 3C) showed a lower correlation(r2=084 Plt 1010 compared with those covered by all

mappers) Bins covered by all mappers (Figure 3D)showed substantial tissue-specific variation in methylationcalls between HF and PBL (r2=043) Bins not coveredby Bismark (Figure 3E) showed a slightly but significantlyhigher correlation between HF and PBL (r2=053Plt 1010) indicating that these tend not to be regionsof tissue-specific variation in DNA methylationRemarkably among bins not covered by BSMAP(Figure 3F) there was a significantly lower inter-tissuecorrelation (r2=027 Plt 1010) indicating that regionsin which BSMAP fails to map also tend to be regions oftissue-specific variation Hence at least in the two individ-uals and two tissues we compared regions that weremapped by Bismark and Pash but not by BSMAP wereenriched for both interindividual and tissue-specificvariation in DNA methylation This suggests that thesequence characteristics that render certain genomicregions difficult for BSMAP to map are also associatedwith mechanisms of interindividual and tissue-specificepigenetic regulation

Characterization of genomic regions differentiallycovered by the mappers

We sought to identify genomic features that potentiallyexplain the differential mapping characteristics of thethree algorithms Relative to bins covered by all threemappers those not covered by BSMAP were found to

Figure 2 Genome-wide coverage of 200-bp bins containing 2 CpG sites by different mapping algorithms (A) Percentage of bins covered by allmappers and not covered by individual mappers across all four libraries (C01-HF C01-PBL C02-HF and C02-PBL) More than 78 of the bins arecovered by all three mappers in each library (B) Comparing methylation across four libraries requires that each bin be covered in all four librariesFully 18 of bins are not covered by BSMAP in at least one library and (C) 9 of bins are not covered by Bismark in at least one library


be similar in terms of proportion overlapping with struc-tural variations segmental duplications and repetitiveelements (Figure 4A) Conversely bins covered by Pashand BSMAP and those covered only by Pash were highlyenriched for structural variations and segmental duplica-tions (Figure 4A) For bins that do overlap with thesefeatures the extent of overlap is essentially 100

(Supplementary Figure S4) Although bins covered differ-entially by the mappers showed a similar high prevalenceof repetitive elements (Figure 4A) SINE elements wereenriched in bins not covered by Bismark and depleted inbins not covered by BSMAP (Supplementary Figure S5)We next evaluated nucleotide composition of the bins notcovered by different mappers (complete details inSupplementary Table S4) Bins covered by all threemappers showed equal percentage composition (25each) of A T G and C (Figure 4B) comparable withthat in bins not covered by Bismark (Figure 4C)However bins not covered by BSMAP were highlyenriched in T (P=165 105) and depleted in G(P=79 106 Figure 4D) a similar but less dramaticpattern was found in regions covered only by Pash(Figure 4E) Together these data indicate that regionscalled as uniquely mapped by BSMAP and Pash butnot by Bismark are largely associated with structural vari-ations and segmental duplications Regions mapped byBismark and Pash but not by BSMAP on the otherhand are characterized most strikingly by low G content

Mapping simulated reads confirms unique ability of Pash

Our analysis (Figure 4A) suggests that BSMAP and Pashcan uniquely map reads within regions of structural vari-ation and segmental duplication that are not covered byBismark Because we performed mapping at relatively lowstringency (allowing up to seven mismatches) an alterna-tive explanation is that some of the lsquouniquersquo mappingsby BSMAP and Pash are incorrect Of all readsmapped gt95 included only two or fewer mismatches(Supplementary Table S5) suggesting that our low strin-gency did not have a dramatic effect on mappingaccuracy To test this directly we performed a mappingexperiment using simulated reads We simulated10 million 100-bp reads from hg19 and mapped themtwice by each of Bismark BSMAP and Pashallowing up to three or seven mismatches respectivelyThe number of mismatches allowed had essentially noeffect on mapping efficiency (Supplementary Table S6)so we focused on the results based on up to sevenmismatches (the same stringency we used in mapping thelibraries) Pash mapped fewer of the simulated readsoverall (78 M versus 93 M for the other two mappers)but nearly the same number within the 200-bp bins (ieregions containing most of the CpG sites) (SupplementaryTable S7) More than 995 of the mappings by Bismarkand BSMAP were accurate compared with 95 forPash (Supplementary Table S7) Nonetheless when wecharacterized the genomic features of the regionsmapped differentially by the three algorithms weobtained results strikingly similar to those in Figure 4ABins covered only by Pash were twice as likely to overlapwith structural variation and 20 times as likely to overlapwith segmental duplication in particular compared withbins covered by all three mappers (SupplementaryFigure S6) However the simulation experiment did notconfirm a special ability of BSMAP to map in theseregions (compare lsquonot covered by Bismarkrsquo in Figure 4Aand Supplementary Figure S6)

Figure 3 Evaluation of interindividual and tissue-specific variation ofpercentage methylation according to different mapping algorithms(AndashC) Correlation of percentage methylation across individuals ac-cording to mapping category (A) Average percentage methylation(per bin across all mappers) is highly concordant in individual 2(C02) versus individual 1 (C01) (r2=091) (B) For bins not coveredby Bismark interindividual correlation (r2=091) is comparable withthat across all mappers (P=052) [Note Statistical significance isindicating whether the correlation shown is different from that in(A)] (C) For bins not covered by BSMAP interindividual correlationis reduced [r2=084 significantly lower than in bins covered by allmappers (Plt 1010)] (DndashF) Correlation of percentage methylationbetween tissues (PBL versus HF) according to different mapping algo-rithms (D) Bins covered by all mappers show substantial tissue-specificvariation (r2=043) (E) For bins not covered by Bismark inter-tissuecorrelation is significantly higher (r2=053 Plt 1010 relative to thosecovered by all mappers) [Statistical significance is indicating whetherthe correlation shown is different from that in (D)] (F) For bins notcovered by BSMAP inter-tissue correlation is significantly lower(r2=027 Plt 1010 relative to those covered by all mappers)


Quantitative verification of regions by pyrosequencing

In bins that are covered by some mappers but not byothers there are two potential scenarios The methylationcalls may be reliable or some mappers may map morepromiscuously than others providing erroneous methyla-tion calls Without independent quantitation of DNAmethylation in the sequenced samples it is impossible todistinguish between these two possibilities Therefore weverified regional DNA methylation estimates by quantita-tive bisulfite pyrosequencing Because most of the regionsnot mapped by Bismark or Pash overlap with structuralvariations and segmental duplications (Figure 4A) wewere unable to design pyrosequencing assays for theseregions Therefore our verification focused on regionsnot covered by BSMAP We designed 18 pyrosequencing

assays by selecting among bins showing low medium andhigh levels of methylation as well as those showing tissue-specific variation All pyrosequencing assays were firstvalidated for quantitative accuracy by running methyla-tion standards composed of known mixtures of completelymethylated and unmethylated human genomic DNA(2628) Of 18 assays designed 4 were found to be unreli-able Among the remaining 14 the percentage methylationestimates by Bismark and Pash agreed remarkably wellwith the pyrosequencing data (Figure 5) with just a fewregions exhibiting modestly overestimated (Figure 5D Eand K) or underestimated methylation calls (Figure 5F Gand J) These data clearly indicate that in the bins thatwere not mapped by BSMAP Bismark and Pash wereable to map correctly and provide reliable estimates ofDNA methylation

DISCUSSION

This is the first study comparing several Bisulfite-seqmapping algorithms on a large data set across multiplesequencing libraries with biologically meaningful samplevariation Our goal was not to perform a comprehensivecomparison of the many published mapping algorithmsbut rather to determine the extent to which mappingcharacteristics of several commonly used algorithms mayaffect experimental outcomes Previous studies reportingperformance of Bisulfite-seq mapping algorithms(8ndash111617) used only a small number of real publicdata (2ndash15 million reads) or simulated data (1 millionreads) to compare mapping efficiency The only othermapper comparison study using independently generatedsequence data (19) was limited to reduced representationbisulfite sequencing and focused mainly on the perform-ance of mapping efficiency as a function of read lengthand coverage Those authors generated independentdata but did not verify the percentage methylationlevels quantitatively They observed that all the alignerscovered gt80 of the CpG sites within the reduced repre-sentation (RR) genome consistent with our findingsgenome-wideAlthough Bisulfite-seq offers single CpG resolution

most whole-genome Bisulfite-seq studies analyze the dataon a lower level of resolution using eg a 1-kb slidingwindow with a 100-bp step (3) or a 2- or 5-kb tilingwindow (529) However no previous studies have at-tempted to identify an optimal resolution based on anintegrated analysis of CpG density and genomic contentacross various bin sizes The approach we propose hereselecting 200-bp bins with at least two CpG sitescombines excellent genomic coverage (85 of CpG sitesgenome-wide) and high resolution while reducing down-stream computational requirements by eliminating fromconsideration 60 of the reads mapping into CpG-poorregions (Supplementary Figure S3) This approach ifadopted broadly could simplify direct comparisonsamong various Bisulfite-seq studies Moreover thesehighly informative regions could also provide a basis forBisulfite-seq-targeted enrichment reagents (30) substan-tially decreasing sequencing requirements

Figure 4 Characterization of genomic regions differently covered by thethree mapping algorithms (all four libraries combined) (A) Percentage ofcovered 200-bp bins overlapping with different genomic features bymapper category Compared with regions covered by all three mappersthose not covered by Bismark and covered only by Pash are highlyenriched for overlap with segmental duplications and structural vari-ations In regions covered by all mappers (B) and in those not coveredby Bismark (C) percentage nucleotide compositions are all equal (A redT blue G green C purple) (D) Regions not covered by BSMAP areenriched for lsquoTrsquo and depleted of lsquoGrsquo nucleotides (E) Regions covered onlyby Pash have an under-representation of lsquoGrsquo nucleotides


The goal of most Bisulfite-seq experiments is to drawcomparisons across multiple samples Our mapper com-parison study is the first to encompass multiple Bisulfite-seq libraries providing the unique opportunity to quantifythe degree of mapping losses as the sample number in-creases Our analysis showed that in each library up to10 of bins were covered by Bismark and Pash but not byBSMAP however when attempting to draw comparisonsacross just four libraries this loss of data escalated to 18of bins genome-wide (Figure 2B) Interestingly theseregions are significantly enriched for both interindividual(Figure 3C) and tissue-specific variation (Figure 3F) Thisindicates that genomic regions in which BSMAP fails to

map may be of particular interest with respect to biologic-ally meaningful variation in DNA methylation Notablythese regions do not overlap with structural variation andsegmental duplication (Figure 4A) but tend to be low inG and rich in T bases (Figure 4D) Hence these regionsare likely difficult to map because bisulfite conversionintroduces additional T bases leading to severelydecreased sequence complexity Subtle distinctions in thealignment strategies of the mappers likely explain thedifferences in performance Both Bowtie1 and SOAPalign sequences using a primary seed at the start of thesequence Therefore potential differences in performancebetween Bismark and BSMAP likely stem from how the

Figure 5 Verification of percentage methylation by quantitative bisulfite pyrosequencing in bins not covered by BSMAP (AndashC) Regions in whichBismark and Pash found low percentage methylation in all four libraries (DndashF) Regions in which Bismark and Pash found tissue-specific variation(ie low in HF and higher in PBL) (GndashI) Regions in which Bismark and Pash found tissue-specific variation (ie high in HF and lower in PBL)(JndashN) Regions showing medium to high percentage methylation across all four libraries Overall the percentage methylation measured by quanti-tative pyrosequencing compared favorably with the estimates obtained by Bisulfite-seq


underlying mappers are used BSMAP considers multipleseeds for T-rich sites and might not initiate mappings incomplex regions whereas in the case of Bismark themapping is done on the three-letter alphabet exploringall possible mappings Pash first identifies good anchors(perhaps without T-converted base pairs) throughout aread and then extends the alignment to the entire readHowever BSMAP attempts to seed an alignment only atthe beginning of each read and uses heuristics that abortsearches in highly ambiguous regions to maximizemapping speed This lsquoprimary seedingrsquo could explain thelower coverage we obtained using BSMAP By generatingour own Bisulfite-seq libraries we had the ability tofollow up with quantitative verification Our quantitativebisulfite pyrosequencing showed excellent agreement withBisulfite-seq percentage methylation estimates (Figure 5)demonstrating that regions in which BSMAP failed tomap are mapped correctly by the other mappers

Regions covered by BSMAP and Pash but notby Bismark showed a high prevalence of overlap withstructural variation particularly segmental duplications(Figure 4A) (However owing to their non-uniquenessverification of methylation calls in these regions bypyrosequencing was not feasible) Based on their abilityto tolerate mismatches and find key anchoring base pairslsquowild cardrsquo algorithms should in theory outperform theirlsquothree-letterrsquo counterparts in ambiguous regions such assegmental duplications and repeats Moreover BurrowsndashWheeler-based aligners such as Bowtie (which Bismarkincorporates) achieve high speeds by relying on genomicindices with lossy compression and thus could ignorepotential unique alignments The performance of Pash inmapping duplicative regions containing only smallamounts of unique sequence similarity has been exten-sively validated (8) Here using simulated reads (inwhich we know whence the reads originated) we con-firmed that Pash has an exceptional ability to mapBisulfite-seq reads in regions of structural variation(Supplementary Figure S6) However the simulation didnot confirm this ability for BSMAP Although Pash andBSMAP are both using the lsquowild-cardrsquo strategy they usedifferent approaches Pash explores the various k-mersin the reads oblivious to the complexity of the genomicsequences whereas BSMAP hashes multiple seeds forgenomic locations and might not index for readmapping regions of high complexity (eg with a largenumber of CGs and potential CT ambiguities in thesequenced reads) It should be noted that our comparisonwas focused on the human genome and included only100-bp reads mapped as single-end reads Comparisonresults may be different in other species for other readlengths or using paired-end mapping

CONCLUSIONS

In summary we have completed a comparative analysisof five read mappers for methylome mapping Althoughmethylation calls derived using these widely used mappersare highly concordant we identified significant and im-portant differences in their performance in specific types

of genomic regions In particular Bismark provides anattractive combination of processing speed genomiccoverage and quantitative accuracy whereas Pashalthough computationally more demanding offers consid-erably higher genomic coverage owing to its ability to mapwithin regions of structural variation BS Seeker yieldsmapping results similar to those of Bismark butprovides more detailed information in post processingwhich may be attractive to some users We hope ourresults will help investigators select the Bisulfite-seqmapping algorithm with optimal performance character-istics for their project and provide useful guidance towardthe development of the next generation of mapping tools

ACCESSION NUMBERS

The raw sequence reads for all four libraries have beendeposited in fastq format in GEO [GSE44806]

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

FUNDING

The National Institutes of Health RoadmapEpigenomics Program [U01DA025956 to AM andRAW] National Institutes of HealthmdashNIDDK[1R01DK081557] United States Department ofAgriculture [CRIS 6250-51000-055 to RAW] Fundingfor open access charge National Institutes of HealthRoadmap Epigenomics Program [U01 DA025956]

Conflict of interest statement AM receives royalties fromand participates in the commercial licensing of the Pashprogram

REFERENCES

1 JonesPA (2012) Functions of DNA methylation islands startsites gene bodies and beyond Nat Rev Genet 13 484ndash492

2 ClarkSJ HarrisonJ PaulCL and FrommerM (1994) Highsensitivity mapping of methylated cytosines Nucleic Acids Res22 2990ndash2997

3 ListerR PelizzolaM DowenRH HawkinsRD HonGTonti-FilippiniJ NeryJR LeeL YeZ NgoQM et al(2009) Human DNA methylomes at base resolution showwidespread epigenomic differences Nature 462 315ndash322

4 MeissnerA MikkelsenTS GuH WernigM HannaJSivachenkoA ZhangX BernsteinBE NusbaumC JaffeDBet al (2008) Genome-scale DNA methylation maps of pluripotentand differentiated cells Nature 454 766ndash770

5 SeisenbergerS AndrewsS KruegerF ArandJ WalterJSantosF PoppC ThienpontB DeanW and ReikW (2012)The dynamics of genome-wide DNA methylation reprogrammingin mouse primordial germ cells Mol Cell 48 849ndash862

6 BockC (2012) Analysing and interpreting DNA methylationdata Nat Rev Genet 13 705ndash719

7 SmithAD ChungWY HodgesE KendallJ HannonGHicksJ XuanZ and ZhangMQ (2009) Updates to the RMAPshort-read mapping software Bioinformatics 25 2841ndash2842

8 CoarfaC YuF MillerCA ChenZ HarrisRA andMilosavljevicA (2010) Pash 30 A versatile software package forread mapping and integrative analysis of genomic and epigenomic


variation using massively parallel DNA sequencing BMCBioinformatics 11 572

9 XiY and LiW (2009) BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 10 232

10 ChenPY CokusSJ and PellegriniM (2010) BS Seeker precisemapping for bisulfite sequencing BMC Bioinformatics 11 203

11 KruegerF and AndrewsSR (2011) Bismark a flexible alignerand methylation caller for Bisulfite-Seq applicationsBioinformatics 27 1571ndash1572

12 PedersenB HsiehTF IbarraC and FischerRL (2011)MethylCoder software pipeline for bisulfite-treated sequencesBioinformatics 27 2435ndash2436

13 CoarfaC and MilosavljevicA (2008) Pash 20 scaleablesequence anchoring for next-generation sequencing technologiesPacific Symposium on Biocomputing 13 102ndash113

14 MillerCA HamptonO CoarfaC and MilosavljevicA (2011)ReadDepth a parallel R package for detecting copy numberalterations from short sequencing reads PLoS One 6 e16327

15 CokusSJ FengS ZhangX ChenZ MerrimanBHaudenschildCD PradhanS NelsonSF PellegriniM andJacobsenSE (2008) Shotgun bisulphite sequencing of theArabidopsis genome reveals DNA methylation patterning Nature452 215ndash219

16 CampagnaD TelatinA ForcatoC VituloN and ValleG(2013) PASS-bis a bisulfite aligner suitable for whole methylomeanalysis of Illumina and SOLiD reads Bioinformatics 29268ndash270

17 LimJQ TennakoonC LiG WongE RuanY WeiCL andSungWK (2012) BatMeth improved mapper for bisulfitesequencing reads on DNA methylation Genome Biol 13 R82

18 OttoC StadlerPF and HoffmannS (2012) Fast and sensitivemapping of bisulfite-treated sequencing data Bioinformatics 281698ndash1704

19 ChatterjeeA StockwellPA RodgerEJ and MorisonIM(2012) Comparison of alignment software for genome-widebisulphite sequence data Nucleic Acids Res 40 e79

20 LangmeadB TrapnellC PopM and SalzbergSL (2009)Ultrafast and memory-efficient alignment of short DNAsequences to the human genome Genome Biol 10 R25

21 LiR LiY KristiansenK and WangJ (2008) SOAP shortoligonucleotide alignment program Bioinformatics 24 713ndash714

22 WaterlandRA KellermayerR LaritskyE Rayco-SolonPHarrisRA TravisanoM ZhangW TorskayaMS ZhangJShenL et al (2010) Season of conception in rural gambia affectsDNA methylation at putative human metastable epialleles PLoSGenet 6 e1001252

23 MartinM (2011) Cutadapt removes adapter sequences fromhigh-throughput sequencing reads EMBnet J 17 10ndash12

24 ChenH and BoutrosPC (2011) VennDiagram a package forthe generation of highly-customizable Venn and Euler diagramsin R BMC Bioinformatics 12 35

25 QuinlanAR and HallIM (2010) BEDTools a flexible suite ofutilities for comparing genomic features Bioinformatics 26841ndash842

26 ShenL GuoY ChenX AhmedS and IssaJP (2007)Optimizing annealing temperature overcomes bias in bisulfitePCR methylation analysis Biotechniques 42 48ndash58

27 KruegerF KreckB FrankeA and AndrewsSR (2012) DNAmethylome analysis using short bisulfite sequencing data NatMethods 9 145ndash151

28 HarrisRA WangT CoarfaC NagarajanRP HongCDowneySL JohnsonBE FouseSD DelaneyA ZhaoYet al (2010) Comparison of sequencing-based methods to profileDNA methylation and identification of monoallelic epigeneticmodifications Nat Biotechnol 28 1097ndash1105

29 HansenKD LangmeadB and IrizarryRA (2012) BSmoothfrom whole genome bisulfite sequencing reads to differentiallymethylated regions Genome Biol 13 R83

30 IvanovM KalsM KacevskaM MetspaluA Ingelman-SundbergM and MilaniL (2013) In-solution hybrid capture ofbisulfite-converted DNA for targeted bisulfite sequencing of 174ADME genes Nucleic Acids Res 41 e72


Bisulfite-seq mapping algorithms are mainly used toestimate percentage methylation at specific CpG sites(methylation calls) but also provide the ability to callsingle nucleotide and small indel variants (13) and copynumber and structural variants (14) The analyses in thisarticle focus exclusively on issues relevant to methylationcalls clearly an algorithmrsquos ability to map reads invarious sequence contexts and make accurate methylationcalls may have profound implications for the interpret-ation of Bisulfite-seq experiments Descriptions of newBisulfite-seq mapping tools typically compare globalmetrics such as proportion of uniquely mapped readsglobal percentage methylation computational require-ments and running time Benchmarking studies havebeen performed using real data downloaded from publicdatabases [mainly human (3) or plant data (15)] simulateddata (16) or combinations of both (10121718) Howevernone of these previous studies performed independentquantitative verification of methylation calls The onlyprevious comparison of Bisulfite-seq mapping algorithmsusing independently generated sequencing data involved asingle reduced representation bisulfite sequencing data setobtained from one human sample (19) That studyevaluated the mapping efficiency of Bismark BSMAPand RMAPBS (7) as a function of read length andadaptor sequences They also compared the totalnumber of methylated CpG sites genome-wide but didnot compare overlap of CpG sites covered by the threemappers or assess accuracy of methylation callsHence although proper analysis and interpretation of

the increasing number of expensive Bisulfite-seq datasets critically depends on their performance one mayconclude that widely used mapping methods remainpoorly characterized To address this need we selectedthree mapping algorithms for detailed comparisonBismark (11) BSMAP (9) and Pash (8) which useBowtie (20) SOAP (21) and in-house aligners respect-ively Bismark and BSMAP are the most widely usedthree-letter and wild card mapping algorithms respect-ively (6) Pash performs a heuristic alignment of k-mermatches To complement this main comparison we alsoevaluated the performance of two additional mapping al-gorithms BS Seeker (10) and BatMeth (17) We generatedfour human methylomes (representing two tissues fromeach of two individuals) and mapped the Bisulfite-seqreads independently using the five algorithms Despitegenerally excellent concordance of the mapping resultsour analysis (and subsequent independent verification byquantitative bisulfite pyrosequencing) highlights import-ant differences among the mapping algorithms

MATERIALS AND METHODS

Sample collection Bisulfite-seq library preparationand sequencing

Two tissue samples peripheral blood lymphocyte (PBL)and hair follicle (HF) from two healthy male adults (C01and C02) were collected in accordance with institutionalIRB regulations HFs (30ndash50) were obtained by pluckingscalp or facial hair from the same individuals PBLs were

isolated by Ficoll gradient centrifugation Tissueswere stored at 80C until isolation of genomic DNAby proteinase-k digestion and phenolndashchloroformextraction (22)

Illumina libraries were generated according to themanufacturerrsquos sample preparation protocol for genomicDNA Approximately 1 mg of genomic DNA was frag-mented to 200ndash500 bp and end-repaired The 50-ends ofDNA fragments were phosphorylated and a singleadenine base was added to the 30-end Illumina adaptorswere ligated to the genomic DNA

Bisulfite modification was performed using the EZDNA Methylation-Direct kit (Zymo Research) accordingto the manufacturerrsquos instructions The bisulfite-modifiedDNA was amplified by using adaptor-specific primersand fragments of 200ndash500 bp were isolated by bead puri-fication The quantity and size distribution of sequencinglibraries were determined using the Pico Green fluores-cence assay and the Agilent 2100 Bioanalyzer respect-ively The DNA was sequenced on the Illumina HiSeq2000 as 100-bp paired-end reads following the manufac-turerrsquos protocols

Reads quality control and mapping

For each library the standard Illumina pipeline was usedto perform base calling and the results were generated infastq format Quality control of reads was accessed byrunning the FastQC program FastQC (httpwwwbioinformaticsbabrahamacukprojectsfastqc) helps to de-termine the best quality score and the right read length fortrimming as the signals decline while Illumina cyclesprogress Because the majority of the bases had qualityscores 28 we decided to use a quality score filteringvalue 28 and read length 50 bp The Cutadapt (23)program was used to trim adaptor sequences andperform quality score and read length filtering Thisresulted in gt85 of the reads for mapping and analysis

QC-passed reads were mapped to the University ofCalifornia Santa Cruz (UCSC httpgenomeucscedu)hg19 genome build using Pash 30 Bismark 074(Bowtie 1 mode) BSMAP 26 BatMeth 104 and BSSeeker2 For all the mapping algorithms we useddefault parameters as recommended by the authorsexcept the mismatch parameter that was set to 7 as thelibraries had longer read length The uniquely mappedreads from each mapper were further processed usingtheir respective post-processing scripts provided toestimate the percentage coverage and percentage methyla-tion for CpG sites genome-wide All the mapping was per-formed using a compute cluster with 36 nodes each nodecontaining 8 Intel Xeon E5540 CPUs and 24-GB RAMThe command lines used for all the mappers are providedin Supplementary Methods

Data analysis

Identification of genome-wide CpG site coverage andpercentage methylationTo identify the total number of CpG sites and theircoordinates genome-wide (UCSC hg19) an in-housePerl program was used which identified 282 million








RESULTS


































DISCUSSION











CONCLUSIONS



ACCESSION NUMBERS


SUPPLEMENTARY DATA


FUNDING



REFERENCES








































RESULTS


































DISCUSSION











CONCLUSIONS



ACCESSION NUMBERS


SUPPLEMENTARY DATA


FUNDING



REFERENCES


























































DISCUSSION











CONCLUSIONS



ACCESSION NUMBERS


SUPPLEMENTARY DATA


FUNDING



REFERENCES








































CONCLUSIONS



ACCESSION NUMBERS


SUPPLEMENTARY DATA


FUNDING



REFERENCES


































comparison and quantitative verification of mapping algorithms … · 2017-04-14 · comparison and...

Documents