PscanChIP: finding over-represented transcription factor-binding site motifs and their correlations in sequences from ChIP-Seq experiments

Federico Zambelli 1, Graziano Pesole 2,3 and Giulio Pavesi 1,*

1 Dipartimento di Bioscienze, Università di Milano, Via Celoria 26, 20133 Milano, Italy, 2 Istituto di Biomembrane e Bioenergetica, Consiglio Nazionale delle Ricerche, Via Amendola 165/A, 70126 Bari, Italy and 3 Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università di Bari, Via Orabona 4, 70124 Bari, Italy

Received February 17, 2013; Revised May 1, 2013; Accepted May 2, 2013

ABSTRACT

Chromatin immunoprecipitation followed by sequencing with next-generation technologies (ChIP-Seq) has become the de facto standard for building genome-wide maps of regions bound by a given transcription factor (TF). The regions identified, however, have to be further analyzed to determine the actual DNA-binding sites for the TF, as well as sites for other TFs belonging to the same TF complex or in general co-operating or interacting with it in transcription regulation. PscanChIP is a web server that, starting from a collection of genomic regions derived from a ChIP-Seq experiment, scans them using motif descriptors like JASPAR or TRANSFAC position-specific frequency matrices, or descriptors uploaded by users, and evaluates both motif enrichment and positional bias within the regions according to different measures and criteria. PscanChIP can successfully identify not only the actual binding sites for the TF investigated by a ChIP-Seq experiment but also secondary motifs corresponding to other TFs that tend to bind the same regions, and, if present, precise positional correlations among their respective sites. The web interface is free for use, and there is no login requirement. It is available at http://www.beaconlab.it/pscan_chip_dev.
INTRODUCTION

The regulation of eukaryotic gene transcription is a complex process, which depends on interactions between transcription factors (TFs) and their binding sites on DNA (TFBSs), as well as on chromatin structure and other epigenetic factors like DNA methylation (1). Chromatin immunoprecipitation (ChIP), followed by the application of next-generation sequencing technologies to the immunoprecipitated DNA (ChIP-Seq), makes it possible to build genome-wide maps of protein–DNA interactions, like TF binding, histone modifications or any other protein involved in the process (2).

The typical output of a ChIP-Seq experiment for a TF (the ChIP’ed TF, from here on) is a list of thousands of genomic regions with associated enrichment scores and corresponding P-values or false discovery rates. The plot of the enrichment (sequence read density) against the genome shows for these regions a ‘peak’ shape (Supplementary Figure S1), usually with a clear local maximum (‘summit’) point (3). As these regions may range in length from hundreds to thousands of base pairs (bp), further bioinformatic sequence analysis is still required to identify the actual TF-binding sites within them (3), either by de novo motif discovery (4) or by using descriptors (5–8) like position weight matrices [PWMs or motif profiles, (9)] available in databases like TRANSFAC (10) and JASPAR (11). These methods usually assess motif enrichment by comparing motif counts in the regions with background random models, which are essential to obtain reliable significance measures. However, the resolution offered by modern next-generation sequencing technologies is such that binding sites for the ChIP’ed TF are typically located in close proximity to the summit points. Motif discovery and enrichment analyses can take advantage of this fact, confining the analysis to 100–150 bp around them (3) or including positional bias in the measures used to assess motif significance (4). On the other hand, detecting not only the sites bound by the ChIP’ed TF but also secondary motifs for putative co-factors is of the utmost importance, as reported by recent large-scale studies (12,13) that unveiled widespread interactions among TFs and precise arrangements of their binding sites.

*To whom correspondence should be addressed. Tel: +39 50314884; Fax: +39 50315042; Email: [email protected]
Present address: Giulio Pavesi, Dipartimento di Bioscienze, Università di Milano, Via Celoria 26, 20133 Milano, Italy.

Published online 7 June 2013. Nucleic Acids Research, 2013, Vol. 41, Web Server issue W535–W543. doi:10.1093/nar/gkt448

© The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Downloaded from https://academic.oup.com/nar/article-abstract/41/W1/W535/1105963 by guest on 10 April 2018



Indeed, the latest PWM-based methods designed for ChIP-Seq, like CENTDIST (14) and CentriMo (15), altogether bypass the definition of a suitable ‘genome-wide’ background, and they do not evaluate actual motif enrichment in the input regions, but rather ‘maximum central enrichment’. That is, they determine whether putative binding sites, defined as oligos matching a given PWM above a given or estimated threshold value, tend to cluster toward the center of the regions, and whether the clustering is significant assuming a random uniform distribution. However, in this way, the actual enrichment of the motifs in the regions with respect to the rest of the genome is not explicitly evaluated, making it harder to assess, when different motifs present the same bias, which one is the most likely candidate to describe the binding of the TF studied and/or its most significant co-factors. The latter could in fact be present only in a limited subset of the input regions (so that the positional bias rests on a limited number of motif instances), or vice versa, secondary motifs that are actually enriched might show a much less marked clustering around the peak summit and thus be deemed less significant.

MATERIALS AND METHODS

To address the issues just discussed, we developed PscanChIP, based on our previous tool for promoter analysis, Pscan (5). In PscanChIP, we introduce different criteria to assess motif overrepresentation and positional bias in the analysis of ChIP-Seq peak regions. Our idea is that motifs actually corresponding to TF binding should be first of all overrepresented in the ChIP-Seq regions with respect to the rest of the genome (that is, the portion of the genome accessible for TF binding), and the same should hold true for motifs corresponding to other TFs interacting with the ChIP’ed one in a significant number of regions. Also, matches to a given PWM should not only present a positional bias, but also yield higher scores in the preferred positions, that is, they should better fit the matrix in the positions where they tend to cluster.

PscanChIP takes as input a set of genomic coordinates corresponding to ChIP-Seq peaks, assumes that they are centered on their summits and considers the 150 bp genomic regions around their center. The regions are scanned on both strands with the PWMs available in the JASPAR and TRANSFAC databases, and optionally with additional PWMs submitted by users. The scan returns for each oligo in the input regions a score between 0 and 1 (Supplementary Methods), and the best (highest scoring) oligo is selected from each input region. As in the original Pscan, we bypass the need to define matching thresholds for PWMs to predict likely TFBS instances, comparing instead for each matrix the mean score of the best-matching oligos in the input regions to expected values defined according to different backgrounds. TFs mostly bind DNA in accessible regions (12), and genome-wide maps of open chromatin can be built through experiments like DNaseI- or FAIRE-Seq (16). The result is a collection of segments of DNA that are accessible to regulatory factors and other DNA-interacting molecules. Thus, given a PWM, we compare the matching scores on the input sequences with an expected background value computed from the matching scores in a collection of accessible DNA regions of the same length. In particular, we used for this task the ENCODE regions available at the UCSC Genome Browser database (17), which have been identified through DNaseI Digital Genomic Footprinting (18). The regions differ according to the tissue/cell line studied, and thus yield cell-line–specific expected values. Therefore, as input, users also have to select the cell line/tissue on which their ChIP experiment was performed, or the closest relative in the list of available possibilities (e.g. HepG2 for liver cells). If no suitable choice is available, we also included a ‘mixed’ background, a sample of non-overlapping accessible regions built by taking at random a subset of the regions from each of the available cell lines, and a ‘promoter’ background, to be used if the input comes mostly from promoter regions. Each of the background sequence sets (cell-line specific or mixed) is made of ~200 000 regions.
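The scanning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the actual score normalization is defined in the Supplementary Methods, so the min/max rescaling of summed matrix frequencies used here, the `PWM` type and the function names are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical matrix layout: one dict of base -> frequency per motif position.
PWM = List[Dict[str, float]]

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def oligo_score(pwm: PWM, oligo: str) -> float:
    """Summed frequencies rescaled by the matrix minimum and maximum (assumed normalization)."""
    # Unknown bases such as N contribute 0 to the raw sum.
    raw = sum(col.get(base, 0.0) for col, base in zip(pwm, oligo))
    lo = sum(min(col.values()) for col in pwm)
    hi = sum(max(col.values()) for col in pwm)
    return (raw - lo) / (hi - lo) if hi > lo else 0.0


def best_match(pwm: PWM, region: str) -> Tuple[float, int, str]:
    """Return (score, start, strand) of the highest-scoring oligo on either strand."""
    width = len(pwm)
    best = (-1.0, -1, "+")
    for start in range(len(region) - width + 1):
        oligo = region[start:start + width].upper()
        for strand, seq in (("+", oligo), ("-", reverse_complement(oligo))):
            score = oligo_score(pwm, seq)
            if score > best[0]:
                best = (score, start, strand)
    return best
```

Only the best match per region is kept, so no matching threshold is needed at this stage; the per-region best scores are the values that the enrichment measures compare.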

Once the matching scores for a PWM are available in both the input and the background sequence sets, global enrichment can be computed by comparing, with a t-test, the mean and standard deviation of the best-match score in the input regions with the mean and standard deviation of the best matches in the background sequence set, yielding enrichment P-values from a two-tailed t distribution. Global enrichment can be used to identify motifs that, in general, are overrepresented in the regions, with no assumption on their location within the regions or any positional bias with respect to the peak summits. Sites for the ChIP’ed TF should in any case be the most significantly enriched according to this measure, and PWMs for other ‘co-regulating’ TFs, that is, TFs binding a significant number of the input regions, should also yield low P-values.
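A minimal sketch of such a comparison is given below. It is not the server's code: the exact form of the test is given in the Supplementary Methods, so the unequal-variance (Welch) statistic and the normal approximation to the t distribution are assumptions made here to keep the example dependency-free, and `enrichment_test` is a hypothetical name.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def enrichment_test(input_scores: Sequence[float],
                    background_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (t, two-tailed P-value) for input vs background best-match scores.

    A positive t means the motif is overrepresented in the input regions,
    a negative t that it is underrepresented.
    """
    n1, n2 = len(input_scores), len(background_scores)
    m1, m2 = mean(input_scores), mean(background_scores)
    s1, s2 = stdev(input_scores), stdev(background_scores)
    se = math.sqrt(s1 * s1 / n1 + s2 * s2 / n2)
    if se == 0.0:
        raise ValueError("zero variance in both samples")
    t = (m1 - m2) / se
    # Assumption: with thousands of regions the t distribution is close to a
    # standard normal, so the tail probability is taken from erfc.
    p_two_tailed = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p_two_tailed
```

The same function could be reused for a local comparison by passing the best-match scores of flanking regions as `background_scores`.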

PscanChIP also evaluates local enrichment, comparing with a t-test the mean and standard deviation of the score of the best-matching oligos in the input regions to the mean and standard deviation of the best match in the genomic regions flanking the input ones. Local enrichment can be used to identify motifs with a significant preference for binding within the regions, that is, the motif corresponding to the ChIP’ed TF, as well as those of other TFs likely to interact with it and binding in its neighborhood. It should be noted that, with respect to similar methods, local enrichment assesses not only positional bias but also how well the matches fit the profile used, without establishing a pre-determined matching threshold. Finally, PscanChIP evaluates motif positional bias within the input regions, by splitting them into overlapping sub-regions of 10 bp. The mean and standard deviation of the best-match score within each sub-region are compared with the mean and standard deviation across all the sub-regions, again with a t-test. In this way, motifs showing a positional bias, but more importantly having their best matches associated with a given sub-region, can be singled out. The bias can be for the center (as usually shown for the sites of the ChIP’ed TF), but it can be identified for any other sub-region. Further details on the calculations performed can be found in the Supplementary Methods.
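The positional bias measure can be illustrated as follows. The sketch assumes that a score is available for every oligo start position in each region, that consecutive 10 bp sub-regions overlap by half their length (the `step` of 5 bp is an assumption, as the text only says the sub-regions overlap), and that each sub-region is compared with the pool of all sub-regions through an unequal-variance t statistic. Function names are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def window_best_scores(position_scores: Sequence[float], size: int = 10,
                       step: int = 5) -> List[float]:
    """Best score inside each overlapping window of one region."""
    last_start = len(position_scores) - size
    return [max(position_scores[s:s + size]) for s in range(0, last_start + 1, step)]


def positional_bias(regions: Sequence[Sequence[float]], size: int = 10,
                    step: int = 5) -> Tuple[int, float]:
    """Return (offset of the preferred window center from the region center, t)."""
    per_region = [window_best_scores(r, size, step) for r in regions]
    n_windows = len(per_region[0])
    pooled = [score for row in per_region for score in row]
    pooled_mean, pooled_sd = mean(pooled), stdev(pooled)
    best_offset, best_t = 0, -math.inf
    for w in range(n_windows):
        column = [row[w] for row in per_region]  # same window across all regions
        se = math.sqrt(stdev(column) ** 2 / len(column) + pooled_sd ** 2 / len(pooled))
        t = (mean(column) - pooled_mean) / se if se > 0 else 0.0
        if t > best_t:
            center = w * step + size // 2
            best_offset, best_t = center - len(regions[0]) // 2, t
    return best_offset, best_t
```

Because the statistic is built on scores rather than on counts of thresholded hits, a window wins only if the matches it contains are also good fits to the matrix, which is the point made in the paragraph above.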

These different measures can detect different types of overrepresented motifs. The PWM corresponding to the ChIP’ed TF, or to the co-factor(s) recruiting it, should be the highest ranking (lowest P-values) with respect to all three measures. In the case of ChIP-Seq experiments with low resolution (e.g. with low quantities of IP’ed DNA or a weakly specific antibody) and without sharp peak summits, the global P-value should nevertheless be low. PWMs corresponding to possible interactors (or competitors) of the ChIP’ed TF (often, but not necessarily always, binding DNA in its neighborhood) should have low global and local P-values, but rank after the ChIP’ed TF. Positional bias could also appear for these, but not as a rule. ‘Co-regulators’ (TFs that often bind the same promoters/enhancers as the ChIP’ed TF) tend to regulate the same genes, but do not have to bind cooperatively with it, and should have low global P-values but not necessarily low local P-values, and no positional bias. On the other hand, PWMs with low local P-values only (better if substantiated by positional bias) might correspond to TFs interacting with the ChIP’ed one only in a limited subset of the input.
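The reading of the three measures outlined above can be summarized as a small decision function. This is a hypothetical helper, not part of PscanChIP: the significance cut-off `alpha` is an arbitrary assumption, and the function cannot separate the ChIP’ed TF from a close interactor, which according to the text is decided by ranking.

```python
from typing import Optional


def interpret_motif(global_p: float, local_p: float,
                    positional_p: Optional[float] = None,
                    alpha: float = 1e-5) -> str:
    """Map the three P-values of one matrix to the motif classes described in the text."""
    is_global = global_p < alpha
    is_local = local_p < alpha
    is_biased = positional_p is not None and positional_p < alpha
    if is_global and is_local and is_biased:
        return "ChIP'ed TF or interactor with positional bias (rank decides)"
    if is_global and is_local:
        return "possible interactor or competitor"
    if is_global:
        return "possible co-regulator"
    if is_local:
        return "possible interactor in a subset of the regions"
    return "no evidence of enrichment"
```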

THE USER INTERFACE

Input

Users have to submit a list of genomic coordinates in BED format (e.g. chromosome, start, end) corresponding to ChIP-Seq peak regions. PscanChIP assumes that the regions are centered on the point of maximum enrichment within the peak and will automatically retrieve and analyze the 150-bp sequences around that point (Supplementary Methods). There is no limit on the number of regions that can be input. Optionally, if the regions are sorted by users in decreasing order of region enrichment and/or significance P-value (the most enriched at the top), PscanChIP will also assess the correlation between the enrichment of a motif and the enrichment of the regions in the ChIP-Seq.
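As an illustration of the input handling just described, the following sketch turns BED lines into 150-bp windows around each region midpoint. It assumes standard BED conventions (whitespace-separated fields, 0-based half-open coordinates); how the server itself handles odd-length regions or chromosome ends is not stated here, so those details and the function name are assumptions.

```python
from typing import Iterator, List, Tuple

Region = Tuple[str, int, int]


def centered_windows(bed_lines: List[str], width: int = 150) -> Iterator[Region]:
    """Yield (chrom, start, end) windows of `width` bp around each region midpoint."""
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and UCSC header lines
        chrom, start, end = line.split()[:3]
        center = (int(start) + int(end)) // 2
        win_start = max(0, center - width // 2)  # never run off the chromosome start
        yield chrom, win_start, win_start + width
```

For example, a peak at `chr1 1000 1400` has its midpoint at 1200 and yields the window `chr1 1125 1275`.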

Input parameters that have to be set are the organism (human or mouse at the moment) and the genome assembly that was used for the ChIP-Seq analysis. The background field is used to select the genomic background used to assess global enrichment of motifs, as explained in the previous section. Finally, the ‘Select Descriptors’ option specifies whether the analysis has to be performed with profiles available in the JASPAR or TRANSFAC (public release) databases, and ‘Additional Descriptors’ allows users to upload a specific matrix set that will be added to the chosen database (see the online help page for further details on the custom matrix format).

The time required by the computations depends on the number of regions and of matrices involved in the analysis, and seldom exceeds a couple of minutes on typical TF ChIP-Seq experiments of several thousand regions.

Output

The output will appear in the middle of the page. For each matrix of the selected set and the custom matrices (if any), PscanChIP outputs, as shown in Figure 1a:

(1) Name and ID: the name and the database ID of the matrix.

(2) Local enrichment P-value (L.PV): describing whether the motif is over- or underrepresented in the 150 bp input regions with respect to the genomic regions flanking them, with an arrow (L.O/U) indicating whether the motif is over- (red upward) or underrepresented (green downward).

(3) Global enrichment P-value (G.PV): describing whether the motif is over- or underrepresented in the input regions with respect to the global background used, with the respective arrow (G.O/U). Notice that, for computational reasons, this value is not calculated for user-submitted matrices.

(4) Spearman correlation coefficient (SP.COR): if the input regions were ranked according to their enrichment, this value represents the Spearman rank correlation coefficient between the ranking of the input regions and their ranking with respect to the motif. Positive values indicate that the more enriched the regions are, the better the matches they contain for the motif, and vice versa.

(5) Preferred position (P.POS) and position bias P-value (P.POS.PV): this indicates the position (10 bp sub-region) within the input regions where oligos tend to be found with the best matches to the matrix. Coordinates are relative to the center of the regions, which has coordinate 0. The associated P-value indicates the significance of the positional bias of the motif.
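The SP.COR column of item (4) can be reproduced in spirit with a plain Spearman rank correlation. The sketch below assumes that the only enrichment information is the input order (most enriched region first) and that ties in the motif scores receive their average rank; the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_vs_input_order(best_scores: Sequence[float]) -> float:
    """Correlation between region enrichment (given by input order) and motif score."""
    n = len(best_scores)
    enrichment = [float(n - i) for i in range(n)]  # first region gets the highest value
    rx, ry = average_ranks(enrichment), average_ranks(best_scores)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined when one ranking is constant
    return cov / (var_x * var_y) ** 0.5
```

With this convention a positive value means that regions near the top of the list tend to contain better matches to the matrix.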

The output table can be sorted according to any column value by clicking on the corresponding header.

Clicking on a matrix name (NFYA in Figure 1a) opens a detailed output page for it (Figure 1b). This page makes it possible to retrieve, for each region submitted, its genomic coordinates and the best-matching oligo, its score, and its position and strand with respect to the region itself (0 is the center of the region). It also includes a simple graphical representation of the location of the best-matching oligos within the regions, as well as the corresponding histogram. Notice that the most enriched sub-region according to the P-value does not have to correspond to the local maximum sub-region of the histogram. Rather, disagreement between those two measures is an indicator of low-quality sites clustering in the histogram.

Performing a motif-centered analysis

The ‘Run Pscan-Chip centered on these sites’ button associated with the occurrences list of any PWM automatically starts a motif-centered analysis on the PWM itself. For all the regions containing a best match with a score higher than the global background average, PscanChIP will process the 150 bp genomic regions around the matching oligo.

Figure 1. The overall output of PscanChIP (a), and detailed NFYA PWM output (b) for the ENCODE ChIP-Seq of NF-YA in K562 cells.

Similar computations are performed, e.g. by the SpaMo tool (19), but in our implementation, positional correlations between PWMs are incorporated and integrated with other types of enrichment analysis, overall providing a more complete picture.

The computations of the motif-centered analysis and its output are identical to the default one, with the sole difference that for the matrix chosen (whose best matches fall exactly in the middle of the regions analyzed) PscanChIP will instead report in the detailed output page the position of the second best match, to detect whether there are preferential arrangements of sites for the same TF.
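The re-centering step of the motif-centered analysis can be sketched as below. The `BestMatch` record and its fields are hypothetical, and centering the new window on the midpoint of the matching oligo is an assumption; the text only states that the 150 bp around the matching oligo are processed.

```python
from typing import List, NamedTuple, Tuple


class BestMatch(NamedTuple):
    chrom: str
    region_start: int  # 0-based start of the scanned 150-bp region
    offset: int        # start of the best oligo within the region
    width: int         # motif length
    score: float       # best-match score in [0, 1]


def recenter_on_motif(matches: List[BestMatch], background_mean: float,
                      width: int = 150) -> List[Tuple[str, int, int]]:
    """Windows of `width` bp centered on best matches scoring above the background mean."""
    windows = []
    for m in matches:
        if m.score <= background_mean:
            continue  # regions without a good enough match are left out
        site_center = m.region_start + m.offset + m.width // 2
        start = max(0, site_center - width // 2)
        windows.append((m.chrom, start, start + width))
    return windows
```

The resulting windows can then be analyzed exactly like summit-centered regions, so that the positional bias of every other matrix is expressed relative to the chosen motif.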

RESULTS

We ran PscanChIP on several different data sets retrieved from the literature and compared the results with other similar tools. The full results are available in the Supplementary Material and can be reproduced by running PscanChIP on the samples included in the web interface. All the parameters required will be set automatically.

‘Core’ TFs in mouse embryonic stem cells

A seminal work showing the full potential of ChIP-Seq is the analysis of 13 different ‘core’ TFs in mouse embryonic stem cells [Nanog, Oct4, STAT3, Smad1, Sox2, Zfx, c-Myc, n-Myc, Klf4, Esrrb, Tcfcp2l1, E2F1 and CTCF (20)]. Once the binding regions on the genome for each TF had been determined, further processing identified binding correlations among the different TFs, forming two main ‘multi transcription factor loci (MTL)’, the first always containing c-Myc and/or n-Myc-binding sites, but not Nanog, Oct4 and Sox2, and vice versa the second.

This data set represents a good benchmark to assess whether the global enrichment measure used by PscanChIP is able to retrieve the same correlations obtained in the original study. The matrix for the main TF investigated was singled out in each experiment as the most enriched both locally and globally, with a marked positional bias for the summit location, showing the reliability of the enrichment measures we used (see Supplementary Table S1 for the full results). For Nanog and Smad1, however, a PWM for the corresponding TF is not available either in JASPAR or in the free version of TRANSFAC included in the interface. Moreover, although we cannot expect the results to be symmetrical, that is, the enrichment for motif X in the ChIP for factor Y will be different from the enrichment of motif Y in the ChIP of X, matrices corresponding to TFs co-binding in a significant number of loci can in any case be singled out consistently in the ChIP of their partners, with global enrichment P-values reproducing the correlations reported in the study [see Figure 2 and compare with Figure 4A in (20)]. The two MTLs can be clearly identified, one comprising n-Myc, c-Myc, Zfx and E2F1 and the other Oct4, Sox2, Nanog and Smad1 (in the latter two the Oct4 and Sox2 matrices are highly enriched), as can the high correlations among Esrrb, Klf4 and Tcfcp2l1. Also, the Myc PWMs are underrepresented in the Oct/Sox MTL regions, and vice versa. The STAT3 motif, on the other hand, shows little enrichment in all the data sets, confirming its lower correlation to the other experiments shown in the study. The sole exception is the set of E2F1 regions, in which, as also previously observed (15), E2F-like matrices lack marked central enrichment and positional bias. On the same data set, CENTDIST (run on the best 10 000 regions included in the web interface, which is the maximum number of regions it can process) ranks E2F in second place after SP1, but using a weakly conserved SP1-like non-canonical motif (SSGCSS), whereas for PscanChIP the canonical JASPAR E2F1 site (TTTGGCGC) also has very low global and local P-values, but no relevant positional bias. In this data set, E-box–like motifs (c-Myc and n-Myc) are among the highest ranking ones for PscanChIP (global P-value = 0), also with a positional bias toward the center of the regions, whereas in CENTDIST, they are ranked in seventh place. On the other hand, these ChIP-Seqs show a significant overlap between E2F1 and/or n-Myc- and c-Myc-bound sites in ESCs (35–45% of Myc sites falling within 100 bp of an E2F1 site). To further substantiate the hypothesis that an E-box motif could be required for E2F1 binding, at least in some contexts, we applied PscanChIP to the ENCODE E2F1-binding sites in HeLa cells retrieved from the UCSC Genome Browser database (17). Once again, the two highest ranking motifs are E2F1 (canonical) and the E-box (n-Myc or c-Myc), with the lowest global and local P-values simultaneously, and both showing a positional bias toward the summit (Supplementary Table S1). CENTDIST again reports SP1 as the most enriched motif, ranking the canonical E2F1 motif in fourth place, and E-boxes around the 10th. YY1, identified by CentriMo as the more likely candidate for the recruitment of E2F1 (15), presents only marginal local enrichment. All in all, our results hint at possible recruitment of E2F1 by one of the E-box–binding factors, as indeed has already been demonstrated for c-Myc (21). Finally, in the CTCF sample (>30 000 regions), we identified again the E-box as a possible co-binding motif, confirming its preference for c/n-Myc MTLs and recovering already known possible associations between CTCF and MYC (22). The same test could not be replicated with CENTDIST because of the limit of 10 000 regions allowed as input.
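An overlap figure such as the share of Myc sites lying within 100 bp of an E2F1 site can be computed with a sorted-position lookup. The sketch below is a generic illustration with hypothetical names; it assumes that each site is reduced to a single coordinate (for instance its center), which is not necessarily how the figure quoted above was obtained.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Site = Tuple[str, int]  # (chromosome, position of the site)


def fraction_within(query: List[Site], reference: List[Site],
                    max_dist: int = 100) -> float:
    """Fraction of query sites lying within `max_dist` bp of any reference site."""
    by_chrom: Dict[str, List[int]] = {}
    for chrom, pos in reference:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    hits = 0
    for chrom, pos in query:
        positions = by_chrom.get(chrom, [])
        i = bisect_left(positions, pos)
        # Only the nearest reference sites on either side can be the closest one.
        neighbours = positions[max(0, i - 1):i + 1]
        if any(abs(pos - p) <= max_dist for p in neighbours):
            hits += 1
    return hits / len(query) if query else 0.0
```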

STAT3 binding in four different cell lines

In a recent work (23), binding regions of STAT3 have been determined through ChIP-Seq in four different mouse cell lines [embryonic stem cells (ESC), CD4+ T cells, macrophages (PEC) and AtT-20 cells]. Analysis of the regions through motif discovery, protein–protein interaction data and overlap with other ChIP-Seq experiments revealed distinct transcription regulatory modules, with different co-factors singled out as essential for cell-line–specific binding of STAT3. PscanChIP was able to identify STAT motifs as the most enriched with respect to every measure used (globally, locally and with positional bias) in the ESC and AtT-20 data sets (Supplementary Table S2). In CD4+, STAT motifs are the most enriched globally, have a local P-value near 0 and are the only ones presenting positional bias, whereas CENTDIST ranks them in second place. In this cell line, IRF and GATA motifs are the most likely candidates to represent main interactors of STAT3 [confirming the findings of (23)], and indeed, the motif-centered analysis on STAT3 shows their preferential arrangement immediately upstream or downstream (Supplementary Figure S2). On the other hand, on the ESC data set, PscanChIP confirmed Esrrb as a possible interactor, whereas Oct4 and Sox2, identified in the study as essential co-factors, did not show any significant global or local motif enrichment, and showed only a moderate enrichment in CENTDIST, with a low number of estimated regions containing the motif. The fact that PWMs for other ESC core TFs like Klf4 and Tcfcp2l1 are more enriched both globally and locally (the latter confirmed also by CENTDIST) leads us to conjecture that other factors could be involved in the recruitment of STAT3, or that the respective binding motifs are not essential for the function of Oct4 and Sox2. As described previously (23), on the data set comprising regions in any cell line shared with other ChIP-Seq experiments (‘any three’), PscanChIP is able to retrieve E-box motifs (c-Myc or n-Myc) as significantly enriched as well.

Figure 2. Heatmap of PscanChIP global P-values associated with the JASPAR PWMs of core TFs in mouse ESC peak regions (positive and negative values denote over- and underrepresented motifs, respectively). Rows correspond to ChIP-Seq experiments, columns to motifs. MTLs reported previously (20) are highlighted in green (c-Myc/n-Myc) and yellow (Nanog/Oct4/Sox2).

Notably, in the PEC macrophage data set, composed of 1300 regions, PscanChIP could not find any ‘dominating motif’ (i.e. ranking first for all the measures used as in the previous examples). However, the only motifs with a significantly low P-value (<10⁻⁵) for both global and local enrichment were those corresponding to STAT matrices, even if with no clear positional bias toward the region summits. We think that the result might be due to experimental issues in the ChIP-Seq experiment or in the peak calling pipeline, because in the other cell lines STAT binding could be clearly determined, and most importantly in these regions no other motif could be detected showing both global and local enrichment, differently from virtually all the other published summit-centered data sets we analyzed with PscanChIP. The fact that the study reports that >70% of these regions do not contain any variant of the annotated STAT3 motifs seems to further consolidate our hypothesis. On these regions, CENTDIST ranked STAT matrices in third place, with a P-value of 0.002, because of the moderate bias of STAT sites toward the center of the regions. This example shows how using different measures of enrichment can help solve cases in which there is no clear signal emerging from the data, and point out possible issues either in the ChIP or in the downstream bioinformatic analysis.

Finally, the study determined a ‘core’ set of 32 binding regions, conserved across the four cell lines, which co-regulate a set of key genes important for STAT3 function in multiple cell types. These 400-bp regions were not summit centered, as they were derived from the intersection of four different experiments with different summits. On this small data set, PscanChIP again reveals STAT profiles as the most locally and globally enriched, with, among others, E2F1 (shown to be needed to recruit STAT3 to these genomic loci), MYC and KLF4 as putative co-regulators. This example shows the robustness of the enrichment measures we used even in small and not summit-centered data sets. On this example, CENTDIST fails to find any enriched matrix, all the profiles output having P-values of 0.5.
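The notion of positional bias toward the region summits used in these examples can be inspected with a simple histogram of best-match offsets. The sketch below is illustrative only and is not the statistic PscanChIP computes: the function name `offset_histogram`, the half-width and the bin width are hypothetical, and the offsets are assumed to have been obtained beforehand by scanning each region with a PWM.

```python
from collections import Counter


def offset_histogram(offsets, half_width=75, bin_width=10):
    """Count best-match offsets (bp from the region center) per bin.

    offsets: signed integer distances between the best motif match in
    each region and the region center (assumed to be the summit).
    Returns a dict mapping the left edge of each bin to its count,
    covering [-half_width, half_width), empty bins included.
    The default sizes are hypothetical, not PscanChIP settings.
    """
    counts = Counter()
    for off in offsets:
        if -half_width <= off < half_width:
            # Shift to a non-negative coordinate so floor division
            # assigns negative offsets to the correct bin.
            left_edge = ((off + half_width) // bin_width) * bin_width - half_width
            counts[left_edge] += 1
    edges = range(-half_width, half_width, bin_width)
    return {edge: counts.get(edge, 0) for edge in edges}


if __name__ == "__main__":
    # Toy offsets: most best matches fall near the center.
    demo = [-3, 0, 4, 9, -10, 49, -50]
    for edge, n in offset_histogram(demo, half_width=50).items():
        print(f"{edge:>4} {'#' * n}")
```

A profile peaking in the central bins would correspond to the summit preference seen for STAT matrices in most cell lines, whereas a flat profile would resemble the PEC macrophage case.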

NF-Y binding in human K562 cells

In this experiment, we analyzed the list of ENCODE peaks for NF-YA in K562 cells retrieved from the UCSC Genome Browser database (17). The output, also shown in Figure 1a, lists several JASPAR matrices with low P-values, and as expected the one associated with NF-YA (NFYA, CCAAT-box) is the highest ranking one for all the measures. Moreover, as the input was ranked according to enrichment, the NFYA matrix also shows a positive Spearman correlation (0.36) with the enrichment in the ChIP. In the detailed output page for NFYA (Figure 1b), the histogram in the bottom right corner shows a clear preference for the center of the regions, but with a bimodal distribution (Figure 3a, left), mostly because NF-Y often binds DNA in two consecutive CCAAT boxes at a preferred distance of ~33 bp.

Figure 3. (a) Histogram of the best motif occurrences of the NFYA PWM in ChIP-Seq regions for NF-YA in K562 cells (left), and of the second-best occurrences in regions centered on the NFYA motif (right). (b) Relative positioning of other motifs in regions for NF-YA centered on the NFYA motif.

The fact that there are several other matrices with low local and global P-values is an indicator that there might be other TFs binding cooperatively, each in a significant subset of the regions. Among these, AP1 (bound by the FOS-JUN complex) shows only a low local P-value but no global enrichment, whereas for CENTDIST it does not have any significant enrichment. Indeed, results of the analysis centered on the NFYA motif (Supplementary Table S3 and histograms in Figure 3) show that there are other matrices presenting, in addition to local or global enrichment, a marked positional bias for sub-regions not corresponding to the center (which is occupied by NF-Y itself), like E-boxes (Myc, MAX, USF and so on), TBP, SP1 and also AP1 PWMs (Figure 3b). Also, the second-best NF-Y motif occurrences have a clear preference for a 30–35 bp distance with respect to the central CCAAT box (Figure 3a, right). Indeed, sites predicted for AP1 could be further ‘validated’ by regions bound by FOS in K562 cells in the respective ChIP-Seq experiment, showing a correlation between sites for the two factors in LTR repeated regions [see (24), also for further discussion on the other motif arrangements found with respect to the CCAAT box]. The AP1-NFY relationship is an example of a correlation resulting from a limited number of secondary sites that are not enough to reach a significant global enrichment, but only a marginal local one, and that when present have a precise positional correlation highlighted by the corresponding P-values. This latter case, however, should be thoroughly checked, to determine whether the lack of global enrichment is indeed due to the presence of just a low number of ‘high-quality’ sites (which can be confirmed by the positional P-value as in this example), or instead to a high number of ‘low-quality’ sites, unlikely to be bound by the corresponding TF.
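The Spearman correlation quoted for the NFYA matrix earlier in this section is a rank correlation, and a minimal sketch of how such a value can be computed is given here. It assumes that the two variables are, for each region, the score of the best motif match and the ChIP enrichment; the text does not spell out the exact variables or how ties are handled, so the average-rank treatment and the function names below are assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical toy data: ChIP enrichment and best-match motif score.
    enrichment = [50.0, 42.0, 30.0, 12.0, 8.0]
    motif_score = [0.95, 0.97, 0.80, 0.85, 0.60]
    print(round(spearman(enrichment, motif_score), 3))
```

A positive value indicates that regions with stronger ChIP enrichment tend to contain better matches to the matrix, which is the behavior expected for the motif of the ChIP'ed TF.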

CONCLUSIONS

PscanChIP is a web server that, given a set of ChIP-Seq enriched regions, finds overrepresented TFBS motifs described by PWMs. Overrepresentation is evaluated according to different criteria, by adapting the statistical framework we introduced in our previous tool Pscan.

Like other similar tools, PscanChIP can be used to identify the correct descriptor for the binding specificity of TFs investigated by ChIP-Seq experiments, but, as we have shown in the examples, it can also correctly single out additional motifs corresponding to other TFs cooperating with the ChIP’ed one, with or without taking advantage of positional arrangements of their respective binding sites within the regions. Indeed, the three different measures of enrichment can solve dubious cases, or help in the presence of noisy data, and their combination can be used to identify general co-regulators (i.e. usually binding the same promoters or enhancers), or to discover precise positional correlations hinting at cooperative binding even in a limited number of the ChIP’ed regions.
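The way the three measures can be read together may be summarized by a small triage function. This is only a sketch of the reasoning followed in the examples, not part of PscanChIP: the function name, the labels and the single threshold `alpha` (set here to the 10⁻⁵ value mentioned for the PEC data set) are assumptions made for illustration.

```python
def interpret_motif(global_p, local_p, positional_p, alpha=1e-5):
    """Suggest a reading of a motif from its three P-values.

    alpha is a hypothetical common threshold; in practice each
    measure may call for its own cut-off and manual inspection.
    """
    glob = global_p < alpha
    loc = local_p < alpha
    pos = positional_p < alpha
    if glob and loc and pos:
        # Enriched by every measure, as for the ChIP'ed TF motif.
        return "candidate primary motif"
    if pos:
        # Positional signal without full enrichment, as in the
        # AP1-NFY case: check the quality of the underlying sites.
        return "positionally correlated partner in a subset of regions"
    if glob or loc:
        # Enriched but without a preferred position.
        return "general co-regulator candidate"
    return "no clear signal"


if __name__ == "__main__":
    print(interpret_motif(1e-30, 1e-20, 1e-10))
    print(interpret_motif(0.3, 1e-6, 1e-8))
```

The labels are starting points for inspection rather than conclusions, because a missing global enrichment may reflect either few high-quality sites or many low-quality ones.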

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–3, Supplementary Figures 1–2 and Supplementary Methods.

ACKNOWLEDGEMENTS

The authors thank the anonymous reviewers for their thorough comments, criticisms and suggestions, which helped us to improve the manuscript both in its substance and its form.

FUNDING

Funding for open access charge: Italian Ministry of University and Research Fondo Italiano per la Ricerca di Base (FIRB) project ‘Laboratorio Internazionale di Bioinformatica’ (LIBI); Consiglio Nazionale delle Ricerche (CNR) flagship project EPIGEN.

Conflict of interest statement. None declared.

REFERENCES

1. Geiman,T.M. and Robertson,K.D. (2002) Chromatin remodeling, histone modifications, and DNA methylation-how does it all fit together? J. Cell. Biochem., 87, 117–125.

2. Furey,T.S. (2012) ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nat. Rev. Genet., 13, 840–852.

3. Pepke,S., Wold,B. and Mortazavi,A. (2009) Computation for ChIP-seq and RNA-seq studies. Nat. Methods, 6, S22–S32.

4. Zambelli,F., Pesole,G. and Pavesi,G. (2013) Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief. Bioinform., 14, 225–237.

5. Zambelli,F., Pesole,G. and Pavesi,G. (2009) Pscan: finding over-represented transcription factor binding site motifs in sequences from co-regulated or co-expressed genes. Nucleic Acids Res., 37, W247–W252.

6. Frith,M.C., Fu,Y., Yu,L., Chen,J.F., Hansen,U. and Weng,Z. (2004) Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res., 32, 1372–1381.

7. Hestand,M.S., van Galen,M., Villerius,M.P., van Ommen,G.J., den Dunnen,J.T. and Hoen,P.A. (2008) CORE_TF: a user-friendly interface to identify evolutionary conserved transcription factor binding sites in sets of co-regulated genes. BMC Bioinformatics, 9, 495.

8. Ji,X., Li,W., Song,J., Wei,L. and Liu,X.S. (2006) CEAS: cis-regulatory element annotation system. Nucleic Acids Res., 34, W551–W554.

9. Stormo,G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 16–23.

10. Matys,V., Fricke,E., Geffers,R., Gossling,E., Haubrock,M., Hehl,R., Hornischer,K., Karas,D., Kel,A.E., Kel-Margoulis,O.V. et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374–378.

11. Portales-Casamar,E., Thongjuea,S., Kwon,A.T., Arenillas,D., Zhao,X., Valen,E., Yusuf,D., Lenhard,B., Wasserman,W.W. and Sandelin,A. (2010) JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res., 38, D105–D110.

12. Gerstein,M.B., Kundaje,A., Hariharan,M., Landt,S.G., Yan,K.K., Cheng,C., Mu,X.J., Khurana,E., Rozowsky,J., Alexander,R. et al. (2012) Architecture of the human regulatory network derived from ENCODE data. Nature, 489, 91–100.

13. Wang,J., Zhuang,J., Iyer,S., Lin,X., Whitfield,T.W., Greven,M.C., Pierce,B.G., Dong,X., Kundaje,A., Cheng,Y. et al. (2012) Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res., 22, 1798–1812.

14. Zhang,Z., Chang,C.W., Goh,W.L., Sung,W.K. and Cheung,E. (2011) CENTDIST: discovery of co-associated factors by motif distribution. Nucleic Acids Res., 39, W391–W399.

15. Bailey,T.L. and Machanick,P. (2012) Inferring direct DNA binding from ChIP-seq. Nucleic Acids Res., 40, e128.

16. Giresi,P.G., Kim,J., McDaniell,R.M., Iyer,V.R. and Lieb,J.D. (2007) FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res., 17, 877–885.

17. Rosenbloom,K.R., Sloan,C.A., Malladi,V.S., Dreszer,T.R., Learned,K., Kirkup,V.M., Wong,M.C., Maddren,M., Fang,R., Heitner,S.G. et al. (2013) ENCODE data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res., 41, D56–D63.

18. Sabo,P.J., Hawrylycz,M., Wallace,J.C., Humbert,R., Yu,M., Shafer,A., Kawamoto,J., Hall,R., Mack,J., Dorschner,M.O. et al. (2004) Discovery of functional noncoding elements by digital analysis of chromatin structure. Proc. Natl Acad. Sci. USA, 101, 16837–16842.

19. Whitington,T., Frith,M.C., Johnson,J. and Bailey,T.L. (2011) Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res., 39, e98.

20. Chen,X., Xu,H., Yuan,P., Fang,F., Huss,M., Vega,V.B., Wong,E., Orlov,Y.L., Zhang,W., Jiang,J. et al. (2008) Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell, 133, 1106–1117.

21. Leung,J.Y., Ehmann,G.L., Giangrande,P.H. and Nevins,J.R. (2008) A role for Myc in facilitating transcription activation by E2F1. Oncogene, 27, 4172–4179.

22. Lee,B.K., Bhinge,A.A., Battenhouse,A., McDaniell,R.M., Liu,Z., Song,L., Ni,Y., Birney,E., Lieb,J.D., Furey,T.S. et al. (2012) Cell-type specific and combinatorial usage of diverse transcription factors revealed by genome-wide binding studies in multiple human cells. Genome Res., 22, 9–24.

23. Hutchins,A.P., Diez,D., Takahashi,Y., Ahmad,S., Jauch,R., Tremblay,M.L. and Miranda-Saavedra,D. (2013) Distinct transcriptional regulatory modules underlie STAT3’s cell type-independent and cell type-specific functions. Nucleic Acids Res., 41, 2155–2170.

24. Fleming,J.D., Pavesi,G., Benatti,P., Imbriano,C., Mantovani,R. and Struhl,K. (2013) NF-Y co-associates with FOS at promoters, enhancers, repetitive elements, and inactive chromatin regions, and is stereo-positioned with growth-controlling transcription factors. Genome Res., gr.148080.112 [pii] doi:10.1101/gr.148080.112.
