
JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 13, Number 4, 2006. © Mary Ann Liebert, Inc. Pp. 929–945

Enhancing the Prediction of Transcription Factor Binding Sites by Incorporating Structural Properties and Nucleotide Covariations

SUMEDHA GUNEWARDENA,1 PETER JEAVONS,2 and ZHAOLEI ZHANG1

ABSTRACT

A problem faced by many algorithms for finding transcription factor (TF) binding sites is the high number of false positive hits that result with the increased sensitivity of their prediction. A main contributing factor to this is the short and degenerate nature of these sites, which results in a low signal-to-noise ratio. In order to counter this problem, one needs to look beyond the assumption that individual bases of a TF binding site act independently from each other when binding to a transcription factor. In this paper, we present a new method based on templates, designed to exploit the discriminatory features, nucleotide polymorphism, and structural homology present in TF binding sites for distinguishing them from nonbinding sites.

Key words: template, transcription factor, binding site, linear discriminant analysis.

1. INTRODUCTION

In humans and other higher eukaryotes, gene expression is regulated by the binding of various modulatory transcription factors (TF) onto cis-regulatory elements near genes. Binding of different combinations of transcription factors may result in a gene being expressed in different tissue types or at different developmental stages. To fully understand a gene’s function, therefore, it is essential to identify the transcription factors that regulate the gene and the corresponding TF binding sites. Traditionally, these regulatory sites were determined by labor-intensive wet-lab techniques such as DNase footprinting or gel-shift assays. Various computational approaches have been developed to predict TF binding sites in silico, which is an active research area in bioinformatics (Wasserman and Sandelin, 2004; Tompa et al., 2005).

TF binding sites are relatively short (10–20 bp) and highly degenerate sequences, which makes their effective identification a computationally challenging task. Initial methods for identifying TF binding sites were based on consensus sequences (Schug and Overton, 1997) and position-specific weight matrices (PSWM) (Quandt et al., 1995; Chen et al., 1995; Fickett, 1996; Stormo and Fields, 1998). PSWMs have been shown to be more effective than simple consensus sequences and have been employed in programs such as Signal Scan (Prestridge, 1991), TESS (Schug and Overton, 1997), MatInspector (Quandt et al., 1995), and MatrixSearch (Chen et al., 1995) for finding binding sites in genomic sequences. Some other approaches which have been used to identify TF binding sites include rule-based systems (Stormo et al., 1982), Gibbs sampling (Lawrence et al., 1993), expectation maximization (Bailey and Elkan, 1995; Grundy et al., 1996), neural networks (Demeler and Zhou, 1991; O’Neill, 1991; Horton and Kanehisa, 1992), and comparative genomics (Xie et al., 2005; Wasserman and Sandelin, 2004).

1Banting and Best Department of Medical Research, Donnelly CCBR, University of Toronto, 160 College St., Toronto, Ontario, M5S 3E1, Canada.

2Oxford University Computing Laboratory, Parks Road, Oxford, OX1 3QD, UK.

Approaches such as consensus sequences and PSWM, though widely used and in most cases considered adequate for finding TF binding sites (Benos et al., 2002), are at an intrinsic disadvantage as they are based on the assumption that individual bases act independently of each other when interacting with transcription factors. They also ignore information such as local sequence context and DNA double-helix structure. As a result, most of the predicted TF binding sites from these programs are false positives with no biological significance (Wasserman and Sandelin, 2004). Having realized this weakness, researchers have tried to incorporate other information in the PSWM-based approaches, hoping to reduce the false-positive rate. These include combining predictions with gene expression data (Zhu et al., 2002), using prior knowledge of gene co-regulation (Kielbasa et al., 2001), or taking advantage of the fact that TF binding sites often form clusters (known as cis-regulatory modules or CRM) (Berman et al., 2002).

Nucleotides in TF binding sites do not always act independently in reference to their binding proteins (Wolfe et al., 1999; Man and Stormo, 2001; Udalova et al., 2002). This has led many researchers to question the base independence assumption on which techniques such as consensus sequences and weight matrices for identifying TF binding sites are based (Bulyk et al., 2002; Man and Stormo, 2001; Udalova et al., 2002). For example, Man and Stormo (2001) showed that the interactions of Salmonella bacteriophage repressor Mnt with its operator DNA at positions 16 and 17 were not independent. Similar results were described by Barash et al. (2003), Bulyk et al. (2002), and Udalova et al. (2002). All this evidence suggests an intrinsic weakness in the general base independence assumption of nucleotides in protein-DNA interactions. This issue has been addressed by various authors with a number of more sophisticated techniques including improved weight matrices with prior information on correlated nucleotide positions (Zhang and Marr, 1993; Stormo et al., 1986; Zhou and Liu, 2004), biophysical approaches (Djordjevic et al., 2003), nonparametric models (King and Roth, 2003), neural networks (Mahadevan and Ghosh, 1994; Horton and Kanehisa, 1992), and principal coordinates analysis (Udalova et al., 2002). This paper describes a novel approach for identifying TF binding sites, which among other things exploits the possible presence of nucleotide polymorphisms in these sites to improve prediction specificity.

Another feature that plays a role in protein–DNA interactions is nucleotide structure (Aoyama and Takanami, 1985; Nussinov, 1984; El Hassan and Calladine, 1998). It is reasonable to expect, given the multitude of binding sites recognized by the transcription machinery, that there are other factors besides sequence conservation that influence the binding process. It has been shown, for example, that the binding of transcription factors causes significant distortion to the regular twist and bending of the double helix (Gustafson et al., 1989; Schreck et al., 1990; Stefanovsky et al., 1996). This often results in DNA changing conformation from B-form to A- or Z-form (Lisser and Margalit, 1994). The affinity of the DNA for protein will depend on its ability to tolerate such structural distortion from classical B-form. Nussinov (1984), for example, demonstrated the presence of structural homology in regions with weak sequence homology at sites −10, −35, and −16 of the Escherichia coli promoter recognized by its polymerase. Such physical properties of a DNA helix can be expressed in terms of its conformational parameters in dinucleotide and trinucleotide models. There are many different such parameters reported in the literature. The Property Database (Ponomarenko et al., 2003, 1999), for example, lists 38 such parameters. Some of the more commonly used conformational parameters are the dinucleotide parameters propeller twist, protein-induced deformability, and stacking energy, and the trinucleotide parameter DNase I sensitivity (Brukner et al., 1995). Many algorithms have been developed to analyze binding sites based on their structural homology (see Lisser and Margalit [1994], Ponomarenko et al. [1997], Ohler et al. [2001], and Thayer and Beveridge [2002]). Structural homology is one of the discriminatory features used in the template models described in this paper.

2. TEMPLATES FOR FINDING TF BINDING SITES

One of the important features of a template is its ability to capture covariance information between different nucleotide positions, which is not possible with consensus sequences or weight matrices.


FIG. 1. The 16 sites generated by altering positions 16 and 17 of the Mnt repressor-operator consensus sequence.

For example, Fig. 1 shows the 16 sites obtained from the Mnt repressor-operator consensus sequence “ATAGGTCCACGGTGGACCTGT” by varying the nucleotides in positions 16 and 17. The Mnt binding protein shows a preferential binding affinity to nucleotide A in position 16 if position 17 has C. The binding affinity changes to nucleotide C at position 16 if position 17 is not nucleotide C (Man and Stormo, 2001). Figure 2 shows a plot of these 16 sites in relation to their nucleotides in positions 16 and 17. It can be seen from this plot that in order to identify all four binding sites described above using a degenerate consensus sequence or weight matrix, one will have to allow for an A or a C in position 16 and any nucleotide in position 17. As a consequence of the positional independence assumption on which these methods are based, allowing for these nucleotide combinations in these positions will inevitably lead to the false classification of the four sites with nucleotide pairs AA, CC, AG, and AT in positions 16 and 17.
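The argument above can be checked by simple enumeration. The following is a minimal sketch, assuming only the consensus sequence and the four high-affinity pairs (AC, CA, CG, CT) stated in the text; the function names `variants` and `independent_model_accepts` are hypothetical helpers introduced for illustration, not part of the authors' software.

```python
from itertools import product

CONSENSUS = "ATAGGTCCACGGTGGACCTGT"        # Mnt consensus quoted in the text
HIGH_AFFINITY = {"AC", "CA", "CG", "CT"}   # pairs at positions 16-17 (from the text)

def variants(consensus, i, j):
    """All 16 sites obtained by varying 1-based positions i and j."""
    sites = {}
    for x, y in product("ACGT", repeat=2):
        s = list(consensus)
        s[i - 1], s[j - 1] = x, y
        sites[x + y] = "".join(s)
    return sites

def independent_model_accepts(pair):
    """Position-independent rule: A or C at 16, any base at 17 (IUPAC 'MN')."""
    return pair[0] in "AC" and pair[1] in "ACGT"

sites = variants(CONSENSUS, 16, 17)
accepted = sorted(p for p in sites if independent_model_accepts(p))
false_positives = sorted(p for p in accepted if p not in HIGH_AFFINITY)

print(accepted)         # the eight pairs admitted by the independent rule
print(false_positives)  # the pairs admitted without being high-affinity sites
```

Any rule that treats the two positions independently and still admits all four high-affinity pairs must also admit the remaining four pairs, which is the limitation that templates are meant to address.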

FIG. 2. The 16 Mnt repressor-operator sites plotted against their nucleotides in positions 16 and 17. The surface of the plot shows the log of the mean template error computed for each of these sites.


The template-based method we describe here, on the other hand, can classify these 16 sites unambiguously. The template error (Def. 2.3) computed for each of the 16 sites is described by the surface of the plot in Fig. 2. The troughs on this surface with near-zero template error correspond to the four sites with high binding affinity described above. The other 12 sites with low binding affinity display a relatively high template error, which is indicated by the elevation of the surface of the plot.

2.1. Method

2.1.1. Preparation of training sequences. Transcription factor binding sites for our work were obtained from a multitude of sources (Thayer and Beveridge, 2002; Vorobiev et al., 1998; Wingender et al., 2001). Given a set of unprocessed nucleotide sequences which are known binding sites for a particular transcription factor (possibly of varying length), the alignment program in Target Explorer (Sosinsky et al., 2002) was used to obtain an aligned set of sequences of a fixed length. For purposes of training the classifier, we also need a further set of DNA sequences of the same fixed length, which are not binding sites. These negative examples were generated in a number of different ways as described in Section 3.

In the template model, templates are calculated from a given numerical encoding of the nucleotides forming the training set of binding sites. The numerical encoding can be some value assigned to individual nucleotides or a value assigned to a combination of them. Values can be assigned to single nucleotides to capture sequence properties (e.g., sequence homology) of the sites. Values can be assigned to di- and trinucleotides to capture geometric and structural properties (e.g., propeller twist, stacking energy, protein-induced deformability, DNase I sensitivity, etc.) of the sites. For a given nucleotide sequence s and a given nucleotide parameter p, the resulting numerical vector will be denoted $r_p(s)$ (see Fig. 3).

Each nucleotide sequence is first converted into a table of numerical representations consisting of mono-, di-, and trinucleotide parameter values. The dinucleotide parameters, representing structural properties of the sequences, were obtained from the Property Database (Ponomarenko et al., 1999). In its current release, the database lists 38 different parameter values. We used all 38 of these parameters. The trinucleotide parameters were obtained from Brukner et al. (1995). For each nucleotide sequence s, the representations of s that we work with will be denoted $r_1(s), r_2(s), \ldots, r_m(s)$, where m is the number of parameters selected.
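The encoding step can be sketched as a sliding-window lookup. This is an illustration only: the table `TOY_DINUC` below holds placeholder numbers (simply the index of each dinucleotide) and does not reproduce any value from the Property Database or from Brukner et al. (1995); the function name `encode` is likewise an assumption.

```python
from itertools import product

# HYPOTHETICAL parameter table: placeholder numbers, NOT real structural
# parameter values. Each dinucleotide is mapped to its index 0..15.
TOY_DINUC = {a + b: float(k)
             for k, (a, b) in enumerate(product("ACGT", repeat=2))}

def encode(seq, table, k):
    """Slide a window of width k over seq and look up each k-mer in table."""
    return [table[seq[i:i + k]] for i in range(len(seq) - k + 1)]

s = "ACGTAC"
r = encode(s, TOY_DINUC, 2)   # one value per dinucleotide step
print(r)
```

With a dinucleotide table (k = 2) the encoded vector is one element shorter than the sequence, and with a trinucleotide table (k = 3) it is two elements shorter.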

2.1.2. Templates. The key to our method is the use of numerical templates to capture certain key features of TF binding sites. Base-independent models of TF binding sites do not account for dependencies that might be present between nucleotides in different positions of the site when interacting with proteins. One problem of modeling nucleotide polymorphisms in a general model of TF binding sites is that the nucleotide positions that exhibit such correlations vary from factor to factor. As the exact positions on the TF binding sites which are correlated are unknown in the general case, one would need a model that accounts for all pairs of positions on the sites to fully represent them, which will need a very large number of parameters (e.g., a fully connected HMM). Templates present a compromise between the base-independent model and the fully connected model. They model the correlation of an individual position relative to the rest of the positions on the site. By restricting the expression of correlation of a given position on the sites to all the other positions, instead of individual pairs of positions, templates are able to reduce the number of parameters required to the length of the sites, while still capturing a global expression of the positional correlation present in them.

A template is modeled by a linear steady state system that we will call a Xi system (denoted Ξ(.)). The sensitivity of Ξ(.) systems to correlated variations of their potential variables makes them good candidates for modelling TF binding sites. The discussion that follows shows how this is done.

FIG. 3. The encoding rp(s) of sequence s by the dinucleotide step parameter values p = “Slide.”


FIG. 4. A Ξ(.) system with L potential nodes.

Definition 2.1. We define a Ξ(.) system as a linear steady state system comprising a set of potential nodes p, multipliers a, and balancing nodes b (see Fig. 4). Each potential node has a unique multiplier associated with it. The network of potential nodes forms a fully connected graph, connected via the multipliers a[1], a[2], . . . , a[L] that act as distribution points for their potentials. Each potential node is also associated with a unique balancing node. These are external nodes b[1], b[2], . . . , b[L] that provide a balancing potential to the system.

The balancing equations of a Ξ(.) system in a steady state with L potential nodes are given by

$$p[j] = \sum_{i=1;\, i \neq j}^{L} a[i]\, p[i] + b[j], \qquad j = 1, \ldots, L \tag{1}$$

which can be expressed in matrix form as

$$(Q\,\operatorname{diag}(p))\, a = p - b \tag{2}$$

where $p = (p[1], p[2], \ldots, p[L])^T$, $a = (a[1], a[2], \ldots, a[L])^T$, $b = (b[1], b[2], \ldots, b[L])^T$, and $Q_{(L \times L)}$ is a square matrix with zeros on the diagonal and ones everywhere else.
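As a quick sanity check, the component form of Equation (1) and the matrix form of Equation (2) can be compared numerically. The sketch below uses NumPy and arbitrary example values for p and a; the function names are illustrative assumptions.

```python
import numpy as np

def balancing_componentwise(p, a):
    """b[j] = p[j] - sum_{i != j} a[i] p[i], rearranged from Equation (1)."""
    L = len(p)
    return np.array([p[j] - sum(a[i] * p[i] for i in range(L) if i != j)
                     for j in range(L)])

def balancing_matrix(p, a):
    """b = p - (Q diag(p)) a, rearranged from Equation (2)."""
    L = len(p)
    Q = np.ones((L, L)) - np.eye(L)   # zeros on the diagonal, ones elsewhere
    return p - (Q @ np.diag(p)) @ a

p = np.array([1.0, 2.0, 3.0, 4.0])     # arbitrary potentials
a = np.array([0.5, -0.25, 0.1, 0.2])   # arbitrary multipliers
b1 = balancing_componentwise(p, a)
b2 = balancing_matrix(p, a)
print(np.allclose(b1, b2))
```

Row j of Q diag(p) picks up every product a[i] p[i] except the one with i = j, which is exactly the restricted sum in Equation (1).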

Definition 2.2. A template, t, of length L, is a numerical vector (t[1], t[2], . . . , t[L]) characterized by the multipliers a[1], a[2], . . . , a[L] of a Ξ(.) system in steady state with L input potentials such that t[i] = a[i] for all i = 1, . . . , L.

The vector t = (t[1], t[2], . . . , t[L]) will be referred to as the template parameters of t.

Definition 2.3. For any numerical vector r = (r[1], r[2], . . . , r[L]), the prediction error (also referred to as the template error) of r with respect to the template t will be denoted as E(r, t) and is defined as the sum of squared balancing potentials of the Ξ(.) system:

$$E(r, t) = b[1]^2 + b[2]^2 + \cdots + b[L]^2 \tag{3}$$
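Definition 2.3 translates directly into a few lines of NumPy. This is a minimal sketch, assuming the substitution p = r and a = t in Equation (2); the function name `template_error` is an illustrative choice.

```python
import numpy as np

def template_error(r, t):
    """Sum of squared balancing potentials, with p = r and a = t in Eq. (2)."""
    r = np.asarray(r, dtype=float)
    t = np.asarray(t, dtype=float)
    L = len(r)
    Q = np.ones((L, L)) - np.eye(L)
    b = r - (Q @ np.diag(r)) @ t      # balancing potentials
    return float(b @ b)

r = np.array([1.0, 2.0, 3.0])
E_zero_template = template_error(r, np.zeros(3))   # b = r, so E = 1 + 4 + 9
E_ones_template = template_error(r, np.ones(3))    # b = (-4, -2, 0), so E = 20
print(E_zero_template, E_ones_template)
```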

Given a numerical vector r, we can find a set of template parameters t that minimizes the prediction error E(r, t) for that vector. This minimization process is referred to as training the template. The template t that minimizes E(r, t) for the vector r is obtained as follows:

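$$\arg\min_{t} E(r, t) = \arg\min_{t}\left(b[1]^2 + b[2]^2 + \cdots + b[L]^2\right) = \arg\min_{t}\left(b^{T} b\right).$$

Making the substitution $b = p - (Q\,\operatorname{diag}(p))\,a$, $p = r$, and $a = t$,

$$\arg\min_{t} E(r, t) = \arg\min_{t}\left((r - Q_r t)^{T}(r - Q_r t)\right)$$

where $Q_r = Q\,\operatorname{diag}(r)$.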
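For a single vector r, the minimization can be worked through explicitly. The following expansion is a standard least-squares step, written with the symbols defined above:

```latex
\begin{aligned}
E(r,t) &= (r - Q_r t)^{T}(r - Q_r t)
        = r^{T}r - 2\,t^{T}Q_r^{T}r + t^{T}Q_r^{T}Q_r\,t, \\
\frac{\partial E}{\partial t} &= -2\,Q_r^{T}r + 2\,Q_r^{T}Q_r\,t = 0
\;\Longrightarrow\; Q_r^{T}Q_r\,t = Q_r^{T}r .
\end{aligned}
```

Summing the same normal equations over several training vectors gives the multi-vector solution.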


For any set of numerical vectors, $\{r_1, r_2, \ldots, r_n\}$, the mean value of the prediction error with respect to a fixed template t is given by

$$\frac{1}{n}\sum_{k=1}^{n} E(r_k, t). \tag{4}$$

The template that minimizes this mean error value for this set of vectors can be obtained by calculating the partial derivatives of Equation (4) with respect to t[1], t[2], . . . , t[L] and setting each of these equal to zero. This gives the following set of L linear equations:

$$t = \left[\sum_{k=1}^{n} Q_{r_k}^{T} Q_{r_k}\right]^{-1} \left[\sum_{k=1}^{n} Q_{r_k}^{T} r_k\right] \tag{5}$$

where $Q_{r_k} = Q\,\operatorname{diag}(r_k)$. These equations are symmetric and can be solved efficiently to find the set of template parameters t[1], t[2], . . . , t[L], which minimizes the mean prediction error for the set of vectors.

The balancing potentials of a Ξ(.) system have an intrinsic covariance associated with them. These interdependencies are reflected in the balancing equations described in Equation (2). The prediction error is defined in relation to these potentials. Our purpose in solving for a set of template parameters is to obtain a set of values for t that minimizes the prediction error relative to a given set of training sites.
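Training a template with Equation (5) amounts to accumulating and solving a set of normal equations. The sketch below is one possible NumPy rendering, not the authors' implementation; the training vectors are synthetic random numbers, and the names `q_matrix`, `train_template`, and `mean_error` are assumptions made for illustration.

```python
import numpy as np

def q_matrix(L):
    """Square matrix with zeros on the diagonal and ones elsewhere."""
    return np.ones((L, L)) - np.eye(L)

def train_template(R):
    """Solve Equation (5) for the template t given training vectors R (n x L)."""
    R = np.asarray(R, dtype=float)
    L = R.shape[1]
    Q = q_matrix(L)
    lhs = np.zeros((L, L))
    rhs = np.zeros(L)
    for r in R:
        Qr = Q @ np.diag(r)
        lhs += Qr.T @ Qr          # symmetric accumulation
        rhs += Qr.T @ r
    return np.linalg.solve(lhs, rhs)

def mean_error(R, t):
    """Mean template error of Equation (4)."""
    Q = q_matrix(R.shape[1])
    return float(np.mean([np.sum((r - Q @ np.diag(r) @ t) ** 2) for r in R]))

rng = np.random.default_rng(0)
R = rng.uniform(1.0, 2.0, size=(5, 4))   # synthetic vectors, not binding-site data
t = train_template(R)
print(t, mean_error(R, t))
```

Because the accumulated matrix is symmetric, a Cholesky-based solver could be substituted for `np.linalg.solve` when it is also positive definite; the general solver is used here to keep the sketch short.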

2.1.3. Evidence of nucleotide covariation captured by templates. Templates account for nucleotide covariations when distinguishing TF binding sites from nonbinding sites. The template error of sites with patterns that reflect similar correlated nucleotide variations as those sites in the training data is relatively low compared to those sites that do not reflect such variations.

Consider the sites shown in Fig. 1. Of the 16 sites listed, 4 have a high affinity to the Mnt repressor-operator protein. These are the sites characterized by nucleotides AC, CA, CT, and CG in positions 16 and 17, respectively (the first four sites in the list). As was explained above, it is not possible to fully separate these four sites from the rest of the sites using a linear discriminator, the main hindrance being the separation of these sites from the four sites with nucleotides AA, AG, AT, and CC in these two positions (the last four sites in the list). The reason for this is that the first four sites with high binding affinity and the last four sites with low binding affinity are both represented by the same IUPAC consensus sequence MN in positions 16 and 17. They differ from each other only by the different correlated variations of nucleotides in these two positions. If one discounts this correlated variation, as is the case in linear discriminatory models, there would in essence be no difference between these two sets of sites.

Figure 5 shows the template error for the sites given in Fig. 1. In it, a clear distinction can be seen between the low template error of the high affinity sites (the first four sites) and the relatively high template error for the rest of the sites, including the last four sites (with low binding affinity), which have the same consensus sequence in positions 16 and 17 as the first four sites and vary only in the relative variation in nucleotides in these two positions.

The covariance effects captured by a template can be negated by weighting the least-squares fit for finding the template parameters by the inverse of the variance-covariance matrix of its balancing potentials (see Box 1).

The template error so obtained for the 16 sites is shown in Fig. 6. As expected, when the covariance effects are removed from the template error, the sites most affected are the four sites with residues AA, AG, AT, and CC in positions 16 and 17, respectively, whose template errors approach zero, closer to the values of the high affinity sites (note that the weighted least-squares fit for finding the template parameters is an approximate iterative procedure, which could explain the slightly higher template errors of the last four sites relative to the first four sites). These are the sites that differ from the high affinity sites only in their covariation pattern of the two nucleotides in positions 16 and 17.

This simple but illustrative example demonstrates the ability of templates to capture nucleotide covariations for discriminative purposes. The important point is that the templates are not provided with any prior information on which nucleotide positions are correlated.


FIG. 5. The template error for the 16 sites given in Fig. 1.

Box 1. Weighted least-squares fit of the template error.

The weighted least-squares fit of the template error can be expressed as the minimization problem

$$\arg\min_{t}\left(\sum_{k=1}^{n} (r_k - Q_{r_k} t)^{T}\, W^{-1}\, (r_k - Q_{r_k} t)\right) \tag{6}$$

where W is the variance-covariance matrix of the balancing potentials of the Ξ(.) system. Equating the partial derivatives of this expression with respect to t to zero leads to the template parameters

$$t = \left[\sum_{k=1}^{n} Q_{r_k}^{T}\, W^{-1}\, Q_{r_k}\right]^{-1} \left[\sum_{k=1}^{n} Q_{r_k}^{T}\, W^{-1}\, r_k\right] \tag{7}$$

As the value of W is dependent on t, there is no closed-form solution for the above equations. An approximate solution for t and W is found using the following iterative procedure.

• Initial state:
  1. Compute $t^{(0)}$ using OLS, Equation (5).
• At the nth iteration:
  2. Compute the weight matrix $W^{(n)}$.
  3. Compute $t^{(n)}$ using Equation (7).
  4. If $t^{(n)}$ is very close to $t^{(n-1)}$, then stop; otherwise go to step 2.
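The iterative procedure of Box 1 can be sketched as follows. Several details are assumptions that the text does not specify: W is estimated as the sample covariance of the balancing potentials across the training vectors, a small ridge term (`ridge`) is added so that W stays invertible, and the stopping tolerance (`tol`) and iteration cap (`max_iter`) are arbitrary. The data are synthetic.

```python
import numpy as np

def q_matrix(L):
    return np.ones((L, L)) - np.eye(L)

def solve_weighted(R, W_inv):
    """Equation (7) for a fixed inverse weight matrix (identity gives Eq. (5))."""
    L = R.shape[1]
    Q = q_matrix(L)
    lhs = np.zeros((L, L))
    rhs = np.zeros(L)
    for r in R:
        Qr = Q @ np.diag(r)
        lhs += Qr.T @ W_inv @ Qr
        rhs += Qr.T @ W_inv @ r
    return np.linalg.solve(lhs, rhs)

def balancing_potentials(R, t):
    """One row of balancing potentials b = r - Q diag(r) t per training vector."""
    Q = q_matrix(R.shape[1])
    return np.array([r - Q @ np.diag(r) @ t for r in R])

def train_weighted(R, ridge=1e-6, tol=1e-8, max_iter=50):
    R = np.asarray(R, dtype=float)
    L = R.shape[1]
    t = solve_weighted(R, np.eye(L))                     # step 1: OLS start
    for _ in range(max_iter):
        B = balancing_potentials(R, t)
        # step 2: ASSUMED estimator of W, with a ridge term for invertibility
        W = np.cov(B, rowvar=False) + ridge * np.eye(L)
        t_new = solve_weighted(R, np.linalg.inv(W))      # step 3
        if np.linalg.norm(t_new - t) < tol:              # step 4
            return t_new
        t = t_new
    return t                                             # cap reached

rng = np.random.default_rng(3)
R = rng.uniform(1.0, 2.0, size=(30, 4))   # synthetic training vectors
t_w = train_weighted(R)
print(t_w)
```

The loop is not guaranteed to converge within the cap, which is why the last iterate is returned rather than an error being raised.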


FIG. 6. The template error for the 16 sites given in Fig. 1 with covariance effects removed.

2.1.4. Classification. Once the templates have been trained, we use linear discriminant analysis (LDA) to distinguish binding sites from nonbinding sites based on their template errors. For this, a linear discriminant analyzer is trained on the template errors of known positive and negative examples to optimally separate the two classes. LDA is a data classification technique that seeks to maximize the ratio of between-class variance to the within-class variance in a dataset. There are two main approaches to LDA, class-dependent transformation and class-independent transformation. We use the second approach, which involves maximizing the ratio of overall variance to within-class variance. This approach is more suitable when the amount of training data of the different classes is limited and is more efficient in that it uses only one optimization criterion to transform the datasets irrespective of their class identity.

The LDA classification procedure is as follows: Let $m_s$ (of size $m \times 1$) be the vector of mean prediction errors from m templates for a set $X_s$ (of size $n_s \times m$) of template errors of positive sites. Let $m_n$ (of size $m \times 1$) be the vector of mean prediction errors of the templates for a set $X_n$ (of size $n_n \times m$) of template errors of negative sites. The global mean prediction error $m_g$ (of size $m \times 1$) of the training examples for the templates is computed as

$$m_g = \frac{(n_s - 1)\, m_s + (n_n - 1)\, m_n}{n_s + n_n - 2}. \tag{8}$$

In LDA, the criterion for separating different classes is based on the within-class and between-class scatter of the training data in each class. The within-class scatter $S_w$ (of size $m \times m$), i.e., the expected covariance of each class, is computed using the equation

$$S_w = \frac{(n_s - 1)\, \operatorname{cov}(X_s) + (n_n - 1)\, \operatorname{cov}(X_n)}{n_s + n_n - 2} \tag{9}$$

where cov(X) is the variance–covariance matrix of X. The between-class scatter $S_b$ (of size $m \times m$) is computed using the equation

$$S_b = (m_s - m_g)(m_s - m_g)^{T} + (m_n - m_g)(m_n - m_g)^{T} \tag{10}$$


The between-class scatter can be seen as the covariance of the dataset whose members are the mean vectors of each class. Once we have computed $S_w$ and $S_b$, we can obtain the optimization criterion $O$ (of size $m \times m$) using the equation

$$O = S_w^{-1} S_b. \tag{11}$$
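Equations (8) through (11) can be computed in a few lines. The sketch below assumes that cov(X) is the usual sample covariance with rows as observations, which is what `np.cov(..., rowvar=False)` returns; the input matrices are synthetic stand-ins for template-error vectors, and the function name `lda_criterion` is hypothetical.

```python
import numpy as np

def lda_criterion(Xs, Xn):
    """Return (ms, mn, mg, Sw, Sb, O) following Equations (8)-(11)."""
    Xs = np.asarray(Xs, dtype=float)
    Xn = np.asarray(Xn, dtype=float)
    ns, nn = len(Xs), len(Xn)
    ms, mn = Xs.mean(axis=0), Xn.mean(axis=0)
    denom = ns + nn - 2
    mg = ((ns - 1) * ms + (nn - 1) * mn) / denom                    # Eq. (8)
    Sw = ((ns - 1) * np.cov(Xs, rowvar=False)
          + (nn - 1) * np.cov(Xn, rowvar=False)) / denom            # Eq. (9)
    Sb = np.outer(ms - mg, ms - mg) + np.outer(mn - mg, mn - mg)    # Eq. (10)
    O = np.linalg.solve(Sw, Sb)                                     # Eq. (11)
    return ms, mn, mg, Sw, Sb, O

rng = np.random.default_rng(1)
Xs = rng.normal(0.5, 0.2, size=(20, 3))   # synthetic "positive" error vectors
Xn = rng.normal(2.0, 0.5, size=(25, 3))   # synthetic "negative" error vectors
ms, mn, mg, Sw, Sb, O = lda_criterion(Xs, Xn)
print(O)
```

With only two classes, both difference vectors in Equation (10) are parallel to the difference of the class means, so the between-class scatter has rank one and the criterion matrix has a single nonzero eigenvalue. Using `np.linalg.solve` avoids forming the explicit inverse of the within-class scatter.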

To build a transformation matrix $\hat{O}$ (of size $m \times k$) of reduced dimensions from $O$, we select all the eigenvectors ($k \le m$) of $O$ with nonzero eigenvalues. Given two vectors of template errors, $x$ and $y$ (each of size $m \times 1$), the squared distance between these two vectors in the transformed space is given by $(x - y)^{T} \tilde{O}\, (x - y)$ where $\tilde{O} = \hat{O} \hat{O}^{T}$.

Given any vector $x$ (of size $m \times 1$) of template errors, let $D_s(x)^2$ be the squared distance in the transformed space between the vector x and the mean vector $m_s$ of prediction errors for a set of positive sites. Let $D_n(x)^2$ be the squared distance in the transformed space between the vector x and the mean vector $m_n$ of prediction errors for a set of negative sites. These two quantities are given by

$$\begin{aligned} D_s(x)^2 &= (x - m_s)^{T}\, \tilde{O}\, (x - m_s), \\ D_n(x)^2 &= (x - m_n)^{T}\, \tilde{O}\, (x - m_n). \end{aligned} \tag{12}$$

Note that if the matrix $\tilde{O}$ was the inverse of the pooled covariance matrix of all of the error vectors in the training set (i.e., $S_w^{-1}$), then the above distances would be the Mahalanobis distances (Webb, 2002) of the given vector x to the mean of the class of binding sites and class of nonbinding sites, respectively.

We can simplify the above two equations to a single quantity as follows:

$$D(x) = D_n(x)^2 - D_s(x)^2 = A x + B \tag{13}$$

where $A = 2\,(m_s - m_n)^{T} \tilde{O}$ and $B = -\tfrac{1}{2} A\, (m_s + m_n)$. D(x) is the signed distance between an arbitrary vector x and the discriminator hyperplane, D(x) = 0, which is located at the half-distance between the means $m_s$ and $m_n$ of the binding sites and nonbinding sites used for training the classifier. Figure 7 shows the plot of D(.) for the 16 sites shown in Fig. 1. A positive value of D(x) corresponds to the vector x representing a binding site, and a negative value corresponds to the vector x representing a nonbinding site.
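The discriminant can be sketched by computing the two squared distances of Equation (12) directly and taking their difference. In the sketch below, the m × k transformation matrix is built from the eigenvectors of the optimization criterion whose eigenvalues are not numerically zero, and the metric is its product with its own transpose. The relative tolerance `rel_tol` and the synthetic data are assumptions; the real part is taken because a general eigen-solver may return a complex array for a nonsymmetric matrix.

```python
import numpy as np

def transformation(O, rel_tol=1e-10):
    """Eigenvectors of O with nonzero eigenvalues, as columns (m x k)."""
    vals, vecs = np.linalg.eig(O)
    keep = np.abs(vals) > rel_tol * np.abs(vals).max()
    return np.real(vecs[:, keep])

def make_discriminant(ms, mn, O):
    T = transformation(O)
    M = T @ T.T                              # metric used in Equation (12)
    def D(x):
        ds2 = (x - ms) @ M @ (x - ms)        # squared distance to positive mean
        dn2 = (x - mn) @ M @ (x - mn)        # squared distance to negative mean
        return float(dn2 - ds2)
    return D

# Synthetic template-error vectors (not data from the paper)
rng = np.random.default_rng(2)
Xs = rng.normal(0.5, 0.2, size=(20, 3))
Xn = rng.normal(2.0, 0.5, size=(25, 3))
ns, nn = len(Xs), len(Xn)
ms, mn = Xs.mean(axis=0), Xn.mean(axis=0)
mg = ((ns - 1) * ms + (nn - 1) * mn) / (ns + nn - 2)
Sw = ((ns - 1) * np.cov(Xs, rowvar=False)
      + (nn - 1) * np.cov(Xn, rowvar=False)) / (ns + nn - 2)
Sb = np.outer(ms - mg, ms - mg) + np.outer(mn - mg, mn - mg)
D = make_discriminant(ms, mn, np.linalg.solve(Sw, Sb))

print(D(ms), D(mn), D((ms + mn) / 2))
```

Evaluating D at the two class means and at their midpoint is a convenient check: it should be positive at the mean of the binding sites, negative at the mean of the nonbinding sites, and zero halfway between them.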

2.1.5. Procedure summary. For convenience of explanation, we will use a single type of encoding, in this case dinucleotide parametric data, in the discussion that follows. The values can be changed accordingly if mononucleotides, trinucleotides, or a mix of them are used. The dimensions of the type of parameters selected affect only the encoded length of the template relative to the length of the given site. All other computations remain the same.

The positive examples in the training data (all of fixed length L + 1) are divided into two sets, one, $S_1 = \{s_{1,1}, s_{1,2}, \ldots, s_{1,n_1}\}$, of $n_1$ sequences for training the templates and the other, $S_2 = \{s_{2,1}, s_{2,2}, \ldots, s_{2,n_2}\}$, of $n_2$ sequences for training the classifier. Using the sequences $S_1$, for each dinucleotide parameter p (p = 1, . . . , m, where m is the number of parameters used), we construct a set of $n_1$ numerical vectors of length L, $\{r_p(s_{1,1}), r_p(s_{1,2}), \ldots, r_p(s_{1,n_1})\}$ (note that the length of a sequence encoded using dinucleotide parameters is one less than its original length). For each of these sets of vectors, we calculate the corresponding template, $t_p$, using Equation (5). Hence, we obtain m different templates $t_1, t_2, \ldots, t_m$, which are used for classification.

The classifier is trained on the remaining set of positive examples $S_2$ and a set of negative examples, $S_3 = \{s_{3,1}, s_{3,2}, \ldots, s_{3,n_3}\}$, of $n_3$ sequences of the same length L + 1. The first step in training the classifier is to compute the template error over the m templates for each sequence $s_{2,1}, s_{2,2}, \ldots, s_{2,n_2}, s_{3,1}, s_{3,2}, \ldots, s_{3,n_3}$ of the set of positive and negative examples after encoding them using the appropriate dinucleotide parameter corresponding to the given template. The mean values of the error vectors over the positive examples $S_2$ and the negative examples $S_3$ are computed. This gives a single vector to represent the class of binding sites, $m_s$, and the class of nonbinding sites, $m_n$ (each of size $m \times 1$). Using these vectors, $m_s$ and $m_n$, and the variance–covariance matrices of the template errors of the sequences $S_2$ and $S_3$, we compute the transformation matrix as described above.

FIG. 7. The difference in distance in LDA transformed space of a site to the mean of the nonbinding sites and mean of the binding sites of the training data. Positive values indicate sites that are closer to the mean of the binding sites, and negative values indicate sites that are closer to nonbinding sites.

Given any unclassified nucleotide sequence s of length L + 1, we can represent it as described above by using m numerical vectors of length L, $r_1(s), r_2(s), \ldots, r_m(s)$. For each of these numerical vectors, we compute the prediction error with respect to the corresponding template and hence obtain a vector of m prediction errors, which we call the error vector for s, and denote by v(s). That is,

$$v(s) = \left(E(r_1(s), t_1), \ldots, E(r_m(s), t_m)\right). \tag{14}$$

The error vector v(s) is used to classify the nucleotide sequence s as a binding site or a nonbinding site based on the sign of D(v(s)). The unknown nucleotide sequence s is classified as a binding site if D(v(s)) ≥ 0, and it is classified as a nonbinding site if D(v(s)) < 0. A worked example of the above procedure is described in Figs. 8–9.
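The final classification step can be sketched as follows. The encodings and templates are toy values: each template is fitted to a single vector, which it then reproduces with zero error, and `toy_D` is a hypothetical stand-in for a trained discriminant, included only so that the decision rule can be exercised.

```python
import numpy as np

def template_error(r, t):
    """Sum of squared balancing potentials of r with respect to template t."""
    r = np.asarray(r, dtype=float)
    L = len(r)
    Q = np.ones((L, L)) - np.eye(L)
    b = r - Q @ np.diag(r) @ t
    return float(b @ b)

def error_vector(encodings, templates):
    """Equation (14): one template error per parameter p = 1..m."""
    return np.array([template_error(r, t)
                     for r, t in zip(encodings, templates)])

def classify(v, D):
    """Binding site if D(v) >= 0, nonbinding site otherwise."""
    return "binding" if D(v) >= 0 else "nonbinding"

# Toy encodings r_1(s), r_2(s) of one site under two made-up parameters
encodings = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 1.0, 4.0])]
Q = np.ones((3, 3)) - np.eye(3)
# Templates fitted to a single vector each (exact solve, so the error is zero)
templates = [np.linalg.solve(Q @ np.diag(r), r) for r in encodings]
v = error_vector(encodings, templates)

# HYPOTHETICAL discriminant for illustration only (not a trained LDA model)
toy_D = lambda x: 1.0 - float(np.sum(x))
label = classify(v, toy_D)
print(v, label)
```

In a full pipeline, `toy_D` would be replaced by the discriminant trained on the error vectors of the positive and negative examples.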

3. RESULTS

We tested the template models on the following transcription factor binding sites. Data for the CAP binding sites were obtained from Thayer and Beveridge (2002). Data for the other binding sites were obtained from the SAMPLES database (Vorobiev et al., 1998).

• Escherichia coli catabolite gene activator protein (CAP). The Escherichia coli catabolite gene activator protein (CAP) is a DNA-binding protein involved in bacterial regulation that triggers the transcription of catabolite operons in the absence of glucose. It recognizes a 22 bp two-fold-symmetric DNA site with the consensus sequence AAATGTGATCTAGATCACATTT. This sequence has a highly conserved TGTGA motif and its inverted repeat TCACA, symmetrically placed about the centre of the polymer.

• Nuclear factor kappa B (NF-κB/Rel) (p50/p65). The NF-κB/Rel complex is a heterodimer composed of two DNA-binding subunits (NF-κB1 and relA). It plays a key role in the regulation of many genes that code for mediators of the immune, acute phase, and inflammatory responses. Homo- and heterodimers of members of the NF-κB/Rel family recognize nucleotide sequences with the consensus GGGACTTTCC. The 36 NF-κB transcription factor DNA binding sites obtained from the SAMPLES database relate to the organisms Rattus norvegicus (Norway rat), human immunodeficiency virus type 1, Mus musculus (house mouse), Homo sapiens (human), human herpesvirus 5, simian virus 40, and human adenovirus type 2.

• Activating protein-1 (AP-1). The dimeric transcription factor AP-1 is a multiprotein complex made up of Fos and Jun proteins together with ATF/CREB proteins. Factor AP-1 controls both basal and inducible transcription of several genes. It binds to the consensus sequence TGACTCA (TPA-responsive element (TRE)). The 41 AP-1 sites obtained from the SAMPLES database relate to the organisms simian virus 40, human adenovirus type 2, hepatitis B virus, human herpesvirus 5, human papillomavirus type 16, Homo sapiens (human), Mus musculus (house mouse), polyomavirus, Rattus norvegicus (Norway rat), visna virus, Drosophila melanogaster (fruit fly), and Gallus gallus (chicken).

• Nuclear factor-1 (NF-1). NF-1 is a family of transcription factors that have been shown to play important roles in tissue-specific transcription of differentiation-associated genes in a number of tissues. NF-1 proteins bind as dimers to the duplex DNA sequence with consensus TTGGCNNNNNGCCAA. The 71 NF-1 sites obtained from the SAMPLES database relate to the organisms feline leukemia virus, BK virus, Homo sapiens (human), Rattus norvegicus (Norway rat), hepatitis B virus, Mus musculus (house mouse), human adenovirus type 2, Xenopus laevis (African clawed frog), GR mouse mammary tumor virus, murine leukemia virus, murine retrovirus SL3-3, Sus scrofa (pig), human herpesvirus 5, human papillomavirus type 16, human T-cell lymphotropic virus type 1, JC virus, Gallus gallus (chicken), Ovis aries (sheep), and human herpesvirus 1.

• CCAAT box/enhancer binding protein (C/EBP). C/EBP forms dimers and consists of an activation domain, a DNA-binding basic region, and a leucine-rich dimerization domain (leucine zippers). It binds to the CCAAT-box with the consensus sequence GGYCAATCT. The 75 C/EBP sites obtained from the SAMPLES database relate to the organisms Homo sapiens (human), Mus musculus (house mouse), hepatitis B virus, Gallus gallus (chicken), avian myeloblastosis virus, Fujinami sarcoma virus, Moloney murine sarcoma virus, polyomavirus, Rattus norvegicus (Norway rat), Rous-associated virus type 0, simian virus 40, Xenopus laevis (African clawed frog), Rous sarcoma virus, human adenovirus type 2, and human herpesvirus 1.

FIG. 8. Worked example part A.

FIG. 9. Worked example part B.

Unlike positive examples, which are available from the different databases on TF binding sites, there are no known verifiable data on nonbinding sites for a given factor. The only definite means of confirming that a given site is nonbinding to a particular factor is to verify it experimentally in a wet-lab test. Any selection of data without such definite verification will leave a certain level of ambiguity on the true state of the data. Appreciating this fact, we approached the problem of assessing the performance of the template-based classifiers in two different ways in relation to the negative examples selected.

The results for the first of these experiments are shown in Table 1. The negative examples for this experiment were generated as follows. We took genomic sequences from the different organisms that contributed towards the TF binding sites of the given factor. All sequences that matched the known TF binding sites of the factor were filtered out of these sequences. The filtered sequences were used to generate sites for the negative examples. We will refer to these sites as negative sites (although there is no guarantee that these are strictly nonbinding sites for the given factor). For each experiment we used a set of 1,260 negative sites extracted randomly from the filtered sequences of the different species contributing towards the TF binding sites of the given factor. One hundred thirty of the 1,260 negative sites extracted for each factor were used to train the classifier, and the remaining 1,130 were used in the test dataset for that factor.
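The negative-site procedure just described can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the random stand-in genome, the site length of 10, and the use of exact string matching to filter out known binding sites are all assumptions made for the example.

```python
import random


def sample_negative_sites(sequences, known_sites, site_len, n_sites, rng):
    """Draw n_sites random windows of length site_len that match no known site.

    Assumption: "matching" means an exact string match; the text does not
    specify the matching criterion used for filtering.
    """
    known = set(known_sites)
    candidates = [
        seq[i:i + site_len]
        for seq in sequences
        for i in range(len(seq) - site_len + 1)
        if seq[i:i + site_len] not in known
    ]
    return rng.sample(candidates, n_sites)


def split_train_test(sites, n_train, rng):
    """Shuffle a copy of the sites and split off the first n_train for training."""
    shuffled = list(sites)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


rng = random.Random(0)
# Hypothetical stand-in for genomic sequence; real input would be genome data.
genome = "".join(rng.choice("ACGT") for _ in range(5000))
# Pretend these two windows are known binding sites of the factor.
known_sites = [genome[100:110], genome[2000:2010]]

negatives = sample_negative_sites([genome], known_sites, 10, 1260, rng)
train_neg, test_neg = split_train_test(negatives, 130, rng)
print(len(train_neg), len(test_neg))  # 130 1130
```

Drawing a fresh sample and split on each call, with a different seed, mirrors the repeated random selection used across trials.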

To investigate the robustness of our method, we ran every experiment 100 times, randomly selecting the training sequences for the templates and the classifier from the available positive and negative examples on each trial. The results in Table 1 show the mean values for the different statistics, with the standard deviation in brackets. The specificity values given in Table 1 are quite high, indicating that the number of false positives given by the classifier is quite low.
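For readers who want to reproduce this style of summary, the sketch below shows how per-trial sensitivity and specificity could be computed and reduced to a mean with a standard deviation. The confusion counts are hypothetical placeholders rather than data from the paper, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def sensitivity(true_pos, false_neg):
    """Fraction of true binding sites that the classifier recovers."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of negative sites that the classifier correctly rejects."""
    return true_neg / (true_neg + false_pos)


def summarize(values):
    """Return (mean, sample standard deviation) over repeated trials."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical per-trial counts (tp, fn, tn, fp); each trial has 22 positive
# and 1,130 negative test sites, matching the NF-kB test set sizes in Table 1.
trials = [(20, 2, 1070, 60), (19, 3, 1050, 80), (21, 1, 1062, 68)]

sens = [sensitivity(tp, fn) for tp, fn, _, _ in trials]
spec = [specificity(tn, fp) for _, _, tn, fp in trials]
sens_mean, sens_sd = summarize(sens)
spec_mean, spec_sd = summarize(spec)

print(f"sensitivity {sens_mean:.2f} ({sens_sd:.2f})")
print(f"specificity {spec_mean:.2f} ({spec_sd:.2f})")
```

In the actual experiments the list of trials would hold 100 entries, one per random selection of training data.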

To obtain another estimate of the rate of false positives, we ran the same templates on a test set of 1,130 random sequences. The resulting expected false positive rates for the templates listed in Table 1 are shown in Table 2.

The results in Table 2 are empirical estimates of the expected false positive rate of the templates. These results were obtained on a test set of 1,130 random sequences taken as negative sites (there is no guarantee that they are all nonbinding sites) over 100 experiments, with a new set of data randomly selected for each iteration.

Table 1. Statistics for Five Different Transcription Factors (a)

                Classify      Test          Sensitivity                 Specificity
TF      Train   T     F       T     F       Test          Test+Tra      Test          Test+Tra
NF-κB   7       7     130     22    1130    0.90 (0.05)   0.96 (0.03)   0.94 (0.01)   0.96 (0.01)
CAP     7       7     130     11    1130    0.85 (0.08)   0.94 (0.03)   0.91 (0.02)   0.94 (0.02)
NF-1    14      14    130     43    1130    0.90 (0.04)   0.94 (0.04)   0.93 (0.02)   0.95 (0.02)
CEBP    13      13    130     49    1130    0.86 (0.04)   0.93 (0.04)   0.90 (0.02)   0.93 (0.02)
AP-1    9       9     130     23    1130    0.77 (0.04)   0.87 (0.08)   0.91 (0.03)   0.93 (0.03)

(a) The sensitivity and specificity values listed under the columns “Test” are those obtained from only the test dataset, previously unseen by the classifier and not used for training the templates. The values under the columns “Test+Tra” are those obtained from the whole dataset, which gives an idea of how well the classifier does on the training data.


Table 2. Empirical Estimates of the Expected False Positive Rate of the Template-Based Classifiers

TF      Expected false positive rate
NF-κB   0.025 (0.009)
CAP     0.050 (0.020)
NF-1    0.060 (0.017)
CEBP    0.095 (0.022)
AP-1    0.062 (0.022)

4. DISCUSSION

We have described a novel approach for distinguishing TF binding sites from nonbinding sites. The approach is based on templates that are sensitive to positional covariations. These can be covariations expressing sequence or structural polymorphisms, as described by the different parametric encodings of the nucleotide sequence. Templates work in sets, usually containing more than one element, with each template characterizing a different sequence or structural property of the sites. Combining different templates, optimally selected to work in unison, has a synergistic effect on the discriminative and predictive capabilities of the system.

The training phase of the system requires experimental binding data, a subset representing all the potential binding sites. One advantage of templates is their ability, unlike other machine learning techniques such as neural networks or hidden Markov models, to learn quite well from a minimal number of examples. This feature has many practical advantages when we are dealing with a dearth of properly annotated examples. Theoretically, a single site is sufficient to construct a template, though the resulting template may not characterize the whole population well. This is in contrast to normal regression techniques, which require the cardinality of the set of training examples to be at least as great as the number of unknown parameters.

Binding assays of transcription factors such as NF-κB, Zif268 zinc fingers, and Mnt repressor-operator proteins provide strong evidence for the existence of nonindependent effects on positional interactions when at least some proteins bind to DNA. The exact positions that exhibit such interdependent effects vary from one factor to another, and there is no evidence that all transcription factors exhibit a similar pattern of behavior. This makes it difficult to capture such properties in a general model. The requirement is for models that can capture such behavior from a set of training data alone.

The sensitivity of templates to positional covariations is not based on any prior knowledge of which positions exhibit polymorphic behavior. This is an important characterization, especially in the absence of such prior knowledge individualizing a family of binding sites, which is usually the case. It is not always practical to build exhaustive models detailing the different covariations present between individual positions. Models such as neural networks and HMMs that are able to account for such information suffer from the practical drawback of balancing between the complexity of the system and the number of examples required to train it well. In these systems, the complexity of the model architecture imposes lower bounds on the number of examples required to form a good training set. These bounds usually increase exponentially with the increase in complexity of the system.

There is evidence (Aoyama and Takanami, 1985; Nussinov, 1984; el Hassan and Calladine, 1996) that suggests the presence of structural homologies in DNA sequences that interact with some transcription factors. This is the case in, for example, the E. coli catabolite gene activator protein binding sites. What these structural homologies are and exactly what geometric features play a part in them is not always very clear or easy to ascertain. Programs that incorporate such features do so with an implicit assumption of the presence of these properties in the sequences that they analyze. This is a weak assumption that may be tentative in the absence of specific knowledge of their presence and would not hold for the general case. It is possible for different binding sites to exhibit different structural properties intrinsic to the particular factor that they bind to. It is also possible for some binding sites not to display any significant structural homology for any of the known structural parameters. In such cases, one has only sequence homology to rely on.


Templates can model both sequence and structural homology. The important fact when modeling templates for a particular family of TF binding sites is that we do not make any prior decision on which structural parameters to use. The selection of the best set of parameters is done automatically during the training phase of the system. This reduction in dimensionality is achieved by LDA. The feature extraction process removes redundant and irrelevant information, providing a more stable representation of the data that leads to improved classification.

The time complexity of searching a whole genome for possible transcription factor binding sites using the method described here depends on three factors: the length of the genome G, the length of the template L, and the number of structural parameters used P (in actual fact, this comes down to the reduced dimensionality of the feature space after feature extraction). The error vector for a sequence of length L, defined in Equation (4), can be computed in $O(PL^2)$ time. The two distance measures for sites and nonsites used in the classifier can be computed in $O(P^2)$ time. This gives an overall time complexity of $O(PL^2 + P^2)$ for processing a single site in the genome. This has to be done for $G - L$ sites in the whole genome being searched. The entire process will therefore have a time complexity of $O(GP(L^2 + P))$.
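As a rough worked illustration of this bound, the operation counts can be written out for representative values. The values L = 12 and P = 15 fall within the typical ranges given in this discussion, G = 10^6 is an assumed genome length, and the constant factors hidden by the asymptotic notation are ignored, so the figures indicate an order of magnitude only.

```latex
\begin{align*}
\text{cost per site}   &\sim PL^2 + P^2 = P(L^2 + P) = 15\,(12^2 + 15) = 15 \times 159 = 2385,\\
\text{cost per genome} &\sim (G - L)\,P(L^2 + P) \approx 10^6 \times 2385 \approx 2.4 \times 10^9 .
\end{align*}
```

Because L and P stay small, the cost is dominated by the linear factor G.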

While G can be relatively large, L and P are generally small. The length of the templates, L, is similar to the length of the binding sites and would typically be around 10 to 12. The number of nucleotide parameters extracted, P, plays an important part in the classification phase. In the analyses that we have carried out, the system generally settles down to extracting 10 to 15 components from the transformed space for classification.

We have implemented the template model described above in the software system Matlab®. The total time required to compute the templates and train the classifier was around five to six seconds for each transcription factor on a PC running at 466 MHz. The time required to run the classifier on the test sets (≈1,260 sites) was around one to two seconds for each transcription factor. This scales to a time of around 20 minutes to scan a genomic sequence of a million base pairs.
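The extrapolation from the test-set timing to a million base pairs can be checked with a short calculation. The sketch below assumes that running time grows linearly with the number of candidate sites, that a single strand is scanned, and that the template length is 12; the function name and these choices are illustrative assumptions rather than details given in the text.

```python
def scan_time_minutes(genome_len, site_len, sites_timed, seconds_timed):
    """Extrapolate a measured per-site classification time to a full scan.

    Assumes time grows linearly with the number of candidate sites and that
    only one strand is scanned (both are simplifying assumptions).
    """
    n_windows = genome_len - site_len + 1
    seconds_per_site = seconds_timed / sites_timed
    return n_windows * seconds_per_site / 60


# Reported range: one to two seconds per roughly 1,260 test sites.
for seconds in (1.0, 1.5, 2.0):
    minutes = scan_time_minutes(1_000_000, 12, 1260, seconds)
    print(f"{seconds:.1f} s per 1,260 sites -> {minutes:.1f} min per Mbp")
```

Under these assumptions the midpoint of the reported range gives a little under 20 minutes, in line with the figure quoted above.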

Further investigation is needed to determine whether the template-based approach described here can be successfully applied to the identification of binding sites for other transcription factors not studied here and whether the method can be refined to achieve even greater accuracy.

REFERENCES

Aoyama, T., and Takanami, M. 1985. Essential structure of E. coli promoter II. Effect of the sequences around the RNA start point on promoter function. Nucl. Acids Res. 13(11), 4085–4096.

Bailey, T.L., and Elkan, C. 1995. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning 21(1–2), 51–80.

Barash, Y., Elidan, G., Friedman, F., and Kaplan, T. 2003. Modeling dependencies in protein-DNA binding sites. RECOMB 2003, 28–37.

Benos, P.V., Bulyk, M.L., and Stormo, G.D. 2002. Additivity in protein-DNA interactions: How good an approximation is it? Nucl. Acids Res. 30(20), 4442–4451.

Berman, B.P., Nibu, Y., Pfeiffer, B.D., Tomancak, P., Celniker, S.E., Levine, M., Rubin, G.M., and Eisen, M.B. 2002. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc. Natl. Acad. Sci. USA 99(2), 757–762.

Brukner, I., Sanchez, R., Suck, D., and Pongor, S. 1995. Sequence-dependent bending propensity of DNA as revealed by DNase I: Parameters for trinucleotides. EMBO J. 14, 1812–1818.

Bulyk, M.L., Johnson, P.L.F., and Church, G.M. 2002. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucl. Acids Res. 30(5), 1255–1261.

Chen, Q.K., Hertz, G.Z., and Stormo, G.D. 1995. MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 11, 563–566.

Demeler, B., and Zhou, G. 1991. Neural network optimization for E. coli promoter prediction. Nucl. Acids Res. 19, 1593–1599.

Djordjevic, M., Sengupta, A.M., and Shraiman, B.I. 2003. A biophysical approach to transcription factor binding site discovery. Genome Res. 13(11), 2381–2390.

el Hassan, M.A., and Calladine, C.R. 1996. Propeller-twisting of base-pairs and the conformational mobility of dinucleotide steps in DNA. J. Mol. Biol. 259, 95–103.

El Hassan, M.A., and Calladine, C.R. 1998. Two distinct modes of protein-induced bending in DNA. J. Mol. Biol. 282(2), 331–343.


Fickett, J.W. 1996. Quantitative discrimination of MEF2 sites. Mol. Cell. Biol. 16(1), 437–441.

Grundy, W.N., Bailey, T.L., and Elkan, C.P. 1996. ParaMEME: A parallel implementation and a web interface for a DNA and protein motif discovery tool. Comput. Appl. Biosci. 12(4), 303–310.

Gustafson, T., Taylor, A., and Kedes, L. 1989. DNA bending is induced by a transcription factor that interacts with the human c-FOS and alpha-actin promoters. Proc. Natl. Acad. Sci. USA 86(7), 2162–2166.

Horton, P.B., and Kanehisa, M. 1992. An assessment of neural network and statistical approaches for prediction of E. coli promoter sites. Nucl. Acids Res. 20, 4331–4338.

Kielbasa, S.M., Korbel, J.O., Beule, D., Schuchhardt, J., and Herzel, H. 2001. Combining frequency and positional information to predict transcription factor binding sites. Bioinformatics 17(11), 1019–1026.

King, O.D., and Roth, F.P. 2003. A nonparametric model for transcription factor binding sites. Nucl. Acids Res. 31(19), e116.

Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262(5131), 208–214.

Lisser, S., and Margalit, H. 1994. Determination of common structural features in Escherichia coli promoters by computer analysis. Eur. J. Biochem. 223(3), 823–830.

Mahadevan, I., and Ghosh, I. 1994. Analysis of E. coli promoter structures using neural networks. Nucl. Acids Res. 22(11), 2158–2165.

Man, T.K., and Stormo, G.D. 2001. Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucl. Acids Res. 29(12), 2471–2478.

Nussinov, R. 1984. Promoter helical structure variation at the Escherichia coli polymerase interaction sites. J. Biol. Chem. 259, 6798–6805.

Ohler, U., Niemann, H., Liao, G.C., and Rubin, G.M. 2001. Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics 17, 199–206.

O’Neill, M.C. 1991. Training back-propagation neural networks to define and detect DNA-binding sites. Nucl. Acids Res. 19, 313–318.

Ponomarenko, J.V., Merkulova, T.I., Orlova, G.V., Fokin, O.N., Gorshkova, E.V., Frolov, A.S., Valuev, V.P., and Ponomarenko, M.P. 2003. rSNP_Guide, a database system for analysis of transcription factor binding to DNA with variations: Application to genome annotation. Nucl. Acids Res. 31(1), 118–121.

Ponomarenko, J.V., Ponomarenko, M.P., Frolov, A.S., Vorobyev, D.G., Overton, G.C., and Kolchanov, N.A. 1999. Conformational and physicochemical DNA features specific for transcription factor binding sites. Bioinformatics 15(7/8), 654–668.

Ponomarenko, M.P., Ponomarenko, J.V., Kel, A.E., and Kolchanov, N.A. 1997. Search for DNA conformational features for functional sites. Investigation of the TATA box. Biocomputing: Proc. 1997 Pac. Symp., 340–351.

Prestridge, D.S. 1991. Signal Scan: A computer program that scans DNA sequences for eukaryotic transcriptional elements. Comput. Appl. Biosci. 7, 203–206.

Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. 1995. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucl. Acids Res. 23, 4878–4884.

Schreck, R., Zorbas, H., Winnacker, E.L., and Baeuerle, P.A. 1990. The NF-kappa B transcription factor induces DNA bending which is modulated by its 65-kD subunit. Nucl. Acids Res. 18(22), 6497–6502.

Schug, J., and Overton, G.C. 1997. TESS: Transcription element search software on the WWW. Technical report CBIL-TR-1997-1001-v0.0, Computational Biology and Informatics Laboratory, School of Medicine, University of Pennsylvania.

Sosinsky, A., Wildonger, J., Bonin, K., Mann, R., and Honig, B. 2002. Target explorer: an automated tool for identification of new target genes for specified set of transcription factors (poster). ISMB ’02.

Stefanovsky, V.Y., Bazett-Jones, D.P., Pelletier, G., and Moss, T. 1996. The DNA supercoiling architecture induced by the transcription factor xUBF requires three of its five HMG-boxes. Nucl. Acids Res. 24(16), 3208–3215.

Stormo, G.D., and Fields, D.S. 1998. Specificity, energy and information in DNA–protein interactions. Trends Biochemical Sciences 23, 109–113.

Stormo, G.D., Schneider, T.D., and Gold, L. 1986. Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucl. Acids Res. 14, 6661–6679.

Stormo, G.D., Schneider, T.D., Gold, L., and Ehrenfeucht, A. 1982. Use of the ‘Perceptron’ algorithm to distinguish translational initiation sites in E. coli. Nucl. Acids Res. 10, 2997–3011.

Thayer, K.M., and Beveridge, D.L. 2002. Hidden Markov models from molecular dynamics simulations on DNA. Proc. Natl. Acad. Sci. USA 99(13), 8642–8647.

Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., Makeev, V.J., Mironov, A.A., Noble, W.S., Pavesi, G., Pesole, G., Regnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., and Zhu, Z. 2005. Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnol. 23(1), 137–144.


Udalova, I.A., Mott, R., Field, D., and Kwiatkowski, D. 2002. Quantitative prediction of NF-kB DNA–protein interactions. Proc. Natl. Acad. Sci. USA 99, 8167–8172.

Vorobiev, D.G., Ponomarenko, J.V., and Podkolodnaya, O.A. 1998. Samples and aligned databases for functional site sequences. Proc. I Int. Conf. on Bioinformatics of Genome Regulation and Structure, 58–61.

Wasserman, W., and Sandelin, A. 2004. Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 5(4), 276–287.

Webb, A. 2002. Statistical Pattern Recognition, 2nd ed., John Wiley, West Sussex, England.

Wingender, E., Chen, X., Fricke, E., Geffers, R., Hehl, R., Liebich, I., Krull, M., Matys, V., Michael, H., Ohnhäuser, R., Prüß, M., Schacherer, F., Thiele, S., and Urbach, S. 2001. The TRANSFAC system on gene expression regulation. Nucl. Acids Res. 29, 281–283.

Wolfe, S.A., Greisman, H.A., Ramm, E.I., and Pabo, C.O. 1999. Analysis of zinc fingers optimized via phage display: Evaluating the utility of a recognition code. J. Mol. Biol. 285, 1917–1934.

Xie, X., Lu, J., Kulbokas, E.J., Golub, T.R., Mootha, V., Lindblad-Toh, K., Lander, E.S., and Kellis, M. 2005. Systematic discovery of regulatory motifs in human promoters and 3′ UTRs by comparison of several mammals. Nature 434(7031), 338–345.

Zhang, Q.M., and Marr, T.G. 1993. A weight array method for splicing signal analysis. Comput. Appl. Biosci. 9(5), 499–509.

Zhou, Q., and Liu, J.S. 2004. Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics 20(6), 909–916.

Zhu, Z., Pilpel, Y., and Church, G.M. 2002. Computational identification of transcription factor binding sites via a transcription-factor-centric clustering (TFCC) algorithm. J. Mol. Biol. 318(1), 71–81.

Address correspondence to:
Sumedha Gunewardena
Banting and Best Department of Medical Research
Donnelly CCBR
University of Toronto
160 College St.
Toronto, Ontario
M5S 3E1, Canada

E-mail: [email protected]
