yeast protein–protein interaction binding sites: prediction from the motif–motif, motif–domain...



Page 1: Yeast protein–protein interaction binding sites: prediction from the motif–motif, motif–domain and domain–domain levels

Mol. BioSyst., 2010, 6, 2164–2173. This journal is © The Royal Society of Chemistry 2010

Yeast protein–protein interaction binding sites: prediction from

the motif–motif, motif–domain and domain–domain levels

Erli Pang and Kui Lin*

Received 7th June 2010, Accepted 16th July 2010

DOI: 10.1039/c0mb00038h

Interacting proteins can contact each other at three different levels: by a domain binding to

another domain, by a domain binding to a short protein motif, or by a motif binding to another

motif. In our previous work, we proposed an approach to predict motif–motif binding sites

for the yeast interactome by contrasting high-quality positive interactions and high-quality

non-interactions using a simple statistical analysis. Here, we extend this idea to more

comprehensively infer binding sites, including domain–domain, domain–motif, and motif–motif

interactions. In this study, we integrated 2854 yeast proteins that undergo 13 531 high-quality

interactions and 3491 yeast proteins undergoing 578 459 high-quality non-interactions. Overall,

we found 6315 significant binding site pairs involving 2371 domains and 637 motifs. Benchmarked

using the iPfam, DIP CORE, and MIPS, our inferred results are reliable. Interestingly, some of

our predicted binding site pairs may, at least in the yeast genome, guide researchers to assay

novel protein–protein interactions by mutagenesis or other experiments. Our work demonstrates

that by inferring significant protein–protein binding sites at an aggregate level combining

domain–domain, domain–motif and motif–motif levels based on high-quality positive and

negative datasets, this method may be capable of identifying the binding site pairs that mediate

protein–protein interactions.

Introduction

Proteins usually carry out their functions in a cell by interacting with other molecules such as DNAs, RNAs, and other proteins. In protein–protein interactions (PPIs), two polypeptides interact with one another through a subset of residues known as a binding site.1,2 Thus, understanding these binding sites may provide an insight into the molecular and biochemical mechanisms of protein interactions. For instance, mutations in the binding sites may induce aberrant interactions and lead to abnormal cellular behavior and diseases.3 In the past few years, various high-throughput experimental methods such as the yeast two-hybrid method (Y2H),4,5 mass spectrometry (MS),6 tandem affinity purification (TAP),7,8 and protein-fragment complementation assay (PCA)9 have been proposed to identify PPIs. In addition, various kinds of computational methods10–14 have also been developed to complement the experimental techniques and have produced massive numbers of putative PPIs. To assess the reliability of the potential PPIs indicated either experimentally or computationally, different measurement indices have been proposed, such as the expression profile reliability (EPR),15 paralogous verification method (PVM),15 and protein localization method (PLM).16 It is the development of these technologies and methods, which generate a considerable number of reliable and diverse PPIs, that enables genome-wide studies of binding sites mediating protein–protein interactions.

Interacting proteins more often contact each other by a

domain binding to another domain, by a domain binding to a

short protein motif, or by a motif binding to another

motif.17,18 Currently, at the level of domain–domain (DD)

interactions, many inference methods have been proposed.

For example, the association method19,20 was developed to

define difference measures to identify domain pairs that are

observed in interacting proteins more frequently than expected

by chance or than in non-interacting proteins. Due to the high

number of false positives and false negatives produced by

the yeast two-hybrid approach, Deng et al.21 proposed a

maximum likelihood estimation (MLE) method that allows

the treatment of these errors. Lee et al.22 extended this idea

and used a Bayesian method to integrate evidence and

construct a high-confidence domain interaction set. To detect

low-propensity, high-specificity domain interactions, Riley et al.23

proposed the domain pair exclusion analysis (DPEA) method.

Other approaches include the parsimonious explanation

method (PE) using linear programming optimization,24 the

p-value method,25 and message-passing algorithms.26 At the

same time, there are also algorithms devised to infer binding

motif pairs.27–30 Most, if not all, of the aforementioned

methods infer putative PPI binding sites at either the domain

or the motif level separately. Moreover, with the exception of

our previous study,29 these methods usually use a positive PPI

dataset that is carefully assembled and a negative dataset that

is randomly sampled.

Obviously, with a comprehensive and high-quality PPI network for one genome, it is more reasonable to delineate PPI binding sites from an aggregate level that combines motif–motif (MM), motif–domain (MD), and domain–domain (DD) levels. To this end, we extended our previous computational idea29 and attempted to develop an integrative framework for PPI binding sites in the yeast genome. We integrated reliable, diverse and non-redundant experimental PPIs from extant studies31 and BioGRID32 into a high-quality positive PPI dataset, which is called the gold standard positive set (GSPs). The set of negative PPIs, called the gold standard negative set (GSNs), was obtained by our previous computational method14,33 and has been shown to be more reliable than other negative interaction sets. To identify the MM, MD and DD pairs that are statistically over-represented in the GSPs compared with the GSNs, the one-tailed Fisher’s exact test was applied. A multiple testing technique was also adopted to control the false discovery rate (FDR). To assess the performance of our predicted results, two different types of measurements were used. The first one, the sensitivity (SN), measures the enrichment of our predicted binding sites in the validation datasets. The second one, the positive predictive value (PPV), focuses on how many predicted binding sites are correct based on a given PPI dataset. Our results show satisfactory performance, as assessed by the SN and PPV measures. We also compared our method with three other approaches at the DD interaction level and found that our algorithm infers many more reliable interactions than the others. Taken together, these data demonstrate that, by simultaneously inferring significant protein–protein binding sites from the aggregate level based on high-quality positive and negative datasets, our simple method can efficiently identify significant and comprehensive binding site pairs that mediate PPIs. These inferred binding site pairs at the aggregate level should help in understanding the mechanism of protein–protein interactions and offer insight into protein functions, diseases and genomic evolution.34

State Key Laboratory of Earth Surface Processes and Resource Ecology and College of Life Sciences, Beijing Normal University, Beijing 100875, China. E-mail: [email protected]; Fax: +86-10-58807721; Tel: +86-10-58805045

Published on 17 August 2010.

Results

Our approach involves four steps: (1) generating GSPs and

GSNs datasets; (2) annotating the protein sequences with

domains and motifs; (3) deriving two-by-two tables of MM,

MD and DD pairs; (4) detecting significantly over-represented

pairs by carefully controlling the false discovery rate (FDR).

In this section, we will describe the results of this approach by

applying it to the GSPs and the GSNs we derived.
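Step (3) above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the counting rule (each protein pair contributes at most once per unordered feature pair), the function names and the toy proteins are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def pair_counts(protein_pairs, features):
    """Count how many protein pairs contain each unordered feature pair.

    Assumption: a protein pair "contains" a feature pair when one feature
    is annotated on one protein and the other feature on its partner.
    """
    counts = Counter()
    for prot_a, prot_b in protein_pairs:
        seen = set()
        for fa, fb in product(features.get(prot_a, ()), features.get(prot_b, ())):
            seen.add(tuple(sorted((fa, fb))))
        counts.update(seen)  # at most one count per protein pair
    return counts


def two_by_two(pair, gsp_counts, gsn_counts, n_gsp, n_gsn):
    """Return (a, b, c, d): GSPs with/without the pair, GSNs with/without it."""
    a = gsp_counts[pair]
    c = gsn_counts[pair]
    return a, n_gsp - a, c, n_gsn - c


# Hypothetical toy annotation (Pfam domains and Prosite motifs per protein).
features = {
    "P1": {"PF00018"},
    "P2": {"PS00006", "PF00130"},
    "P3": {"PF00018"},
    "P4": {"PS00001"},
}
gsp = [("P1", "P2"), ("P3", "P2")]                # toy positive interactions
gsn = [("P1", "P4"), ("P3", "P4"), ("P2", "P4")]  # toy non-interactions

gsp_counts = pair_counts(gsp, features)
gsn_counts = pair_counts(gsn, features)
table = two_by_two(("PF00018", "PF00130"), gsp_counts, gsn_counts, len(gsp), len(gsn))
```

Sorting each feature pair makes the key independent of protein order, so an MD pair is counted the same way whichever partner carries the motif.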

Significantly over-represented motif–motif, motif–domain, and

domain–domain pairs

Within the yeast genome we studied, the GSPs consist of

13 531 interactions that involve 2854 proteins and the GSNs

comprise 578 459 non-interactions from 3491 proteins. There

are 4435 non-redundant proteins in total. Each of these

non-redundant proteins was annotated at the domain level

using Pfam (Pfam release 23.0) and at the motif level using

Prosite (version 20.44). Through this annotation process, 2371

(PfamA) domains and 637 motifs were assigned, respectively.

To detect MM, MD and DD pairs that were significantly

over-represented in the GSPs compared with the GSNs, the

one-tailed Fisher’s exact test was applied, where the p-value

represents a measure of significance in terms of the false

positive rate. The q-value method was employed to identify

as many significantly over-represented pairs as possible

(see Materials and methods) while incurring a relatively low

FDR. In total, we obtained 6315 significant binding site pairs

with q-values less than 0.01, including 1124 MM, 3856 MD,

and 1335 DD interactions (Table 1).
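The test described above can be sketched in a few lines. The one-tailed Fisher's exact test is computed here as the upper hypergeometric tail of a two-by-two table. For multiple testing, this sketch substitutes the Benjamini-Hochberg adjustment as a simpler stand-in; it is not the q-value method used in the paper, and the function names are illustrative assumptions.

```python
from math import comb


def fisher_one_tailed(a, b, c, d):
    """Upper-tail p-value for the table [[a, b], [c, d]].

    a: GSP pairs containing the feature pair, b: GSP pairs without it,
    c: GSN pairs containing it,               d: GSN pairs without it.
    Returns P(X >= a) under the hypergeometric null.
    """
    row1 = a + b          # all positive protein pairs
    col1 = a + c          # all protein pairs containing the feature pair
    n = a + b + c + d
    denom = comb(n, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / denom


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (stand-in for q-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = fisher_one_tailed(2, 0, 0, 3)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Exact integer binomials keep the tail sum free of rounding error, at the cost of speed on tables as large as the full GSPs and GSNs; a production analysis would more likely use a statistics library.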

Validation of the inferred domain–domain interactions

The iPfam database consists of experimental DD interactions derived from the protein structures in the Protein Data Bank (PDB).35 Due to its high reliability, we used iPfam within Pfam (Pfam release 21.0) to assess the quality of our inferred DD interactions. Only DD interactions assigned to different proteins in yeast were used. Thus, we obtained 129 interactions involving 90 domains within iPfam. For these 90 domains, we inferred 191 interactions, of which 57 overlapped with iPfam interactions. More interestingly, we found 11 interactions (listed in Table 2) that are curated in iPfam but derive either from proteins of other organisms or from intra-protein interactions within yeast.

Proteins interact while participating in the same cellular

process, i.e. sharing a common function. This is usually used

to assess the reliability of PPIs.36,37 Similarly, we expected that

two components in a DD interaction pair might share a

common Gene Ontology (GO) annotation. We downloaded

the GO annotations of domains from Pfam23.0. For the 123

domain interactions without iPfam validation, three of them

have only one domain with GO annotation, and 108 have

either common molecular function or biological process GO

annotations. For the remaining 13 domain interactions, we investigated their GO structures and found that six of them have annotations in which the term of one domain is the parent of the term of the other domain. We also found one domain interaction whose members have a common parent, as illustrated in Fig. 1. Taken together, these validations indicate that, at the DD level, our results are reliable and provide high coverage, because most of the predicted domain interactions are confirmed either by iPfam or by GO annotations. This suggests that our results may include more reliable DD interactions than other approaches, such as iPfam. These newly inferred interactions may provide biologists with testable hypotheses about novel PPIs in yeast.

Table 1 The number of inferred binding site pairs based on the GSPs and the GSNs

Type               Number                         Binding pairs
Motif              637                            1124 (MM)
Motif and domain   637 motifs and 2371 domains    3856 (MD)
Domain             2371                           1335 (DD)
Aggregate levels   3008                           6315 (MM/MD/DD)

The first column indicates the type of the binding sites. The second column is the number of corresponding types, and the third column is the number of binding site pairs at the corresponding level.

Table 2 Interactions confirmed by iPfam from other organisms or from intra-protein interactions for yeast

PfamA accession   PfamA accession   Species
PF00071           PF03810           Homo sapiens
PF00071           PF00071           Mus musculus
PF00018           PF00018           Rattus norvegicus
PF01000           PF04983           Thermus aquaticus
PF00626           PF04815           S. cerevisiae
PF04811           PF04815           S. cerevisiae
PF01192           PF04983           T. aquaticus
PF00626           PF04811           S. cerevisiae
PF02463           PF02463           Pyrococcus furiosus
PF00626           PF00626           Oryctolagus cuniculus
PF04563           PF04998           P. furiosus

More comprehensive binding sites from the viewpoint

of protein–protein interactions

Due to the very low coverage of MM or MD experimental

binding site data, a systematic direct validation for MM and

MD interactions is not possible as it is with DD interactions.

Here, we performed validation for these two types of interactions

at the PPI level using the MIPS and DIP CORE datasets. For

MIPS, we excluded PPIs predicted by two-hybrid and obtained

4311 protein interactions. These interactions consist of 1686

proteins involving 1570 motifs and domains. For the DIP CORE

dataset, there are 3251 interactions comprised of 1906 proteins

involving 1668 motifs and domains.

The SN and the PPV were calculated for each validation dataset. We also compared these SN and PPV values with those of randomized binding site pair sets. Table 3 lists the validation results benchmarked on the two datasets. The SNs and PPVs of the MD interactions are lower than those of the MM interactions. The reason may be that motifs are short and may occur by chance in more proteins than domains do. To examine whether this is the case, we also mapped the DD interactions onto the PPIs and calculated the SNs and the PPVs (for MIPS, the SN and the PPV are 0.139 and 0.371, respectively; for DIP CORE, the SN and the PPV are 0.337 and 0.606, respectively). Although both SNs and PPVs are significantly higher than expected at random, they are much lower than those of the MM and MD interactions. Although the SNs and the PPVs are much higher than those of the sets of randomly chosen interaction pairs, the SNs of MM interactions are not statistically significant. An explanation may be that promiscuous motifs occur in a majority of proteins. Indeed, after removal of the promiscuous motifs, the SNs of MM interactions are significant for both datasets (see Discussion).

Fig. 1 Distribution of predicted domain–domain interactions. The pie chart represents the number of predicted domain–domain interactions and the domains that are shared with iPfam.

Table 3 The sensitivities (SNs) and the positive predictive values (PPVs) of inferred and randomized binding site pairs

Validation dataset   Interactions                  SN      PPV
MIPS                 Predicted MM interactions     0.790   0.813
MIPS                 Randomized MM interactions    0.306   0.053
MIPS                 Predicted MD interactions     0.608   0.763
MIPS                 Randomized MD interactions    0.101   0.030
DIP CORE             Predicted MM interactions     0.797   0.861
DIP CORE             Randomized MM interactions    0.265   0.048
DIP CORE             Predicted MD interactions     0.753   0.847
DIP CORE             Randomized MD interactions    0.103   0.026

The first column is the validation dataset; the second column is the type of interactions; the third column is the sensitivity (SN), calculated as TP/(TP + FN); and the fourth column is the positive predictive value (PPV), calculated as TP/(TP + FP), where TP, FN and FP denote true positives, false negatives and false positives, respectively, each estimated with respect to the given validation dataset. Here SN was defined as the fraction of protein interactions containing the predicted binding site pairs in the validation dataset, and PPV was defined as the fraction of predicted pairs mapped on the validation dataset.

Intuitively, the coverage of the binding sites for the aggregate level must be higher than that of each individual level. In addition, we expect that the PPV of the aggregate results should not decrease much; a large decrease would indicate that over-prediction had occurred. Fig. 2 shows that the SNs of the aggregate level are vastly improved without greatly affecting the PPVs. This demonstrates that the predicted binding sites at the aggregate level give a more comprehensive characterization of protein–protein interactions than does any individual level.
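The PPI-level definitions of SN and PPV used in this validation can be made concrete with a short sketch. It assumes, as the Table 3 note states, that SN is the fraction of validation interactions containing at least one predicted pair and that PPV is the fraction of predicted pairs that map onto at least one validation interaction. The function names and the toy data are hypothetical.

```python
def contains_pair(pair, feats_a, feats_b):
    """True if the two features of `pair` sit on opposite interaction partners."""
    x, y = pair
    return (x in feats_a and y in feats_b) or (y in feats_a and x in feats_b)


def sn_ppv(predicted_pairs, validation_ppis, features):
    """Return (SN, PPV) of predicted binding site pairs on a validation set."""
    covered = 0      # validation interactions explained by >= 1 predicted pair
    mapped = set()   # predicted pairs seen in >= 1 validation interaction
    for prot_a, prot_b in validation_ppis:
        feats_a = features.get(prot_a, set())
        feats_b = features.get(prot_b, set())
        hits = {p for p in predicted_pairs if contains_pair(p, feats_a, feats_b)}
        if hits:
            covered += 1
        mapped |= hits
    return covered / len(validation_ppis), len(mapped) / len(predicted_pairs)


# Hypothetical toy example: m* are motifs, d* are domains.
toy_features = {"A": {"m1"}, "B": {"d1"}, "C": {"d2"}, "D": {"m2"}}
toy_predicted = {("m1", "d1"), ("m2", "d9")}
toy_ppis = [("A", "B"), ("B", "C"), ("B", "A"), ("C", "D")]

sn, ppv = sn_ppv(toy_predicted, toy_ppis, toy_features)
```

Pooling the MM, MD and DD pairs into one `predicted_pairs` set gives the aggregate-level values: SN can only stay equal or rise as pairs are added, whereas PPV may fall, which is the trade-off examined in Fig. 2.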

Fig. 2 The sensitivity (SN), defined as TP/(TP + FN) (A), and the positive predictive value (PPV), defined as TP/(TP + FP) (B), estimated by DIP CORE and MIPS. MM, MD and DD along the x-axis refer to motif–motif interactions, motif–domain interactions, and domain–domain interactions, respectively. The three levels refer to the interactions of the combined motif–motif, motif–domain and domain–domain levels.

Understanding the effects of mutations

If a mutation occurs within a protein binding site, the corresponding interaction may be disrupted or its binding affinity may decrease, and such a mutation would affect the normal biological processes of the cell. Our predictions of MD interactions may suggest both possible mechanisms for such aberrant interactions.

For example, the predicted motif PS00006 (position 7–10) of TOP2_YEAST (Uniprot:P06786) is involved in binding to the domain PF00130 of KPC1_YEAST (Uniprot:P24583). Interestingly, Mouchel and Jenkins38 used a yeast two-hybrid protein interaction screen and identified an interaction between the catalytic domain of the yeast protein kinase 1 enzyme (Pkc1) and amino acids 3–37 of the Saccharomyces cerevisiae topoisomerase II. When they mutated amino acids 9 and 10 of the yeast N-terminal topoisomerase II, the interaction could no longer be detected. As depicted in Fig. 3A, this is consistent with our prediction that a mutation in the motif PS00006 (position 7–10) might disrupt the interaction.

Another example is that the predicted domain PF07653

(position 304–359) on SSU81_YEAST (Uniprot:P40073) may

be involved in binding the motif PS00007 (position 207–214) on

FUS1_YEAST. Nelson et al.39 identified the interaction

through domain SH3_2 on SSU81_YEAST and a proline-rich

peptide ligand on the Fus1p COOH-terminal cytoplasmic

region (95–512) in a two-hybrid screen. When they mutated

amino acid 352 of SSU81_YEAST, the interaction was

suppressed. This experiment further supports our prediction

results of the MD interaction (Fig. 3B).

Discussion

Proteins interact and make contact with each other at binding

sites. Previous related methods have been developed to detect

protein binding sites either at the motif or the domain level

individually. To the best of our knowledge, this is the first

study to simultaneously infer protein binding sites by combining

three levels of interactions, MM, MD and DD, on a genome

wide scale. It is helpful to address some differences between

our predicted results and those of other approaches.

Comparison with other prediction methods

Due to their importance in understanding protein–protein interactions, domain–domain interactions have been extensively investigated computationally. For example, many putative DD interactions have been obtained by the DPEA method,23 by the Bayesian network model and the MLE method,22 or by the PE method.24 All of these methods generate confidence measures of domain pairs. The DPEA method uses the value of Eij to assess the likelihood of highly specific domain interactions in multiple organisms. Lee et al. proposed a means to measure the expected number of each pair of domains from four species, and coupled this with integrated evidence from Gene Ontology annotation and domain fusion using a Bayesian approach. Guimaraes et al. derived a value called the ‘linear program (LP)-score’ using linear programming optimization for domain pairs from multiple organisms.

Fig. 3 Illustration of the effects of mutations on interaction site predictions. (A) Motif 1 is mutated to motif 10. (B) One amino acid in domain 3 is mutated. Proteins are depicted approximately to scale. Sites of motifs and domains are listed by their approximate positions. Interaction sites are indicated by arrows, and the extent of thickness indicates the strength of the interactions.

Here, we compared our inferred interacting domain pairs with their predictions. We extracted the common yeast domains shared by iPfam and each of the other prediction methods. Fig. 4A shows the number of DD interactions predicted by the four methods and the number of interactions confirmed by iPfam (Pfam release 21.0). Fig. 4B depicts the SNs of the four methods estimated by iPfam. As shown, Riley et al. predict the lowest number of putative DD interactions, with only 19 pairs involving 34 domains, eight of which are in common with iPfam. Except for the PPV of Riley et al. (42% vs. 30% for ours), all PPVs and SNs are smaller than ours. The SN of Riley et al. (47%) is lower than that of our method (77%). The PPV and the SN of Guimaraes et al. are 33% and 37%, respectively. The PPV and the SN of Lee et al. are 23% and 36%, respectively. Overall, this suggests that our method may perform better than the three methods at inferring domain–domain interactions for yeast.

Fig. 4 The number of interaction domains shared with iPfam (A) and the sensitivity (SN) (B) for the four methods.

Mapping predicted binding sites on the GSPs

When the predicted binding sites were mapped onto the GSPs,

we found that SN was 96%. Why do we miss the remaining

4% of the PPIs (548) in the GSPs and what do they look like?

The answer to these questions may be that there are more rare

domains and/or motifs occurring within these proteins. From

the 4% PPIs of the GSPs, we extracted 568 different proteins

containing 692 motifs and domains. Of these domains and

motifs, there are 325 domains and motifs occurring in only one

protein, and 102 domains and motifs occurring only in two

proteins, accounting for 62% of the total motifs and domains

(427/692). Because our method is ineffective in detecting rare

domain and/or motif interactions, due to the weakness of the

one-tailed Fisher’s exact test, these PPIs may not be mapped

by this methodology.

Inference performance is affected by promiscuous motifs

To examine whether the set of predicted binding site pairs are

mapped onto the evaluation datasets at a significantly higher

rate than the set of randomly chosen pairs, empirical p-values

for the SN and the PPV of each evaluation dataset were

calculated. Nevertheless, in the randomization process, we

found that the SNs of the MM level and the aggregate level

for the evaluation datasets (Table 4) are not significant. This

may be because those motifs occur in the majority of proteins

(promiscuous) and are involved in post-translational modifications.

To test this hypothesis, we excluded the 12 motifs (listed in Table 5)

with a high probability of occurrence during the randomization

process; all the SNs for the evaluation datasets then became

significant. This indicates that our method

could detect the pairs with frequencies that are significantly

higher in the GSPs than in GSNs, rather than the pairs present

in the GSPs and missing in the GSNs.
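The empirical p-value used in this comparison is the fraction of randomized pair sets whose SN (or PPV) is equal to or greater than that of the predicted set. A minimal sketch follows; the function name and the numbers in the example are made up for illustration and are not values from the study.

```python
def empirical_p_value(observed, randomized):
    """Fraction of randomized statistics that reach or exceed the observed one."""
    if not randomized:
        raise ValueError("at least one randomized value is required")
    return sum(1 for value in randomized if value >= observed) / len(randomized)


# Illustrative values only: one observed SN against five randomized SNs.
p_emp = empirical_p_value(0.79, [0.31, 0.80, 0.26, 0.79, 0.10])
```

With this definition, a promiscuous motif that appears in most proteins pushes the randomized SNs up towards the observed SN, which inflates the empirical p-value in the way described above.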

Some caveats

There are several limitations to our current approach: (1) We

only inferred binding sites of PPIs for the yeast genome, due to

a lack of high-quality non-interactions for other species.

(2) Obviously, the prediction power of our method is limited

by the quality and coverage of protein interactions and

non-interactions. (3) Our method is devised to find signifi-

cantly over-represented MM, MD and DD pairs in the

interaction proteins compared with the non-interaction

Table 4 Empirical p-values for the SNs and the PPVs within each evaluation dataset

Validation dataset InteractionsP-value including thepromiscuous motifs

P-value eliminating thepromiscuous motifs

MIPS MM interactions 0.166 0The aggregate level interactions 0.022 0

DIP CORE MM interactions 0.112 0The aggregate level interactions 0.018 0

The empirical p-value is defined as the ratio of the times that the PPV (or SN) of randomized interactions was equal to or greater than that of the

predicted interactions.

Table 5 Motifs with a high probability of occurrence

Motif | Pattern | Description
PS00001 | N-{P}-[ST]-{P} | N-glycosylation site
PS00004 | [RK](2)-x-[ST] | cAMP- and cGMP-dependent protein kinase phosphorylation site
PS00005 | [ST]-x-[RK] | Protein kinase C phosphorylation site
PS00006 | [ST]-x(2)-[DE] | Casein kinase II phosphorylation site
PS00007 | [RK]-x(2,3)-[DE]-x(2,3)-Y | Tyrosine kinase phosphorylation site
PS00008 | G-{EDRKHPFYW}-x(2)-[STAGCN]-{P} | N-myristoylation site
PS00009 | x-G-[RK]-[RK] | Amidation site
PS00016 | R-G-D | Cell attachment sequence
PS00017 | [AG]-x(4)-G-K-[ST] | ATP/GTP-binding site motif A (P-loop)
PS00029 | L-x(6)-L-x(6)-L-x(6)-L | Leucine zipper pattern
PS00294 | C-{DENQ}-[LIVM]-x> | Prenyl group binding site (CAAX box)
PS00342 | [STAGCN]-[RKH]-[LIVMAFY]> | Microbodies C-terminal targeting signal

The first column is the Prosite accession number of the motif. The second column is the pattern of the motif, and the third column is the description of the motif.


proteins and, as discussed before, it is unlikely to detect

interactions with rare motifs and domains. (4) Compared with

other methods that ignore the gaps between the domains, our

method used an aggregate level of interactions combining

MM, MD, and DD levels. Nevertheless, not all the motifs in

the Prosite dataset mediate protein interactions, and our

method does not deal with this case.

Materials and methods

Data sources

In this study, we used the S. cerevisiae genome and the related

protein interactions, non-interacting proteins, and domain and

motif information. Three independent public datasets, iPfam,

DIP CORE and MIPS, were used for performance assessment.

Yeast genome. We downloaded the yeast genome from the

Saccharomyces Genome Database (SGD, June 2006 version)

(http://yeastgenome.org/). It contains 5884 open reading

frames (ORFs), of which 5873 ORFs have the Swiss-Prot

accession numbers.

Positive interaction dataset. We constructed the GSPs for yeast proteins following the integration approach of Wu et al.40 We downloaded the

literature-curated (LC) and high-throughput (HTP) physical

protein–protein datasets, as previously reported,31 and

excluded all the intra-protein interactions. The LC dataset

was directly adopted. The HTP and the other five HTP

interaction datasets from BioGRID32 (version 2.0.33) were

filtered. The HTP interactions that satisfied at least one of the

following criteria were assembled into our interaction dataset:

(1) Both A–B and B–A interactions are recorded in the HTP

datasets. (2) The interaction is reported in more than one publication. (3) The interaction is derived from more than one experimental method. (4) The RSS values14,33 on

both biological processes (BP) and cellular components (CC)

ontologies are greater than or equal to 0.8. Using these

criteria, we obtained 13 531 interactions underlying 2854

proteins with at least one domain.
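The four inclusion criteria above can be sketched as a filter over HTP records. This is a minimal illustration and not the authors' pipeline: the record layout `(protein_a, protein_b, publication, method)`, the function name `filter_htp`, and the dictionaries `rss_bp` and `rss_cc` (keyed by unordered protein pair) are all assumptions made for the example.

```python
from collections import defaultdict


def filter_htp(records, rss_bp, rss_cc, threshold=0.8):
    """Keep HTP interactions that satisfy at least one of the four criteria.

    records: iterable of (protein_a, protein_b, publication, method) tuples
             (hypothetical layout chosen for this sketch).
    rss_bp, rss_cc: dicts mapping frozenset({a, b}) to an RSS value.
    """
    directions = defaultdict(set)    # orientations seen, e.g. (A, B) and (B, A)
    publications = defaultdict(set)  # distinct publications per pair
    methods = defaultdict(set)       # distinct experimental methods per pair

    for a, b, publication, method in records:
        if a == b:
            continue  # intra-protein interactions are excluded
        pair = frozenset((a, b))
        directions[pair].add((a, b))
        publications[pair].add(publication)
        methods[pair].add(method)

    kept = set()
    for pair in directions:
        reciprocal = len(directions[pair]) == 2        # criterion (1)
        multi_publication = len(publications[pair]) > 1  # criterion (2)
        multi_method = len(methods[pair]) > 1          # criterion (3)
        high_rss = (rss_bp.get(pair, 0.0) >= threshold   # criterion (4)
                    and rss_cc.get(pair, 0.0) >= threshold)
        if reciprocal or multi_publication or multi_method or high_rss:
            kept.add(pair)
    return kept
```

Pairs are stored as unordered `frozenset` objects so that A–B and B–A collapse to one interaction while the observed orientations are still tracked for criterion (1).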

Negative interaction dataset. The GSNs for yeast were obtained from the SPIDer database,14,33 in which the RSS values

on BP are within the range of [0, 0.4] and the RSS values on CC

are within the range of [0, 0.5]. Overall, we obtained 578 459

non-interactions from 3491 proteins with at least one domain.

Domain assignment. The protein–domain relationships for

yeast proteins were obtained from Pfam.41 Pfam is a collection

of protein domains and families, represented as multiple

sequence alignments and as profile hidden Markov models.

As of July 2008, Pfam 23.0 (http://pfam.sanger.ac.uk/) consists

of 10 340 families. The Swiss-Prot accession numbers indicate

that there are 4435 ORFs with at least one domain. In total,

2371 domains were assigned.

Motif assignment. Motif sequences were obtained from

the Prosite database.42 The Prosite database is a

collection of patterns. As of March 2009, Prosite 20.44

(http://www.expasy.ch/prosite/) consists of 1313 patterns.

We used the ScanProsite tool43 to scan protein sequences

against the Prosite database. There are 637 motifs assigned

within the 4435 selected ORFs.
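To illustrate what scanning a sequence against a Prosite pattern involves, the sketch below translates simple patterns of the kind listed in Table 5 into regular expressions. It is not ScanProsite and does not reproduce its options. The function names `prosite_to_regex` and `find_motif` are hypothetical, and only the basic syntax is handled (`x`, `[...]`, `{...}`, repeat counts, and `<`/`>` anchors at the start or end of an element).

```python
import re

# One pattern element: optional '<', a residue specification,
# an optional repeat such as (2) or (2,3), and an optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported Prosite element: {element!r}")
        n_anchor, core, low, high, c_anchor = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if n_anchor else "") + core + repeat
                     + ("$" if c_anchor else ""))
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() for m in regex.finditer(sequence)]
```

The lookahead in `find_motif` reports overlapping occurrences, which matters for short patterns that can match at adjacent positions.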

Evaluation datasets. Three datasets were used to validate our

results, including iPfam,35 the S. cerevisiae core subset of DIP

(DIP CORE)44 that has passed a quality assessment and

MIPS.45 Only the interactions for yeast were extracted. For

iPfam (Pfam release 21.0), we only picked the domain inter-

actions assigned to different proteins, and we obtained 129

interaction domains. We obtained 3251 interactions with

different proteins, called inter-protein interactions, from DIP

CORE (October 2008 version). We extracted the inter-protein

interactions from MIPS (May 2005 version), excluding the

two-hybrid interactions, and obtained 4311 interactions.

Predicting binding site pairs at motif–motif, motif–domain and

domain–domain levels

To detect MM, MD and DD pairs that are significantly

over-represented in GSPs compared with GSNs, the one-tailed

Fisher’s exact test was used. In addition, the q-value method was applied to control the false discovery rate (FDR) under multiple testing. For the inferred binding site pairs, the

independent datasets were used for validation.

One-tailed Fisher’s exact test for determining interaction

pairs that occur in a significant number of the GSPs. The

one-tailed Fisher’s exact test was used to determine the pairs

that occur more often in the GSPs than in the GSNs. A

two-by-two frequency table was constructed for each pair, in which the two rows represent the GSPs and GSNs, and the two columns represent the number of protein pairs that contain the given pair and the number that do not. P-values were calculated using the R Statistics package.
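The test on this two-by-two table can be written out directly from the hypergeometric distribution. The sketch below is a pure-Python illustration rather than the R call used in the study; the function name `fisher_one_tailed` and the cell labels `a`, `b`, `c`, `d` are assumptions introduced here.

```python
from math import comb


def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher's exact test for over-representation in the GSPs.

    Table layout (assumed for this sketch):
                 contains pair   lacks pair
        GSPs          a              b
        GSNs          c              d

    Returns P(X >= a), where X is the number of GSP protein pairs that
    contain the site pair under the hypergeometric null distribution.
    """
    n_gsp = a + b               # protein pairs in the GSPs
    n_with = a + c              # protein pairs containing the site pair
    n_total = a + b + c + d
    upper = min(n_gsp, n_with)  # largest possible value of X

    # Sum hypergeometric probabilities over the upper tail.
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms drop out automatically.
    tail = sum(
        comb(n_with, k) * comb(n_total - n_with, n_gsp - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, n_gsp)
```

Exact integer arithmetic keeps the sketch free of rounding issues until the final division. For tables as large as the full GSP and GSN sets, a statistics library would likely be faster.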

Multiple testing. Using the QVALUE package for R, written by Dabney and Storey,46 we obtained a q-value for each MM, MD and DD pair. The p-value is a measure of significance in terms of the type I error rate.46 The q-value is similar to the p-value, but the q-value is a measure of significance in terms of the FDR.47 The QVALUE software (http://www.genomics.princeton.edu/storeylab/qvalue/linux.html) was used to adjust the p-values to their corresponding q-values.
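As a rough illustration of converting p-values to FDR-based values, the sketch below applies the Benjamini–Hochberg step-up adjustment. This is a deliberately simpler stand-in for the QVALUE software: QVALUE additionally estimates the proportion of true null hypotheses, whereas this sketch effectively fixes that proportion at 1 and is therefore more conservative. The function name `bh_adjust` is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

A pair would then be called significant when its adjusted value falls below the chosen FDR threshold.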

Evaluating the predicted interaction pairs. To validate the

quality of the binding site pairs, we used two measures: the

fraction of interacting proteins containing the predicted

binding site pairs in the validation dataset, i.e. the sensitivity

(SN), and the fraction of predicted binding site pairs mapped

on the validation dataset, i.e. the positive predictive value

(PPV). The sensitivity of a test is used to measure the ability to

detect true positives. We calculated the SN and PPV of our

inferred results for each evaluation dataset.
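One possible reading of these two measures is sketched below. It assumes that a validation interaction "contains" a predicted binding site pair when one protein carries one member of the pair and its partner carries the other. The names `site_pairs_between`, `sensitivity_and_ppv` and `annotations` are hypothetical and introduced only for this example.

```python
from itertools import product


def site_pairs_between(features_a, features_b):
    """All unordered domain/motif pairs formed across two proteins."""
    return {frozenset((x, y)) for x, y in product(features_a, features_b)}


def sensitivity_and_ppv(predicted_pairs, interactions, annotations):
    """Compute (SN, PPV) of predicted site pairs against a validation set.

    predicted_pairs: set of frozensets, each an unordered site pair.
    interactions: list of (protein_a, protein_b) validation interactions.
    annotations: dict mapping protein -> set of its domains and motifs.
    """
    covered = 0        # validation interactions explained by some pair
    supported = set()  # predicted pairs seen in some validation interaction
    for a, b in interactions:
        candidates = site_pairs_between(annotations.get(a, ()),
                                        annotations.get(b, ()))
        hits = predicted_pairs & candidates
        if hits:
            covered += 1
            supported |= hits
    sn = covered / len(interactions) if interactions else 0.0
    ppv = len(supported) / len(predicted_pairs) if predicted_pairs else 0.0
    return sn, ppv
```

Under this reading, SN is computed over validation interactions and PPV over predicted pairs, so the two values have different denominators.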

A reasonable prediction approach should contain more

inferred binding site pairs mapped on the evaluation datasets

than expected at random. To verify this, we calculated the

PPV and the SN of randomized pairs for each evaluation

dataset. First, we extracted domains and motifs from an

evaluation dataset and grouped them into an object set. Then,

we randomly extracted N pairs from the object set and

grouped them into a randomized binding site pair set, where


N is the number of predicted pairs from the object set. Finally,

we calculated the SN and PPV of the randomized pair set for

the evaluation dataset. This randomization process was

repeated 1000 times, and the average of PPV (or SN) from

1000 replicates was obtained. The same simulations also yield an empirical p-value for each evaluation dataset: each of the 1000 randomizations gives one PPV (or SN), and the p-value is defined as the fraction of these PPVs (or SNs) that are equal to or greater than the observed PPV (or SN).
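The randomization procedure can be sketched as follows. The details are assumptions made for illustration: pairs are drawn without replacement from all unordered pairs of the object set (self-pairs included), and the scoring function is supplied by the caller. The names `random_pair_set`, `empirical_p_value` and `score_pair_set` are hypothetical.

```python
import random
from itertools import combinations_with_replacement


def random_pair_set(objects, n_pairs, rng):
    """Draw n_pairs distinct unordered pairs from the object set."""
    all_pairs = list(combinations_with_replacement(sorted(objects), 2))
    return {frozenset(pair) for pair in rng.sample(all_pairs, n_pairs)}


def empirical_p_value(observed, score_pair_set, objects, n_pairs,
                      n_rounds=1000, seed=0):
    """Return (empirical p-value, mean randomized score).

    observed: PPV (or SN) of the predicted binding site pairs.
    score_pair_set: function mapping a set of pairs to its PPV (or SN).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    null_scores = [
        score_pair_set(random_pair_set(objects, n_pairs, rng))
        for _ in range(n_rounds)
    ]
    # Fraction of randomized scores equal to or greater than the observed one.
    hits = sum(1 for score in null_scores if score >= observed)
    return hits / n_rounds, sum(null_scores) / n_rounds
```

With 1000 rounds the smallest non-zero p-value this procedure can report is 0.001, so a reported value of 0 means that no randomized set reached the observed score.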

Conclusions

It is known that proteins containing predicted binding sites

may not actually interact with each other. Although our

results cannot be used to predict PPIs directly, we believe that

in combination with other evidence, such as protein profiles

and protein localization, our results should help in predicting

putative PPIs, in particular in other species. More interestingly

and importantly, our results could give biologists clues to

assay potential protein interactions via mutation experiments.

However, we should note that our method is ineffective for

detecting interactions with rare motifs and/or domains. This

caveat may be mitigated with DPEA, developed by Riley et al.,23 or be tackled by new methods proposed in the future. We will also update and extend our previous online

database, SPIDer,33 to present the useful protein binding site

information. In summary, all of these tools should provide a

better understanding of protein–protein interaction mechanisms, which is important for insights into protein functions,

diseases and genomic evolution.

Abbreviations

DD domain–domain

MD motif–domain

FDR false discovery rate

GSNs gold standard negative set

GSPs gold standard positive set

MM motif–motif

PPIs protein–protein interactions

PPV positive predictive value

SN sensitivity

Acknowledgements

We thank two anonymous reviewers for their constructive

comments. This work was supported by State Key Laboratory

of Earth Surface Processes and Resource Ecology and by

NSFC (Grant 30770445).

References

1 O. Keskin, N. Tuncbag and A. Gursoy, Curr. Pharm. Biotechnol., 2008, 9, 67–76.

2 N. Tuncbag, G. Kar, O. Keskin, A. Gursoy and R. Nussinov, Briefings Bioinf., 2009, 10, 217–232.

3 M. G. Kann, Briefings Bioinf., 2007, 8, 333–346.

4 P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. J. Yang, M. Johnston, S. Fields and J. M. Rothberg, Nature, 2000, 403, 623–627.

5 T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori and Y. Sakaki, Proc. Natl. Acad. Sci. U. S. A., 2001, 98, 4569–4574.

6 Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Y. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A. R. Willems, H. Sassi, P. A. Nielsen, K. J. Rasmussen, J. R. Andersen, L. E. Johansen, L. H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B. D. Sorensen, J. Matthiesen, R. C. Hendrickson, F. Gleeson, T. Pawson, M. F. Moran, D. Durocher, M. Mann, C. W. V. Hogue, D. Figeys and M. Tyers, Nature, 2002, 415, 180–183.

7 G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann and B. Seraphin, Nat. Biotechnol., 1999, 17, 1030–1032.

8 O. Puig, F. Caspary, G. Rigaut, B. Rutz, E. Bouveret, E. Bragado-Nilsson, M. Wilm and B. Seraphin, Methods, 2001, 24, 218–229.

9 K. Tarassov, V. Messier, C. R. Landry, S. Radinovic, M. M. S. Molina, I. Shames, Y. Malitskaya, J. Vogel, H. Bussey and S. W. Michnick, Science, 2008, 320, 1465–1470.

10 T. Dandekar, B. Snel, M. Huynen and P. Bork, Trends Biochem. Sci., 1998, 23, 324–328.

11 A. J. Enright, I. Iliopoulos, N. C. Kyrpides and C. A. Ouzounis, Nature, 1999, 402, 86–90.

12 C. S. Goh, A. A. Bogan, M. Joachimiak, D. Walther and F. E. Cohen, J. Mol. Biol., 2000, 299, 283–293.

13 E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates and D. Eisenberg, Science, 1999, 285, 751–753.

14 X. M. Wu, L. Zhu, J. Guo, D. Y. Zhang and K. Lin, Nucleic Acids Res., 2006, 34, 2137–2150.

15 C. M. Deane, L. Salwinski, I. Xenarios and D. Eisenberg, Mol. Cell. Proteomics, 2002, 1, 349–356.

16 E. Sprinzak, S. Sattath and H. Margalit, J. Mol. Biol., 2003, 327, 919–923.

17 T. Pawson and P. Nash, Science, 2003, 300, 445–452.

18 P. Puntervoll, R. Linding, C. Gemund, S. Chabanis-Davidson, M. Mattingsdal, S. Cameron, D. M. A. Martin, G. Ausiello, B. Brannetti, A. Costantini, F. Ferre, V. Maselli, A. Via, G. Cesareni, F. Diella, G. Superti-Furga, L. Wyrwicz, C. Ramu, C. McGuigan, R. Gudavalli, I. Letunic, P. Bork, L. Rychlewski, B. Kuster, M. Helmer-Citterich, W. N. Hunter, R. Aasland and T. J. Gibson, Nucleic Acids Res., 2003, 31, 3625–3630.

19 E. Sprinzak and H. Margalit, J. Mol. Biol., 2001, 311, 681–692.

20 S. K. Ng, Z. Zhang and S. H. Tan, Bioinformatics, 2003, 19, 923–929.

21 M. H. Deng, S. Mehta, F. Z. Sun and T. Chen, Genome Res., 2002, 12, 1540–1548.

22 H. Lee, M. H. Deng, F. Z. Sun and T. Chen, BMC Bioinf., 2006, 7, 269.

23 R. Riley, C. Lee, C. Sabatti and D. Eisenberg, Genome Biol., 2005, 6, R89.

24 K. S. Guimaraes, R. Jothi, E. Zotenko and T. M. Przytycka, Genome Biol., 2006, 7, R104.

25 T. M. W. Nye, C. Berzuini, W. R. Gilks, M. M. Babu and S. A. Teichmann, Bioinformatics, 2005, 21, 993–1001.

26 M. Iqbal, A. A. Freitas, C. G. Johnson and M. Vergassola, Bioinformatics, 2008, 24, 2064–2070.

27 H. Q. Li and J. Y. Li, Bioinformatics, 2005, 21, 314–324.

28 S. H. Tan, W. Hugo, W. K. Sung and S. K. Ng, BMC Bioinf., 2006, 7, 502.

29 J. Guo, X. M. Wu, D. Y. Zhang and K. Lin, Nucleic Acids Res., 2008, 36, 2002–2011.

30 H. D. Wang, E. Segal, A. Ben-Hur, Q. R. Li, M. Vidal and D. Koller, Genome Biol., 2007, 8, R192.

31 T. Reguly, A. Breitkreutz, L. Boucher, B.-J. Breitkreutz, G. C. Hon, C. L. Myers, A. Parsons, H. Friesen, R. Oughtred, A. Tong, C. Stark, Y. Ho, D. Botstein, B. Andrews, C. Boone, O. G. Troyanskaya, T. Ideker, K. Dolinski, N. N. Batada and M. Tyers, J. Biol., 2006, 5, 11.

32 C. Stark, B. J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz and M. Tyers, Nucleic Acids Res., 2006, 34, D535–D539.

33 X. M. Wu, L. Zhu, J. Guo, C. Fu, H. J. Zhou, D. Dong, Z. B. Li, D. Y. Zhang and K. Lin, BMC Bioinf., 2006, 7, S16.


34 B. J. Mayer, J. Cell Sci., 2001, 114, 1253–1263.

35 R. D. Finn, M. Marshall and A. Bateman, Bioinformatics, 2005, 21, 410–412.

36 B. Schwikowski, P. Uetz and S. Fields, Nat. Biotechnol., 2000, 18, 1257–1261.

37 C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields and P. Bork, Nature, 2002, 417, 399–403.

38 N. A. P. Mouchel and J. R. Jenkins, FEBS Lett., 2006, 580, 51–57.

39 B. Nelson, A. B. Parsons, M. Evangelista, K. Schaefer, K. Kennedy, S. Ritchie, T. L. Petryshen and C. Boone, Genetics, 2004, 166, 67–77.

40 X. M. Wu, J. Guo, D. Y. Zhang and K. Lin, Proteomics, 2009, 9, 4812–4824.

41 R. D. Finn, J. Tate, J. Mistry, P. C. Coggill, S. J. Sammut, H. R. Hotz, G. Ceric, K. Forslund, S. R. Eddy, E. L. L. Sonnhammer and A. Bateman, Nucleic Acids Res., 2008, 36, D281–D288.

42 N. Hulo, A. Bairoch, V. Bulliard, L. Cerutti, E. De Castro, P. S. Langendijk-Genevaux, M. Pagni and C. J. A. Sigrist, Nucleic Acids Res., 2006, 34, D227–D230.

43 A. Gattiker, E. Gasteiger and A. Bairoch, Appl. Bioinf., 2002, 1, 107–108.

44 I. Xenarios, L. Salwinski, X. Q. J. Duan, P. Higney, S. M. Kim and D. Eisenberg, Nucleic Acids Res., 2002, 30, 303–305.

45 U. Guldener, M. Munsterkotter, M. Oesterheld, P. Pagel, A. Ruepp, H. W. Mewes and V. Stumpflen, Nucleic Acids Res., 2006, 34, D436–D441.

46 J. D. Storey, J. R. Stat. Soc., Ser. B: Stat. Methodol., 2002, 64, 479–498.

47 J. D. Storey and R. Tibshirani, Proc. Natl. Acad. Sci. U. S. A., 2003, 100, 9440–9445.
