Mol. BioSyst., 2010, 6, 2164–2173. This journal is © The Royal Society of Chemistry 2010
Yeast protein–protein interaction binding sites: prediction from
the motif–motif, motif–domain and domain–domain levels
Erli Pang and Kui Lin*
Received 7th June 2010, Accepted 16th July 2010
DOI: 10.1039/c0mb00038h
Interacting proteins can contact each other at three different levels: by a domain binding to
another domain, by a domain binding to a short protein motif, or by a motif binding to another
motif. In our previous work, we proposed an approach to predict motif–motif binding sites
for the yeast interactome by contrasting high-quality positive interactions and high-quality
non-interactions using a simple statistical analysis. Here, we extend this idea to more
comprehensively infer binding sites, including domain–domain, domain–motif, and motif–motif
interactions. In this study, we integrated 2854 yeast proteins that undergo 13 531 high-quality
interactions and 3491 yeast proteins undergoing 578 459 high-quality non-interactions. Overall,
we found 6315 significant binding site pairs involving 2371 domains and 637 motifs. Benchmarked
against the iPfam, DIP CORE, and MIPS datasets, our inferred results are reliable. Interestingly, some of
our predicted binding site pairs may, at least in the yeast genome, guide researchers to assay
novel protein–protein interactions by mutagenesis or other experiments. Our work demonstrates
that by inferring significant protein–protein binding sites at an aggregate level combining
domain–domain, domain–motif and motif–motif levels based on high-quality positive and
negative datasets, this method may be capable of identifying the binding site pairs that mediate
protein–protein interactions.
Introduction
Proteins usually carry out their functions in a cell by interacting
with other molecules such as DNAs, RNAs, and other
proteins. In protein–protein interactions (PPIs), two poly-
peptides interact with one another through a subset of residues
known as a binding site.1,2 Thus, understanding these binding
sites may provide an insight into the molecular and bio-
chemical mechanisms of protein interactions. For instance,
mutations in the binding sites may induce aberrant inter-
actions and lead to abnormal cellular behavior and diseases.3
In the past few years, various high-throughput experimental
methods such as the yeast two-hybrid method (Y2H),4,5 mass
spectrometry (MS),6 tandem affinity purification (TAP),7,8 and
protein-fragment complementation assay (PCA)9 have been
proposed to identify PPIs. In addition, various kinds of
computational methods10–14 have also been developed to
complement the experimental techniques and have produced
massive numbers of putative PPIs. To assess the reliability of
the potential PPIs indicated either experimentally or compu-
tationally, different measurement indices have been proposed,
such as the expression profile reliability (EPR),15 paralogous
verification method (PVM),15 and protein localization method
(PLM).16 The development of these technologies and
methods, which generate a considerable number of reliable
and diverse PPIs, enables genome-wide studies of
binding sites mediating protein–protein interactions.
Interacting proteins more often contact each other by a
domain binding to another domain, by a domain binding to a
short protein motif, or by a motif binding to another
motif.17,18 Currently, at the level of domain–domain (DD)
interactions, many inference methods have been proposed.
For example, the association method19,20 was developed to
define difference measures to identify domain pairs that are
observed in interacting proteins more frequently than expected
by chance or by non-interacting proteins. Due to the high
number of false positives and false negatives produced by
the yeast two-hybrid approach, Deng et al.21 proposed a
maximum likelihood estimation (MLE) method that allows
the treatment of these errors. Lee et al.22 extended this idea
and used a Bayesian method to integrate evidence and
construct a high-confidence domain interaction set. To detect
low-propensity, high-specificity domain interactions, Riley et al.23
proposed the domain pair exclusion analysis (DPEA) method.
Other approaches include the parsimonious explanation
method (PE) using linear programming optimization,24 the
p-value method,25 and message-passing algorithms.26 At the
same time, there are also algorithms devised to infer binding
motif pairs.27–30 Most, if not all, of these
methods infer putative PPI binding sites at either the domain
or the motif level, but not both. Moreover, with the exception of
our previous study,29 these methods usually use a positive PPI
dataset that is carefully assembled and a negative dataset that
is randomly sampled.
Obviously, with a comprehensive and high-quality PPI
network for one genome, it is more reasonable to delineate
PPI binding sites from an aggregate level that combines
motif–motif (MM), motif–domain (MD), and domain–domain
State Key Laboratory of Earth Surface Processes and Resource Ecology and College of Life Sciences, Beijing Normal University, Beijing 100875, China. E-mail: [email protected]; Fax: +86-10-58807721; Tel: +86-10-58805045
PAPER www.rsc.org/molecularbiosystems | Molecular BioSystems
Published on 17 August 2010. Downloaded by Monash University on 25/10/2014 13:32:39.
(DD) levels. To this end, we extended our previous computa-
tional idea29 and attempted to develop an integrative frame-
work for PPI binding sites in the yeast genome. We integrated
reliable, diverse and non-redundant experimental PPIs from
extant studies31 and BioGRID32 into a high-quality positive PPI
dataset, which we call the gold standard positive set
(GSPs). The negative PPIs were obtained with our previous
computational method;14,33 this set, which we call the gold standard
negative set (GSNs), has been shown to be more reliable than
other negative interaction sets. To identify the MM, MD and
DD pairs that are statistically overrepresented in the GSPs
compared with the GSNs, the one-tailed Fisher’s exact test
was applied. A multiple testing technique was also adopted to
minimize the false discovery rate (FDR). To assess the
performance of our predicted results, two different types of
measurements were used. The first one, the sensitivity (SN),
attempts to find the enrichment of our predicted binding sites
in the validation datasets. The second one, the positive
predictive value (PPV), focuses on how many predicted
binding sites are correct based on a given PPI dataset. Our
results show satisfactory performance, as assessed by the SN
and PPV measures. At the same time, we have also compared
our method with three other approaches at the DD interaction
level, and we have found that our algorithm infers many more
reliable interactions than others. Taken together, these data
demonstrate that, by simultaneously inferring significant
protein–protein binding sites from the aggregate level based
on high-quality positive and negative datasets, our simple
method can efficiently identify significant and comprehensive
binding site pairs that mediate PPIs. These inferred binding
site pairs at the aggregate level should help in
understanding the mechanisms of protein–protein interactions
and offer insight into protein functions, diseases and
genomic evolution.34
Results
Our approach involves four steps: (1) generating GSPs and
GSNs datasets; (2) annotating the protein sequences with
domains and motifs; (3) deriving two-by-two tables of MM,
MD and DD pairs; (4) detecting significantly over-represented
pairs by carefully controlling the false discovery rate (FDR).
In this section, we will describe the results of this approach by
applying it to the GSPs and the GSNs we derived.
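The counting behind steps (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy proteins and annotations, and the helper names `site_pairs`, `level`, `count_site_pairs` and `two_by_two`, are hypothetical. It also assumes that each protein pair contributes at most one count to any given binding site pair, and that Pfam and Prosite accessions can be told apart by their `PF` and `PS` prefixes.

```python
from collections import Counter
from itertools import product

# Hypothetical toy annotations: protein -> set of Pfam (PF...) and Prosite (PS...) accessions.
ANNOT = {
    "P1": {"PF00018", "PS00017"},
    "P2": {"PF00071"},
    "P3": {"PF00018", "PF00071", "PS00029"},
}
GSP = [("P1", "P2"), ("P2", "P3")]  # toy interacting protein pairs
GSN = [("P1", "P3")]                # toy non-interacting protein pairs


def site_pairs(sites_a, sites_b):
    """Unordered site pairs implied by one protein pair (each counted once)."""
    return {tuple(sorted((x, y))) for x, y in product(sites_a, sites_b)}


def level(pair):
    """Classify a site pair as MM, MD or DD from the accession prefixes."""
    kinds = "".join(sorted("D" if s.startswith("PF") else "M" for s in pair))
    return {"DD": "DD", "DM": "MD", "MM": "MM"}[kinds]


def count_site_pairs(protein_pairs, annot):
    """Number of protein pairs in which each site pair occurs."""
    counts = Counter()
    for a, b in protein_pairs:
        counts.update(site_pairs(annot[a], annot[b]))
    return counts


def two_by_two(pair, pos_counts, n_pos, neg_counts, n_neg):
    """Rows: GSPs, GSNs. Columns: protein pairs with / without the site pair."""
    a, c = pos_counts[pair], neg_counts[pair]
    return [[a, n_pos - a], [c, n_neg - c]]


pos = count_site_pairs(GSP, ANNOT)
neg = count_site_pairs(GSN, ANNOT)
table = two_by_two(("PF00018", "PF00071"), pos, len(GSP), neg, len(GSN))
```

Each resulting table is the input to the significance test of step (4).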
Significantly over-represented motif–motif, motif–domain, and
domain–domain pairs
Within the yeast genome we studied, the GSPs consist of
13 531 interactions that involve 2854 proteins and the GSNs
comprise 578 459 non-interactions from 3491 proteins. There
are 4435 non-redundant proteins in total. Each of these
non-redundant proteins was annotated at the domain level
using Pfam (Pfam release 23.0) and at the motif level using
Prosite (version 20.44). Through this annotation process, 2371
(PfamA) domains and 637 motifs were assigned, respectively.
To detect MM, MD and DD pairs that were significantly
over-represented in the GSPs compared with the GSNs, the
one-tailed Fisher’s exact test was applied, where the p-value
represents a measure of significance in terms of the false
positive rate. The q-value method was employed to identify
as many significantly over-represented pairs as possible
(see Materials and methods) while incurring a relatively low
FDR. In total, we obtained 6315 significant binding site pairs
with q-values less than 0.01, including 1124 MM, 3856 MD,
and 1335 DD interactions (Table 1).
Validation of the inferred domain–domain interactions
The database of iPfam consists of experimental DD inter-
actions derived from the protein structures in the Protein Data
Bank (PDB).35 Due to its high reliability, we used iPfam
within Pfam (Pfam release 21.0) to assess the quality of our
inferred DD interactions. Only DD interactions assigned to
different proteins for yeast were used. Thus, we obtained
129 interactions involving 90 domains within iPfam. For these
90 domains, we inferred 191 interactions, of which 57 overlapped
with iPfam interactions. More interestingly, we found
11 interactions (listed in Table 2) that are curated in iPfam but
derive either from proteins of other organisms or from
intra-protein interactions within yeast.
Proteins interact while participating in the same cellular
process, i.e. sharing a common function. This is usually used
to assess the reliability of PPIs.36,37 Similarly, we expected that
two components in a DD interaction pair might share a
common Gene Ontology (GO) annotation. We downloaded
the GO annotations of domains from Pfam23.0. For the 123
domain interactions without iPfam validation, three of them
have only one domain with GO annotation, and 108 have
either common molecular function or biological process GO
annotations. For the remaining 13 domain interactions, we
investigated their GO structures and found that six of them
have annotations of one domain that is the parent of the other
Table 1 The number of inferred binding site pairs based on the GSPs and the GSNs
Type                Number                         Binding pairs
Motif               637                            1124 (MM)
Motif and domain    637 motifs and 2371 domains    3856 (MD)
Domain              2371                           1335 (DD)
Aggregate levels    3008                           6315 (MM/MD/DD)
The first column indicates the type of the binding sites. The second
column is the number of corresponding types, and the third column is
the number of binding site pairs at the corresponding level.
Table 2 Interactions confirmed by iPfam from other organisms or from intra-protein interactions for yeast
PfamA accession    PfamA accession    Species
PF00071            PF03810            Homo sapiens
PF00071            PF00071            Mus musculus
PF00018            PF00018            Rattus norvegicus
PF01000            PF04983            Thermus aquaticus
PF00626            PF04815            S. cerevisiae
PF04811            PF04815            S. cerevisiae
PF01192            PF04983            T. aquaticus
PF00626            PF04811            S. cerevisiae
PF02463            PF02463            Pyrococcus furiosus
PF00626            PF00626            Oryctolagus cuniculus
PF04563            PF04998            P. furiosus
domain. We also found one domain interaction whose
members have a common parent, as illustrated in Fig. 1. Thus,
all of these validations may indicate that, at the DD level, our
results are reliable and provide high coverage because most of
the predicted domain interactions are confirmed either by
iPfam, or by GO annotations. This suggests that our results
may include more reliable DD interactions than other
approaches, such as iPfam. These newly inferred interactions
may provide biologists with testable hypotheses about novel
PPIs in yeast.
More comprehensive binding sites from the viewpoint
of protein–protein interactions
Due to the very low coverage of MM or MD experimental
binding site data, a systematic direct validation for MM and
MD interactions is not possible as it is with DD interactions.
Here, we performed validation for these two types of interactions
at the PPI level using the MIPS and DIP CORE datasets. For
MIPS, we excluded PPIs detected by two-hybrid assays and obtained
4311 protein interactions. These interactions consist of 1686
proteins involving 1570 motifs and domains. For the DIP CORE
dataset, there are 3251 interactions comprised of 1906 proteins
involving 1668 motifs and domains.
The SN and the PPV were calculated for each validation
dataset. We also compared their SN and PPV values with
those of the randomized binding site pair sets. Table 3 lists the
validation results benchmarked on the two datasets. The SNs
and PPVs of the MD interactions are lower
than those of the MM interactions. The reason
may be that motifs are short and may occur by chance in more
proteins than domains do. To examine whether this is the case, we
also mapped the DD interactions onto the PPIs and calculated
the SNs and the PPVs (for MIPS, the SN and the PPV are
0.139 and 0.371, respectively; for DIP CORE, the SN and the
PPV are 0.337 and 0.606, respectively). Although both SNs
and PPVs are significantly higher than expected at random,
they are much lower than those of the MM and MD interactions.
Although the SNs and the PPVs of the predicted interactions are much higher than
those of the sets of randomly chosen interaction pairs, the SNs
of MM interactions are not statistically significant. An
Fig. 1 Distribution of predicted domain–domain interactions. The pie chart represents the number of predicted domain–domain interactions and
the domains that are shared with iPfam.
Table 3 The sensitivities (SNs) and the positive predictive values (PPVs) of inferred and randomized binding site pairs
Validation dataset    Interactions                   SN       PPV
MIPS                  Predicted MM interactions      0.790    0.813
MIPS                  Randomized MM interactions     0.306    0.053
MIPS                  Predicted MD interactions      0.608    0.763
MIPS                  Randomized MD interactions     0.101    0.030
DIP CORE              Predicted MM interactions      0.797    0.861
DIP CORE              Randomized MM interactions     0.265    0.048
DIP CORE              Predicted MD interactions      0.753    0.847
DIP CORE              Randomized MD interactions     0.103    0.026
The first column is the validation dataset; the second column is the
type of interactions; the third column is the value of sensitivity (SN)
calculated as TP/(TP + FN); and the fourth column is the positive
predictive value (PPV) calculated as TP/(TP + FP), where TP
represents true positives, FN represents false negatives and FP
represents false positives, each estimated with respect to the given validation dataset.
Here SN was defined as the fraction of protein interactions containing
the predicted binding site pairs in the validation dataset, and PPV was
defined as the fraction of predicted pairs mapped on the validation
dataset.
explanation may be that promiscuous motifs occur in a
majority of proteins. Indeed, after removal of the promiscuous
motifs, the SNs of MM interactions are significant for both
datasets (see Discussion).
Intuitively, the coverage of the binding sites at the aggregate
level must be higher than that at each individual level. In
addition, we expect that the PPV of the aggregate results
should not decrease much; a large decrease would suggest that
over-prediction has occurred. Fig. 2 shows that the SNs of
the aggregate level are vastly improved without greatly
affecting the PPVs. This demonstrates that the predicted
binding sites at the aggregate level give a more comprehensive
characterization of protein–protein interactions than does any
individual level.
Fig. 2 The sensitivity (SN) defined as TP/(TP + FN) (A) and the positive predictive value (PPV) (B) defined as TP/(TP + FP) estimated by DIP
CORE and MIPS. MM, MD and DD along the x-axis refer to motif–motif interactions, motif–domain interactions, and domain–domain interactions,
respectively. The three levels refer to the interactions of the combined motif–motif, motif–domain and domain–domain levels.
Understanding the effects of mutations
If a mutation occurs within a protein binding site, the corresponding
interaction may be disrupted or its binding affinity
may decrease, and such a mutation event would affect the
normal biological processes of the cell. Our predictions of MD
interactions may suggest possible mechanisms for such
aberrant interactions.
For example, the predicted motif PS00006 (position 7–10) of
TOP2_YEAST (Uniprot:P06786) is involved in binding to the
domain PF00130 of KPC1_YEAST (Uniprot:P24583). Interestingly,
Mouchel and Jenkins38 used a yeast two-hybrid
protein interaction screen and identified an interaction
between the catalytic domain of the yeast protein kinase C
enzyme (Pkc1) and amino acids 3–37 of the Saccharomyces
cerevisiae topoisomerase II. When they mutated amino
acids 9 and 10 in the N-terminal region of yeast topoisomerase II, they
could no longer detect the interaction. As described in
Fig. 3A, this is consistent with our prediction that a mutation
in the motif PS00006 (position 7–10) might disrupt the interaction.
Another example is that the predicted domain PF07653
(position 304–359) on SSU81_YEAST (Uniprot:P40073) may
be involved in binding the motif PS00007 (position 207–214) on
FUS1_YEAST. Nelson et al.39 identified the interaction
through domain SH3_2 on SSU81_YEAST and a proline-rich
peptide ligand on the Fus1p COOH-terminal cytoplasmic
region (95–512) in a two-hybrid screen. When they mutated
amino acid 352 of SSU81_YEAST, the interaction was
suppressed. This experiment further supports our prediction
results of the MD interaction (Fig. 3B).
Discussion
Proteins interact and make contact with each other at binding
sites. Previous related methods have been developed to detect
protein binding sites either at the motif or the domain level
individually. To the best of our knowledge, this is the first
study to simultaneously infer protein binding sites by combining
three levels of interactions, MM, MD and DD, on a genome-wide
scale. It is helpful to address some differences between
our predicted results and those of other approaches.
Comparison with other prediction methods
Due to their importance in understanding protein–protein
interactions, domain–domain interactions have been extensively
computationally investigated. For example, many putative
Fig. 3 Illustration of the effects of mutations on interaction site predictions. (A) Motif 1 is mutated to motif 10. (B) One amino acid in domain 3 is
mutated. Proteins are depicted approximately to scale. Sites of motifs and domains are listed by their approximate positions. Interaction sites are
indicated by arrows, and the extent of thickness indicates the strength of the interactions.
DD interactions have been obtained by the DPEA method,23
by the Bayesian network model and the MLE method,22 or by
the PE method.24 All of these methods generate confidence
measures for domain pairs. The DPEA method uses the value
of Eij to assess the likelihood of highly specific domain
interactions in multiple organisms. Lee et al. proposed a means to
measure the expected number of each pair of domains from
four species, and coupled this with integrated evidence from Gene
Ontology annotation and domain fusion using a Bayesian
approach. Guimaraes et al. derived a value called the ‘linear
program (LP)-score’ using linear programming optimization
for domain pairs from multiple organisms.
Here, we compared our inferred interacting domain pairs
with their predictions. We extracted the common yeast
Fig. 4 The number of interaction domains shared with iPfam (A) and the sensitivity (SN) (B) for the four methods.
domains shared by iPfam and each of the other prediction
methods. Fig. 4A shows the number of DD
interactions predicted by the four methods and the number
of interactions confirmed by iPfam (Pfam release 21.0),
respectively. Fig. 4B depicts the SNs of the four methods
estimated by iPfam. As shown, Riley et al. predict the lowest
number of putative DD interactions, with only 19 pairs
involving 34 domains, eight of which are in common with iPfam.
Except for the PPV of Riley et al. (42% vs. 30% for ours), all
PPVs and SNs are lower than ours. The SN of
Riley et al. (47%) is lower than that of our method (77%). The
PPV and the SN of Guimaraes et al. are 33% and 37%,
respectively. The PPV and the SN of Lee et al. are 23% and
36%, respectively. Overall, this suggests that our method may
perform better than the three other methods at inferring
domain–domain interactions in yeast.
Mapping predicted binding sites on the GSPs
When the predicted binding sites were mapped onto the GSPs,
we found that the SN was 96%. Why do we miss the remaining
4% of the PPIs (548) in the GSPs, and what do they look like?
A likely answer is that the proteins involved contain more rare
domains and/or motifs. From
the 4% of PPIs in the GSPs, we extracted 568 different proteins
containing 692 motifs and domains. Of these domains and
motifs, 325 occur in only one
protein and 102 occur in only two
proteins, together accounting for 62% of the total motifs and domains
(427/692). Because our method is ineffective in detecting rare
domain and/or motif interactions, owing to the weakness of the
one-tailed Fisher’s exact test, these PPIs may not be mapped
by this methodology.
Inference performance is affected by promiscuous motifs
To examine whether the set of predicted binding site pairs is
mapped onto the evaluation datasets at a significantly higher
rate than sets of randomly chosen pairs, empirical p-values
for the SN and the PPV of each evaluation dataset were
calculated. In the randomization process, we
found that the SNs of the MM level and the aggregate level
for the evaluation datasets (Table 4) are not significant. This
may be because some motifs occur in the majority of proteins
(promiscuous) and are involved in post-translational modifications.
To test this hypothesis, we excluded the 12 motifs
(listed in Table 5) with a high probability of occurrence during
the randomization process; all the SNs for the evaluation
datasets then became significant. This indicates that our method
detects pairs whose frequencies are significantly
higher in the GSPs than in the GSNs, rather than pairs that are merely present
in the GSPs and missing in the GSNs.
Some caveats
There are several limitations to our current approach: (1) We
only inferred binding sites of PPIs for the yeast genome, due to
a lack of high-quality non-interactions for other species.
(2) Obviously, the prediction power of our method is limited
by the quality and coverage of protein interactions and
non-interactions. (3) Our method is devised to find significantly
over-represented MM, MD and DD pairs in the
interacting proteins compared with the non-interacting
Table 4 Empirical p-values for the SNs and the PPVs within each evaluation dataset
Validation dataset    Interactions                        P-value including the promiscuous motifs    P-value eliminating the promiscuous motifs
MIPS                  MM interactions                     0.166                                       0
MIPS                  The aggregate level interactions    0.022                                       0
DIP CORE              MM interactions                     0.112                                       0
DIP CORE              The aggregate level interactions    0.018                                       0
The empirical p-value is defined as the ratio of the times that the PPV (or SN) of randomized interactions was equal to or greater than that of the
predicted interactions.
Table 5 Motifs with a high probability of occurrence
Motif      Pattern                            Description
PS00001    N-{P}-[ST]-{P}                     N-glycosylation site
PS00004    [RK](2)-x-[ST]                     cAMP- and cGMP-dependent protein kinase phosphorylation site
PS00005    [ST]-x-[RK]                        Protein kinase C phosphorylation site
PS00006    [ST]-x(2)-[DE]                     Casein kinase II phosphorylation site
PS00007    [RK]-x(2,3)-[DE]-x(2,3)-Y          Tyrosine kinase phosphorylation site
PS00008    G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}    N-myristoylation site
PS00009    x-G-[RK]-[RK]                      Amidation site
PS00016    R-G-D                              Cell attachment sequence
PS00017    [AG]-x(4)-G-K-[ST]                 ATP/GTP-binding site motif A (P-loop)
PS00029    L-x(6)-L-x(6)-L-x(6)-L             Leucine zipper pattern
PS00294    C-{DENQ}-[LIVM]-x>                 Prenyl group binding site (CAAX box)
PS00342    [STAGCN]-[RKH]-[LIVMAFY]>          Microbodies C-terminal targeting signal
The first column is the Prosite accession number of the motif. The second column is the pattern of the motif, and the third column is the description
of the motif.
proteins and, as discussed before, it is unlikely to detect
interactions with rare motifs and domains. (4) Compared with
other methods that ignore the gaps between the domains, our
method used an aggregate level of interactions combining
MM, MD, and DD levels. Nevertheless, not all the motifs in
the Prosite dataset mediate protein interactions, and our
method does not deal with this case.
Materials and methods
Data sources
In this study, we used the S. cerevisiae genome and the related
protein interactions, non-interacting proteins, and domain and
motif information. Three independent public datasets, iPfam,
DIP CORE and MIPS, were used for performance assessment.
Yeast genome. We downloaded the yeast genome from the
Saccharomyces Genome Database (SGD, June 2006 version)
(http://yeastgenome.org/). It contains 5884 open reading
frames (ORFs), of which 5873 ORFs have the Swiss-Prot
accession numbers.
Positive interaction dataset. We constructed the GSPs for
yeast proteins following the integration procedure of Wu et al.40 We downloaded the
literature-curated (LC) and high-throughput (HTP) physical
protein–protein interaction datasets, as previously reported,31 and
excluded all the intra-protein interactions. The LC dataset
was directly adopted. The HTP dataset and the other five HTP
interaction datasets from BioGRID32 (version 2.0.33) were
filtered. The HTP interactions that satisfied at least one of the
following criteria were assembled into our interaction dataset:
(1) Both A–B and B–A interactions are recorded in the HTP
datasets. (2) The interaction is reported in more than one
publication. (3) The interaction is derived from more than one
experimental method. (4) The RSS values14,33 on
both the biological process (BP) and cellular component (CC)
ontologies are greater than or equal to 0.8. Using these
criteria, we obtained 13 531 interactions involving 2854
proteins with at least one domain.
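The four HTP filtering criteria can be expressed as a single predicate. The sketch below is illustrative only: the record layout (`HtpRecord` with bait, prey, publication identifier and method fields) and the function name `keep_htp_interaction` are assumptions made here, not the format of BioGRID or of the datasets cited above, and the RSS values are taken as precomputed inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HtpRecord:
    """One reported HTP observation (hypothetical record layout)."""
    bait: str
    prey: str
    publication: str
    method: str


def keep_htp_interaction(a, b, records, rss_bp, rss_cc, cutoff=0.8):
    """Return True if the pair (a, b) meets at least one of the four criteria."""
    hits = [r for r in records if {r.bait, r.prey} == {a, b}]
    if not hits:
        return False  # not an HTP interaction at all
    directions = {(r.bait, r.prey) for r in hits}
    reciprocal = (a, b) in directions and (b, a) in directions      # criterion (1)
    multi_publication = len({r.publication for r in hits}) > 1      # criterion (2)
    multi_method = len({r.method for r in hits}) > 1                # criterion (3)
    high_rss = rss_bp >= cutoff and rss_cc >= cutoff                # criterion (4)
    return reciprocal or multi_publication or multi_method or high_rss


records = [
    HtpRecord("A", "B", "pub1", "Y2H"),
    HtpRecord("B", "A", "pub1", "Y2H"),
    HtpRecord("A", "C", "pub1", "Y2H"),
]
```

For example, `keep_htp_interaction("A", "B", records, 0.1, 0.1)` is true because the pair was observed in both directions, whereas the pair A–C passes only if both RSS values reach the cutoff.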
Negative interaction dataset. The GSNs for yeast were
obtained from the SPIDer database,14,33 selecting protein pairs whose RSS values
on BP are within the range [0, 0.4] and whose RSS values on CC
are within the range [0, 0.5]. Overall, we obtained 578 459
non-interactions from 3491 proteins with at least one domain.
Domain assignment. The protein–domain relationships for
yeast proteins were obtained from Pfam.41 Pfam is a collection
of protein domains and families, represented as multiple
sequence alignments and as profile hidden Markov models.
As of July 2008, Pfam23.0 (http://pfam.sanger.ac.uk/) consists
of 10 340 families. The Swiss-Prot accession numbers indicate
that there are 4435 ORFs with at least one domain. In total,
2371 domains were assigned.
Motif assignment. Motif sequences were obtained from
the Prosite database.42 The Prosite database is a
collection of patterns. As of March 2009, Prosite 20.44
(http://www.expasy.ch/prosite/) consists of 1313 patterns.
We used the ScanProsite tool43 to scan protein sequences
against the Prosite database. There are 637 motifs assigned
within the 4435 selected ORFs.
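To make the pattern notation concrete, the following sketch converts a simple Prosite pattern into a Python regular expression and reports 1-based match positions. It is not the ScanProsite tool and makes no claim to reproduce its behaviour: the function names `prosite_to_regex` and `scan` are hypothetical, only the basic syntax is handled (residues, `x`, `[...]`, `{...}`, repeat counts, and the `<` and `>` terminal anchors at the ends of a pattern), and overlapping matches are reported, which is an assumption made here.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden}, with optional (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a basic Prosite pattern such as '[ST]-x(2)-[DE]' into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # C-terminal anchor
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                   # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(sequence, pattern):
    """Return 1-based (start, end) positions of all, possibly overlapping, matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.start() + len(m.group(1)))
            for m in regex.finditer(sequence)]
```

As a usage example, `scan("AASAADEA", "[ST]-x(2)-[DE]")` locates the casein kinase II pattern of Table 5 in a made-up sequence. Short patterns of this kind match very many sequences, which is consistent with the promiscuity discussed above.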
Evaluation datasets. Three datasets were used to validate our
results, including iPfam,35 the S. cerevisiae core subset of DIP
(DIP CORE)44 that has passed a quality assessment and
MIPS.45 Only the interactions for yeast were extracted. For
iPfam (Pfam release 21.0), we only picked the domain interactions
assigned to different proteins, and we obtained 129
domain interactions. We obtained 3251 interactions with
different proteins, called inter-protein interactions, from DIP
CORE (October 2008 version). We extracted the inter-protein
interactions from MIPS (May 2005 version), excluding the
two-hybrid interactions, and obtained 4311 interactions.
Predicting binding site pairs at motif–motif, motif–domain and
domain–domain levels
To detect MM, MD and DD pairs that are significantly
over-represented in GSPs compared with GSNs, the one-tailed
Fisher’s exact test was used. In addition, the q-value method
was applied to correct the false discovery rate (FDR) of
multiple testing. For the inferred binding site pairs, the
independent datasets were used for validation.
One-tailed Fisher’s exact test for determining interaction
pairs that occur in a significant number of the GSPs. The
one-tailed Fisher’s exact test was used to determine the pairs
that occur more often in the GSPs than in the GSNs. A
two-by-two frequency table was constructed for each pair, in
which the two rows represent the GSPs and GSNs, and the
two columns represent the number of protein pairs that contain
the given MM, MD or DD pair and the number that do not.
P-values were calculated using the R statistical package.
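For illustration, the one-tailed test on such a table can be written directly from the hypergeometric distribution. The sketch below is a plain Python stand-in, not the R routine used in the study; the function name `fisher_one_tailed` is hypothetical, and exact integer arithmetic is used for clarity, so it is suited to small tables rather than to the full GSPs and GSNs counts.

```python
from math import comb


def fisher_one_tailed(a, b, c, d):
    """Upper-tail p-value for the table [[a, b], [c, d]].

    Rows are GSPs and GSNs; columns are protein pairs with and without
    the site pair. Returns P(X >= a) under the hypergeometric null, i.e.
    the probability of seeing at least `a` GSP pairs containing the site
    pair if it were distributed independently of interaction status.
    """
    row1 = a + b           # number of GSP protein pairs
    col1 = a + c           # protein pairs containing the site pair
    n = a + b + c + d      # all protein pairs
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)


# Toy table: the site pair occurs in 3 of 4 GSP pairs and 1 of 4 GSN pairs.
p = fisher_one_tailed(3, 1, 1, 3)
```

Here the tail sums the two tables at least as extreme as the observed one, giving 17/70. With very small margins the attainable p-values are coarse, which is one way to see why rare domains and motifs are hard to detect with this test.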
Multiple testing. Using the R package written by
Dabney and Storey,46 we obtained a q-value for each MM,
MD and DD pair. The p-value is a measure of significance
in terms of the false positive rate, that is, it controls the type I error at a given level.46 The q-value is
similar to the p-value, but it is a measure of significance
in terms of the FDR.47 The QVALUE software
(http://www.genomics.princeton.edu/storeylab/qvalue/linux.html)
was used to convert the p-values to their corresponding q-values.
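The conversion from p-values to q-values can be sketched as below. This is not the QVALUE software: the function `q_values` is a hypothetical, simplified stand-in in which the proportion of true null hypotheses, `pi0`, is supplied by the caller rather than estimated from the p-value distribution. With the default `pi0 = 1.0` the result reduces to Benjamini-Hochberg adjusted p-values, which is a conservative approximation of Storey's q-values.

```python
def q_values(pvals, pi0=1.0):
    """Step-up FDR adjustment; pi0 is an assumed fraction of true nulls."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * pvals[i] * m / rank)
        q[i] = running_min
    return q


qs = q_values([0.01, 0.04, 0.03, 0.005])
significant = [i for i, q in enumerate(qs) if q < 0.01]
```

In this toy input no test reaches the q < 0.01 cutoff used in the Results, because the smallest q-value is 0.02.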
Evaluating the predicted interaction pairs. To validate the
quality of the binding site pairs, we used two measures: the
fraction of interacting proteins containing the predicted
binding site pairs in the validation dataset, i.e. the sensitivity
(SN), and the fraction of predicted binding site pairs mapped
on the validation dataset, i.e. the positive predictive value
(PPV). The sensitivity of a test is used to measure the ability to
detect true positives. We calculated the SN and PPV of our
inferred results for each evaluation dataset.
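The two measures as defined above can be computed as in the following sketch. The function name `sn_ppv` and the toy data are hypothetical; the sketch also assumes that the predicted pairs passed in have already been restricted to domains and motifs occurring in the evaluation dataset, and that a site pair is unordered.

```python
def sn_ppv(predicted_pairs, validation_ppis, annot):
    """Return (SN, PPV) for a set of predicted site pairs.

    SN  = fraction of validation PPIs containing at least one predicted pair.
    PPV = fraction of predicted pairs found in at least one validation PPI.
    """
    predicted = {frozenset(p) for p in predicted_pairs}
    covered = 0
    mapped = set()
    for a, b in validation_ppis:
        present = {frozenset((x, y)) for x in annot[a] for y in annot[b]}
        hits = present & predicted
        if hits:
            covered += 1
            mapped |= hits
    return covered / len(validation_ppis), len(mapped) / len(predicted)


# Hypothetical toy data: protein -> set of site accessions.
annot = {"P1": {"PF1", "PS1"}, "P2": {"PF2"}, "P3": {"PF3"}}
validation = [("P1", "P2"), ("P2", "P3")]
sn, ppv = sn_ppv([("PF1", "PF2"), ("PF3", "PF9")], validation, annot)
```

In the toy case one of two PPIs is covered and one of two predicted pairs is mapped, so both measures equal 0.5.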
A reasonable prediction approach should contain more
inferred binding site pairs mapped on the evaluation datasets
than expected at random. To verify this, we calculated the
PPV and the SN of randomized pairs for each evaluation
dataset. First, we extracted domains and motifs from an
evaluation dataset and grouped them into an object set. Then,
we randomly extracted N pairs from the object set and
grouped them into a randomized binding site pair set, where
N is the number of predicted pairs from the object set. Finally,
we calculated the SN and PPV of the randomized pair set for
the evaluation dataset. This randomization process was
repeated 1000 times, and the average PPV (or SN) over the
1000 replicates was obtained. The same replicates also yield
an empirical p-value for each evaluation dataset: each of the
1000 randomizations gives one PPV (or SN), and the p-value is
defined as the fraction of replicates whose PPV (or SN) is
equal to or greater than the observed PPV (or SN).
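The randomization procedure can be sketched as below. This is an illustrative outline rather than the code used in the study: the function name empirical_p_value, the seed, and the choice to draw unordered pairs uniformly (self-pairs allowed, no duplicate pairs within a replicate) are assumptions made here. The statistic argument stands for any function that scores a pair set, such as the PPV or SN against an evaluation dataset.

```python
import random


def empirical_p_value(observed, object_set, n_pairs, statistic,
                      n_replicates=1000, seed=0):
    """Return (mean of the null statistic, empirical p-value).

    observed   -- PPV (or SN) of the real predictions
    object_set -- domains and motifs extracted from the evaluation dataset
    n_pairs    -- N, the number of predicted pairs to match in each replicate
    statistic  -- function mapping a set of site pairs to a PPV (or SN)
    """
    objects = sorted(object_set)
    max_pairs = len(objects) * (len(objects) + 1) // 2
    if n_pairs > max_pairs:
        raise ValueError("n_pairs exceeds the number of distinct pairs")
    rng = random.Random(seed)
    null_values = []
    for _ in range(n_replicates):
        random_pairs = set()
        # Draw until N distinct unordered pairs have been collected.
        while len(random_pairs) < n_pairs:
            random_pairs.add(
                frozenset((rng.choice(objects), rng.choice(objects)))
            )
        null_values.append(statistic(random_pairs))
    mean_null = sum(null_values) / n_replicates
    # Fraction of replicates at least as large as the observed value.
    p_value = sum(v >= observed for v in null_values) / n_replicates
    return mean_null, p_value
```

With 1000 replicates the smallest non-zero p-value that can be reported is 0.001, so an empirical p-value of 0 should be read as p < 0.001.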
Conclusions
It is known that proteins containing predicted binding sites
may not actually interact with each other. Although our
results cannot be used to predict PPIs directly, we believe that
in combination with other evidence, such as protein profiles
and protein localization, our results should help in predicting
putative PPIs, in particular in other species. More interestingly
and importantly, our results could give biologists clues to
assay potential protein interactions via mutation experiments.
However, we should note that our method is ineffective for
detecting interactions with rare motifs and/or domains. This
limitation may be mitigated by DPEA, developed by Riley
et al.,23 or by new methods proposed in the
future. We will also update and extend our previous online
database, SPIDer,33 to present the useful protein binding site
information. In summary, all of these tools should provide a
better understanding of protein–protein interaction mecha-
nisms, which is important for insights into protein functions,
diseases and genomic evolution.
Abbreviations
DD domain–domain
MD motif–domain
FDR false discovery rate
GSNs gold standard negative set
GSPs gold standard positive set
MM motif–motif
PPIs protein–protein interactions
PPV positive predictive value
SN sensitivity
Acknowledgements
We thank two anonymous reviewers for their constructive
comments. This work was supported by State Key Laboratory
of Earth Surface Processes and Resource Ecology and by
NSFC (Grant 30770445).
References
1 O. Keskin, N. Tuncbag and A. Gursoy, Curr. Pharm. Biotechnol., 2008, 9, 67–76.
2 N. Tuncbag, G. Kar, O. Keskin, A. Gursoy and R. Nussinov, Briefings Bioinf., 2009, 10, 217–232.
3 M. G. Kann, Briefings Bioinf., 2007, 8, 333–346.
4 P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. J. Yang, M. Johnston, S. Fields and J. M. Rothberg, Nature, 2000, 403, 623–627.
5 T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori and Y. Sakaki, Proc. Natl. Acad. Sci. U. S. A., 2001, 98, 4569–4574.
6 Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. L. Adams, A. Millar, P. Taylor, K. Bennett, K. Boutilier, L. Y. Yang, C. Wolting, I. Donaldson, S. Schandorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano, D. Dewar, Z. Lin, K. Michalickova, A. R. Willems, H. Sassi, P. A. Nielsen, K. J. Rasmussen, J. R. Andersen, L. E. Johansen, L. H. Hansen, H. Jespersen, A. Podtelejnikov, E. Nielsen, J. Crawford, V. Poulsen, B. D. Sorensen, J. Matthiesen, R. C. Hendrickson, F. Gleeson, T. Pawson, M. F. Moran, D. Durocher, M. Mann, C. W. V. Hogue, D. Figeys and M. Tyers, Nature, 2002, 415, 180–183.
7 G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann and B. Seraphin, Nat. Biotechnol., 1999, 17, 1030–1032.
8 O. Puig, F. Caspary, G. Rigaut, B. Rutz, E. Bouveret, E. Bragado-Nilsson, M. Wilm and B. Seraphin, Methods, 2001, 24, 218–229.
9 K. Tarassov, V. Messier, C. R. Landry, S. Radinovic, M. M. S. Molina, I. Shames, Y. Malitskaya, J. Vogel, H. Bussey and S. W. Michnick, Science, 2008, 320, 1465–1470.
10 T. Dandekar, B. Snel, M. Huynen and P. Bork, Trends Biochem. Sci., 1998, 23, 324–328.
11 A. J. Enright, I. Iliopoulos, N. C. Kyrpides and C. A. Ouzounis, Nature, 1999, 402, 86–90.
12 C. S. Goh, A. A. Bogan, M. Joachimiak, D. Walther and F. E. Cohen, J. Mol. Biol., 2000, 299, 283–293.
13 E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates and D. Eisenberg, Science, 1999, 285, 751–753.
14 X. M. Wu, L. Zhu, J. Guo, D. Y. Zhang and K. Lin, Nucleic Acids Res., 2006, 34, 2137–2150.
15 C. M. Deane, L. Salwinski, I. Xenarios and D. Eisenberg, Mol. Cell. Proteomics, 2002, 1, 349–356.
16 E. Sprinzak, S. Sattath and H. Margalit, J. Mol. Biol., 2003, 327, 919–923.
17 T. Pawson and P. Nash, Science, 2003, 300, 445–452.
18 P. Puntervoll, R. Linding, C. Gemund, S. Chabanis-Davidson, M. Mattingsdal, S. Cameron, D. M. A. Martin, G. Ausiello, B. Brannetti, A. Costantini, F. Ferre, V. Maselli, A. Via, G. Cesareni, F. Diella, G. Superti-Furga, L. Wyrwicz, C. Ramu, C. McGuigan, R. Gudavalli, I. Letunic, P. Bork, L. Rychlewski, B. Kuster, M. Helmer-Citterich, W. N. Hunter, R. Aasland and T. J. Gibson, Nucleic Acids Res., 2003, 31, 3625–3630.
19 E. Sprinzak and H. Margalit, J. Mol. Biol., 2001, 311, 681–692.
20 S. K. Ng, Z. Zhang and S. H. Tan, Bioinformatics, 2003, 19, 923–929.
21 M. H. Deng, S. Mehta, F. Z. Sun and T. Chen, Genome Res., 2002, 12, 1540–1548.
22 H. Lee, M. H. Deng, F. Z. Sun and T. Chen, BMC Bioinf., 2006, 7, 269.
23 R. Riley, C. Lee, C. Sabatti and D. Eisenberg, Genome Biol., 2005, 6, R89.
24 K. S. Guimaraes, R. Jothi, E. Zotenko and T. M. Przytycka, Genome Biol., 2006, 7, R104.
25 T. M. W. Nye, C. Berzuini, W. R. Gilks, M. M. Babu and S. A. Teichmann, Bioinformatics, 2005, 21, 993–1001.
26 M. Iqbal, A. A. Freitas, C. G. Johnson and M. Vergassola, Bioinformatics, 2008, 24, 2064–2070.
27 H. Q. Li and J. Y. Li, Bioinformatics, 2005, 21, 314–324.
28 S. H. Tan, W. Hugo, W. K. Sung and S. K. Ng, BMC Bioinf., 2006, 7, 502.
29 J. Guo, X. M. Wu, D. Y. Zhang and K. Lin, Nucleic Acids Res., 2008, 36, 2002–2011.
30 H. D. Wang, E. Segal, A. Ben-Hur, Q. R. Li, M. Vidal and D. Koller, Genome Biol., 2007, 8, R192.
31 T. Reguly, A. Breitkreutz, L. Boucher, B.-J. Breitkreutz, G. C. Hon, C. L. Myers, A. Parsons, H. Friesen, R. Oughtred, A. Tong, C. Stark, Y. Ho, D. Botstein, B. Andrews, C. Boone, O. G. Troyanskaya, T. Ideker, K. Dolinski, N. N. Batada and M. Tyers, J. Biol., 2006, 5, 11.
32 C. Stark, B. J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz and M. Tyers, Nucleic Acids Res., 2006, 34, D535–D539.
33 X. M. Wu, L. Zhu, J. Guo, C. Fu, H. J. Zhou, D. Dong, Z. B. Li, D. Y. Zhang and K. Lin, BMC Bioinf., 2006, 7, S16.
34 B. J. Mayer, J. Cell Sci., 2001, 114, 1253–1263.
35 R. D. Finn, M. Marshall and A. Bateman, Bioinformatics, 2005, 21, 410–412.
36 B. Schwikowski, P. Uetz and S. Fields, Nat. Biotechnol., 2000, 18, 1257–1261.
37 C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields and P. Bork, Nature, 2002, 417, 399–403.
38 N. A. P. Mouchel and J. R. Jenkins, FEBS Lett., 2006, 580, 51–57.
39 B. Nelson, A. B. Parsons, M. Evangelista, K. Schaefer, K. Kennedy, S. Ritchie, T. L. Petryshen and C. Boone, Genetics, 2004, 166, 67–77.
40 X. M. Wu, J. Guo, D. Y. Zhang and K. Lin, Proteomics, 2009, 9, 4812–4824.
41 R. D. Finn, J. Tate, J. Mistry, P. C. Coggill, S. J. Sammut, H. R. Hotz, G. Ceric, K. Forslund, S. R. Eddy, E. L. L. Sonnhammer and A. Bateman, Nucleic Acids Res., 2008, 36, D281–D288.
42 N. Hulo, A. Bairoch, V. Bulliard, L. Cerutti, E. De Castro, P. S. Langendijk-Genevaux, M. Pagni and C. J. A. Sigrist, Nucleic Acids Res., 2006, 34, D227–D230.
43 A. Gattiker, E. Gasteiger and A. Bairoch, Appl. Bioinf., 2002, 1, 107–108.
44 I. Xenarios, L. Salwinski, X. Q. J. Duan, P. Higney, S. M. Kim and D. Eisenberg, Nucleic Acids Res., 2002, 30, 303–305.
45 U. Guldener, M. Munsterkotter, M. Oesterheld, P. Pagel, A. Ruepp, H. W. Mewes and V. Stumpflen, Nucleic Acids Res., 2006, 34, D436–D441.
46 J. D. Storey, J. R. Stat. Soc., Ser. B: Stat. Methodol., 2002, 64, 479–498.
47 J. D. Storey and R. Tibshirani, Proc. Natl. Acad. Sci. U. S. A., 2003, 100, 9440–9445.