Protein-protein interaction using SVM-based kernel, Jaccard coefficient and Gene Ontology
DESCRIPTION
Protein-protein interactions discovered by existing high-throughput techniques contain a very high proportion of false positives. Here we present an SVM-based approach that builds a model from sequence-based and non-sequence-based information about the interacting proteins. The model is used to assess the reliability of given protein-protein interactions. It was run on interaction data for a pathogenic bacterium, Treponema pallidum (which causes syphilis in humans), obtained from yeast two-hybrid experiments. Various kernels were used to build the model; of these, the sigmoid kernel performed best when all the features were combined, with an area under the receiver operating characteristic (ROC) curve of 0.53.
TRANSCRIPT
SVM based approach to assess the reliability of
protein-protein interactions
Meher Preethi Boorgula, Ronak Shah,
Neerja Katiyar
George Mason University
Motivation:
• Protein interactions play a key role in many cellular processes.
• Distortion of protein interfaces may lead to the development of many diseases.
• Reliable protein-protein interactions (PPIs) that are conserved among different species and involved in diseases would be very helpful for researchers.
Problem Statement:
• Protein-protein interactions (PPIs) are very helpful in the functional annotation of proteins, so it is important that the PPI data are reliable.
• We therefore try to predict the reliability of PPIs for a disease-causing bacterium.
Objective:
• To create a prediction model based on a kernel method (SVM) to assess the reliability of PPIs in Treponema pallidum obtained from the yeast two-hybrid (Y2H) system.
• To classify the interactions as reliable or not reliable.
Introduction:
• Protein-protein interactions can be identified with high-throughput techniques such as yeast two-hybrid (Y2H) and mass spectrometry (MS).
• The main disadvantage of these existing techniques is the number of false positives in the data obtained.
• Assessing the reliability of PPIs is therefore necessary.
Methodology:
Preparation of data sets
Extract the attributes
Create & test model using SVM Light
Evaluate the performance of the model
Analyze the reliability of PPI data sets
Datasets:
• Raw interaction data were obtained from Y2H experiments performed at the J. Craig Venter Institute.
• These data were then organized into train and test sets with equal numbers of positive and negative examples.
• Positive – High Confidence data
• Negative – Low Confidence data
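One way to organize such a balanced split is sketched below in Python. This is a minimal illustration, not the authors' procedure: the function name `balanced_split`, the downsampling of the larger class, the 50/50 train/test proportion and the fixed random seed are all assumptions.

```python
import random


def balanced_split(positives, negatives, seed=0):
    """Return (train, test) lists of (example, label) pairs.

    Assumption: the larger class is randomly downsampled to the size of
    the smaller one, and each class is then halved between train and test.
    Labels are +1 (high confidence) and -1 (low confidence).
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    pos = rng.sample(list(positives), n)
    neg = rng.sample(list(negatives), n)
    half = n // 2
    train = [(x, 1) for x in pos[:half]] + [(x, -1) for x in neg[:half]]
    test = [(x, 1) for x in pos[half:]] + [(x, -1) for x in neg[half:]]
    return train, test
```

Downsampling discards data from the larger class; it is used here only because it is the simplest way to obtain equal class sizes.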
Dataset (Contd…)
• All Interactions = 2993
• High Confidence = 721
• Common Interactions = 66
• Total (excluding common) = 2993 + 721 − 66 = 3648
• Train and test datasets were made by taking 1824 interactions each.
Extracting Attributes:
• Attributes chosen include:
- Sequence based:
i. Occurrence of 5-mers in the sequence data
ii. Hydrophobicity
- Non-sequence based:
i. Jaccard coefficient
ii. GO annotation
Hydrophobicity:
• Protein interaction depends on the nature of the active/binding site.
• A hydrophobicity profile was used to capture this feature.
• Average hydropathy was calculated for each sequence from the hydrophobicity of each amino acid residue.
• This was obtained using the tool “ProteinGRAVY”.
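The average hydropathy described above can be illustrated with a short Python sketch. It assumes the Kyte-Doolittle hydropathy scale (the slides name only the tool, not the scale); the function names `gravy` and `pair_features`, and the choice of one score per interacting protein, are hypothetical.

```python
# Kyte-Doolittle hydropathy value for each standard amino acid (assumed scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean hydropathy over the standard residues of a protein sequence."""
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


def pair_features(bait_seq, prey_seq):
    """Hypothetical pairing: one hydropathy feature per interacting protein."""
    return [gravy(bait_seq), gravy(prey_seq)]
```

A positive score indicates a more hydrophobic sequence overall, a negative score a more hydrophilic one.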
Jaccard coefficient:
• In a PPI network, the neighbors of interacting proteins also tend to interact.
Jaccard coefficient:
J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|
where u, v are the interacting proteins and
N(X) = set of neighbors of protein X in the PPI network
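As a small illustration, the Jaccard coefficient in its standard form (number of shared neighbors divided by the number of neighbors of either protein) can be computed from a list of interaction pairs as follows. The helper names and the toy edge list are hypothetical, and whether u and v themselves should be excluded from the neighbor sets is not stated in the slides; this sketch leaves them in.

```python
def neighbors(edges):
    """Build a protein -> set-of-neighbors map from undirected interaction pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def jaccard(graph, u, v):
    """Shared neighbors of u and v divided by all neighbors of u or v."""
    nu = graph.get(u, set())
    nv = graph.get(v, set())
    union = nu | nv
    if not union:
        return 0.0  # neither protein has any neighbor
    return len(nu & nv) / len(union)


# Toy network (made-up protein names).
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
toy_graph = neighbors(toy_edges)
score = jaccard(toy_graph, "A", "B")  # N(A)={B,C}, N(B)={A,C,D} -> 1/4
```

The score lies between 0 (no shared neighbors) and 1 (identical neighbor sets).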
GO Annotations:
• Proteins that are present in the same cellular component or that participate in the same biological processes are more likely to interact.
• This was captured by extracting the GO IDs shared by the interacting proteins.
• Interacting protein pairs with at least one common GO ID were considered reliable.
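The shared-GO-ID rule above amounts to a set intersection. A minimal Python sketch follows; the function names and the binary 0/1 encoding of the feature are assumptions, since the slides do not say how the attribute was encoded.

```python
def shared_go_ids(go_a, go_b):
    """Return the GO IDs annotated to both proteins."""
    return set(go_a) & set(go_b)


def go_feature(go_a, go_b):
    """Assumed encoding: 1 if the pair shares at least one GO ID, else 0."""
    return 1 if shared_go_ids(go_a, go_b) else 0


# Illustrative annotations for two made-up proteins.
bait_go = {"GO:0005737", "GO:0006412"}
prey_go = {"GO:0005737", "GO:0003677"}
feature = go_feature(bait_go, prey_go)  # the pair shares GO:0005737
```

A protein with no annotation yields an empty set and therefore a feature value of 0, which cannot be told apart from a genuine lack of shared terms.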
Occurrence of 5-mers
• The spectrum kernel models a sequence in the space of all k-mers (here, 5-mers).
• All possible 5-mers in the protein sequences were obtained for the data.
• The number of times each 5-mer appears in the sequence data was extracted for both bait and prey proteins.
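The 5-mer counting step can be sketched in Python as follows. The function names are hypothetical; `spectrum_kernel` shows the usual spectrum-kernel value (the dot product of two k-mer count vectors) and is included for illustration only, since the slides describe extracting the counts but not how they were fed to the SVM.

```python
from collections import Counter


def kmer_counts(sequence, k=5):
    """Count every overlapping k-mer in a sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=5):
    """Dot product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())
```

With 20 amino acids there are 20^5 = 3,200,000 possible 5-mers, so storing only the k-mers that actually occur keeps the representation sparse.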
Creating & Testing Model:
• SVM Light was used to create classification models based on linear and sigmoid kernels.
• The test data were then classified with the model.
• The performance of the model was evaluated based on accuracy, precision and recall.
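SVM Light reads one example per line in the sparse form `<target> <index>:<value> ...`, with feature indices starting at 1 and listed in increasing order. The Python sketch below writes feature vectors in that form; the function names, the feature order and the file names in the comments are hypothetical.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' (1-based indices)."""
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1")
    parts = ["+1" if label == 1 else "-1"]
    for index, value in enumerate(features, start=1):
        if value != 0:  # zero-valued features may be omitted in the sparse format
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def to_svmlight_text(examples):
    """Join (label, features) pairs into the text of an SVM Light input file."""
    return "\n".join(to_svmlight_line(y, x) for y, x in examples) + "\n"


# Typical command-line use once the files are written (file names are placeholders):
#   svm_learn -t 0 train.dat model      # -t 0 selects the linear kernel
#   svm_learn -t 3 train.dat model      # -t 3 selects the sigmoid (tanh) kernel
#   svm_classify test.dat model predictions
```

Kernel parameters other than `-t` are left at their defaults here; the slides do not report the values that were used.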
Experiments Performed:
1) Model generated using the hydrophobicity attribute.
2) Model generated using the Jaccard coefficient (JC) attribute.
3) Model generated using both of these attributes.
4) Model generated using both of these attributes on a different data set (equal number of positive and negative examples).
Results for Linear Kernel:
                Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    79.99   79.99   79.88   51.23
Precision (%)   -       -       -       -
Recall (%)      0.0     0.0     0.0     0.0
Results for Sigmoid Kernel:
                Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    -       -       79.88   57.26
Precision (%)   -       -       0.0     57.80
Recall (%)      -       -       0.0     45.79
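For reference, the three metrics can be computed from true and predicted labels as in the Python sketch below. The function name and the toy labels are hypothetical. The toy case shows one possible reading of the results, not a confirmed diagnosis: a model that predicts every example as negative has zero recall, an undefined precision (no predicted positives to divide by), and an accuracy equal to the share of negative examples.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall in percent (None when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    return {
        "accuracy": 100.0 * (tp + tn) / len(pairs),
        "precision": 100.0 * tp / (tp + fp) if tp + fp else None,
        "recall": 100.0 * tp / (tp + fn) if tp + fn else None,
    }


# Toy case: 1 positive, 4 negatives, everything predicted negative.
all_negative = evaluate([1, -1, -1, -1, -1], [-1, -1, -1, -1, -1])
```

In the toy case, accuracy is 80% even though no positive example is recovered, which is why accuracy alone is a weak indicator on imbalanced data.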
Observation:
• The results obtained were not reliable, as the model was built using only two attributes.
• Two attributes are not enough to discriminate efficiently between the positive and negative examples.
• Also, the positive examples had no significant effect on the model when it was created.
To Be done:
• Extract the attribute “occurrence of 5-mers” for the protein pairs and repeat all the experiments.
• Obtain data from the IntAct database to increase the number of positive examples and to offset the false positives in the data, which come from Y2H experiments.
• Compare the performance with the existing model based on logistic regression.
Problems:
• The major problem in extracting annotation-dependent attributes was that Treponema is not fully annotated.
• The interaction data for Treponema are also not reliable.
Future Work:
• We would like to apply this model to Streptococcus pneumoniae.
• Using PSSM scores obtained by running PSI-BLAST would be helpful.
• Analyze the biological relevance of our predictions and then experimentally test the interactions that the model predicts to be reliable.
References:
• Dr. Peter Uetz et al. (J. Craig Venter Institute)
• Asa Ben-Hur & William Stafford Noble, “Kernel methods for predicting protein–protein interactions”
• SVM Light: http://svmlight.joachims.org/
• Protein GRAVY: http://www.bioinformatics.org/sms2/protein_gravy.html
• PIR: http://pir.georgetown.edu/