Protein-protein interaction using SVM-based kernel, Jaccard coefficient and Gene Ontology

SVM-based approach to assess the reliability of protein-protein interactions

Meher Preethi Boorgula, Ronak Shah, Neerja Katiyar

George Mason University

Motivation:

• Protein interactions play a key role in many cellular processes.
• Distortion of protein interfaces may lead to the development of many diseases.
• Reliable protein-protein interactions (PPIs) that are conserved across species and involved in diseases would be very helpful to researchers.

Problem Statement:

• Protein-protein interactions (PPIs) are very helpful in the functional annotation of proteins, so it is important that the PPI data is reliable.
• Thus, we try to predict the reliability of PPIs with respect to a disease-causing bacterium.


Objective:

• To create a prediction model based on a kernel method (SVM) to assess the reliability of PPIs in Treponema pallidum obtained from the yeast two-hybrid (Y2H) system.
• To classify the interactions as reliable or not reliable.


Introduction:

• Protein-protein interactions can be identified with high-throughput techniques such as yeast two-hybrid (Y2H) and mass spectrometry (MS).
• The main disadvantage of these techniques is the number of false positives in the data obtained.
• So, assessing the reliability of PPIs is necessary.

Methodology:

Preparation of data sets

Extract the attributes

Create & test model using SVM Light

Evaluate the performance of the model

Analyze the reliability of PPI data sets


Datasets:

• Raw interaction data was obtained from Y2H experiments performed at the J. Craig Venter Institute.
• This data was then organized into train and test sets by considering an equal number of positive and negative examples.
  - Positive: high-confidence data
  - Negative: low-confidence data


Dataset (Contd…)

• All interactions = 2993
• High confidence = 721
• Common interactions = 66
• Total (excluding common) = 2993 + 721 - 66 = 3648
• Train and test datasets were made by taking 1824 interactions each.
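The counts above can be checked with simple arithmetic, and an even split into two sets can be sketched as follows. This is a minimal illustration only: the function name `split_in_half`, the fixed random seed and the integer placeholders standing in for interactions are assumptions, not the procedure actually used to build the sets.

```python
import random


def split_in_half(examples, seed=0):
    """Shuffle the examples and split them into two halves (hypothetical helper)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


all_interactions = 2993
high_confidence = 721
common = 66

# Interactions present in both lists are counted only once.
total = all_interactions + high_confidence - common

# Integers stand in for the real bait/prey pairs (placeholder data).
train, test = split_in_half(range(total))
print(total, len(train), len(test))
```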


Extracting Attributes:

• Attributes chosen include:
- Sequence based:
  i. Occurrence of 5-mers in the sequence data
  ii. Hydrophobicity
- Non-sequence based:
  i. Jaccard coefficient
  ii. GO annotation

Hydrophobicity:

• Protein interaction depends on the nature of the active/binding site.
• A hydrophobicity profile was used to extract this feature.
• Average hydropathy was calculated for a sequence based on the hydrophobicity of each amino acid residue.
• This was obtained using the tool “Protein GRAVY”.
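As a rough illustration of the average hydropathy calculation, the sketch below sums per-residue hydropathy values and divides by the sequence length. It assumes the Kyte-Doolittle scale, on which GRAVY is conventionally defined; the function name `gravy` and the example sequence are hypothetical, and the slides do not say how the bait and prey values were combined into one attribute.

```python
# Kyte-Doolittle hydropathy value for each standard amino acid (assumed scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Average hydropathy of a protein sequence, ignoring non-standard symbols."""
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


# Made-up hydrophobic tetrapeptide, for illustration only.
print(gravy("AILV"))
```

A positive value indicates a mostly hydrophobic sequence and a negative value a mostly hydrophilic one.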

Jaccard coefficient:

• In a PPI network, the neighbors of interacting proteins also tend to interact.

Jaccard coefficient:

J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|

where u and v are the interacting proteins and N(X) is the set of neighbors of protein X in the PPI network.
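The Jaccard coefficient of two proteins is the size of the intersection of their neighbor sets divided by the size of the union. The sketch below computes it from a list of interacting pairs. The helper names and the toy network are hypothetical; returning 0.0 when both proteins have no neighbors, and leaving u and v inside each other's neighbor sets, are choices made here that the slides do not specify.

```python
def build_neighbors(interactions):
    """Map each protein to the set of proteins it interacts with."""
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def jaccard(u, v, neighbors):
    """Shared neighbors of u and v divided by all neighbors of u or v."""
    n_u = neighbors.get(u, set())
    n_v = neighbors.get(v, set())
    union = n_u | n_v
    if not union:
        return 0.0  # assumption: no neighbors at all counts as no support
    return len(n_u & n_v) / len(union)


# Toy network with made-up protein names.
toy_interactions = [("A", "B"), ("A", "C"), ("B", "C"),
                    ("B", "D"), ("A", "D"), ("A", "E")]
toy_neighbors = build_neighbors(toy_interactions)
print(jaccard("A", "B", toy_neighbors))
```

In the toy network, A and B share the neighbors C and D out of five proteins in the union, so the coefficient is 2/5.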

GO Annotations:

• Proteins that are present in the same cellular component or that participate in the same biological processes are more likely to interact.
• This was captured by extracting identical GO IDs for the interacting proteins.
• Interacting proteins with at least one common GO ID were considered reliable.
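The shared-GO-ID rule can be sketched as a set intersection over the annotations of the two proteins. The protein names and their annotations below are made up for illustration, and the function name `shared_go_ids` is hypothetical; how the annotations were retrieved in the original work is not shown here.

```python
def shared_go_ids(go_ids_a, go_ids_b):
    """Return the GO IDs annotated to both proteins."""
    return set(go_ids_a) & set(go_ids_b)


# Hypothetical annotations; the protein names are placeholders.
annotations = {
    "protA": {"GO:0005737", "GO:0006412"},
    "protB": {"GO:0005737", "GO:0003677"},
    "protC": {"GO:0005886"},
}

for bait, prey in [("protA", "protB"), ("protA", "protC")]:
    common = shared_go_ids(annotations[bait], annotations[prey])
    # At least one common GO ID marks the pair as supported by annotation.
    print(bait, prey, len(common) > 0)
```

A protein without any annotation can never share a GO ID, so this feature is only informative for annotated pairs.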


Occurrence of 5-mers:

• The spectrum kernel models a sequence in the space of all k-mers (here, k = 5).
• All possible 5-mers in the protein sequences were obtained for the data.
• The number of times each 5-mer appears in the sequence data was extracted for both bait and prey proteins.
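Counting 5-mers amounts to sliding a window of length 5 along the sequence, and the spectrum kernel between two sequences is the inner product of their k-mer count vectors. The sketch below shows both under stated assumptions: the function names and example sequences are hypothetical, and the slides do not say how the bait and prey counts were combined into one feature vector.

```python
from collections import Counter


def kmer_counts(sequence, k=5):
    """Count every overlapping k-mer in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=5):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


# Made-up sequences, for illustration only.
bait = "MKVLAAMKVLA"
prey = "MKVLA"
print(kmer_counts(bait).most_common(1))
print(spectrum_kernel(bait, prey))
```

Storing only the k-mers that occur keeps the representation sparse, which matters because there are 20^5 possible 5-mers over the standard amino acids.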

Creating & Testing Model:

• SVM Light was used to create classification models based on linear and sigmoid kernels.
• Test data was applied to each model in order to classify it.
• The performance of the model was evaluated based on accuracy, precision and recall values.
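SVM Light reads one example per line in the form `<target> <index>:<value> ...`, with feature indices in increasing order. The sketch below writes attribute values in that layout. The function name `to_svmlight_line`, the mapping of index 1 to hydrophobicity and index 2 to the Jaccard coefficient, and the file names in the comments are assumptions made for illustration.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...' with ascending indices."""
    parts = [
        f"{index}:{value:g}"
        for index, value in sorted(features.items())
        if value != 0  # the format is sparse, so zero-valued features can be omitted
    ]
    return " ".join([f"{label:+d}"] + parts)


# Assumed feature layout: 1 = hydrophobicity, 2 = Jaccard coefficient.
examples = [
    (+1, {1: -0.25, 2: 0.4}),
    (-1, {2: 0.0, 1: 1.5}),
]
lines = [to_svmlight_line(label, feats) for label, feats in examples]
print("\n".join(lines))

# With the lines saved to a file (hypothetical names), training and testing
# would look like:
#   svm_learn -t 0 train.dat model_linear      (-t 0 selects the linear kernel)
#   svm_learn -t 3 train.dat model_sigmoid     (-t 3 selects the sigmoid kernel)
#   svm_classify test.dat model_linear predictions
```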

Experiments Performed:

1) Model generated using the attribute hydrophobicity.
2) Model generated using the attribute Jaccard coefficient (JC).
3) Model generated using both of these attributes.
4) Model generated using both of these attributes on a different data set (equal number of positive and negative examples).


Results for Linear Kernel:

Metric          Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    79.99   79.99   79.88   51.23
Precision (%)   -       -       -       -
Recall (%)      0.0     0.0     0.0     0.0


Results for Sigmoid Kernel:

Metric          Exp-1   Exp-2   Exp-3   Exp-4
Accuracy (%)    -       -       79.88   57.26
Precision (%)   -       -       0.0     57.80
Recall (%)      -       -       0.0     45.79
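The three measures reported above can be computed from the true and predicted labels as sketched below. The function name `evaluate` and the toy labels are hypothetical. Precision is returned as `None` when no example is predicted positive, since it is undefined in that case. A classifier that labels every test pair as negative gets 0% recall, undefined precision, and an accuracy equal to the share of negatives; that pattern would be consistent with several of the rows above, although the slides do not state that this is what happened.

```python
def evaluate(true_labels, predicted_labels):
    """Return (accuracy, precision, recall) in percent; None where undefined."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)

    accuracy = 100.0 * (tp + tn) / len(pairs)
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else None
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall


# Toy labels: one positive, four negatives, everything predicted negative.
true_labels = [1, -1, -1, -1, -1]
all_negative = [-1, -1, -1, -1, -1]
print(evaluate(true_labels, all_negative))
```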


Observation:

• The results obtained were not reliable, as the model was built using only two attributes.
• Two attributes are not enough to discriminate the positive from the negative examples.
• Also, it was observed that the positive examples carried no significance when the model was created.

To be done:

• Extract the attribute “occurrence of 5-mers” for the protein pairs and perform all the experiments.

• Obtain data from the IntAct database to increase the number of positive examples and to reduce the number of false positives in the data, since it comes from Y2H experiments.

• Compare the performance with the existing model based on “Logistic Regression”.


Problems:

• The major problem in extracting attributes that depend on annotation was that Treponema is not fully annotated.
• The interaction data for Treponema is also not reliable.


Future Work:

• We would like to apply this model to Streptococcus pneumoniae.
• Using PSSM scores obtained by running PSI-BLAST would be helpful.
• Analyze the biological relevance of our predictions and then experimentally test the interactions predicted to be reliable by the model.


References:

• Dr. Peter Uetz et al. (J. Craig Venter Institute)
• Asa Ben-Hur & William Stafford Noble, “Kernel methods for predicting protein–protein interactions”
• SVM Light: http://svmlight.joachims.org/
• Protein GRAVY: http://www.bioinformatics.org/sms2/protein_gravy.html
• PIR: http://pir.georgetown.edu/
