Typically, classifiers are trained based on local features of each site in the training set of protein sequences. Thus no global sequence information is used when making predictions on the sites of the sequence to be annotated. In this work we seek to improve such classifiers by taking into account the global sequence similarity between the test sequence and the sequences in the training set.

Using Global Sequence Similarity Improves Biological Site-Specific Classifiers

Jivko Sinapov, Cornelia Caragea, Drena Dobbs and Vasant Honavar

Many problems in bioinformatics involve predicting a class label for each element of a protein sequence. Examples include:

• Prediction of RNA- and DNA-binding protein residues
• Prediction of post-translational modification sites
• Prediction of secondary structure elements in sequences

[Figure: a protein chain drawn from its N-terminus (H3N+) to its C-terminus (COO-), with individual residues marked "O-Glycosylated?" or "Non O-Glycosylated?"]

Example Problems:

1. Prediction of O-linked glycosylation sites
2. Prediction of RNA-binding protein residues
3. Prediction of protein-protein interface residues

[Figure panels: Protein-RNA binding site prediction; Glycosylation site prediction]

Let $x_{test} = \{f_1, f_2, \ldots, f_n\}$ be an n-dimensional test data point.

Apply Bayes rule:

Independence assumption:

Assign class that maximizes:

Let S1, S2, …, SN be a dataset of protein sequences.

1. Compute an N × N pairwise similarity matrix from global alignment scores, using the BLOSUM62 substitution matrix.

2. Using a spectral clustering algorithm, recursively partition the set of training sequences to obtain a hierarchical clustering of the sequences.

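[Figure: example hierarchical partitioning of 147 sequences. Cluster sizes by level: 147; 25 and 122; 94 and 28; 49 and 45; 26 and 23.]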
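Steps 1 and 2 can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_score` is a hypothetical stand-in for a BLOSUM62 lookup, the linear gap penalty of -4 is an assumed value, and alignment scores are shifted to be non-negative before clustering. Each split uses the sign of the second eigenvector of the normalized affinity matrix (found by power iteration), which is one common form of spectral bipartitioning; the poster does not say which variant or stopping rule was used, so `min_size` is also an assumption.

```python
import math

def toy_score(x, y):
    # Hypothetical stand-in for a BLOSUM62 lookup (assumption).
    return 2 if x == y else -1

def global_alignment_score(a, b, score=toy_score, gap=-4):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(prev[j - 1] + score(a[i - 1], b[j - 1]),  # match or mismatch
                            prev[j] + gap,                            # gap in b
                            curr[j - 1] + gap))                       # gap in a
        prev = curr
    return prev[-1]

def similarity_matrix(seqs):
    """Step 1: N x N matrix of pairwise global alignment scores."""
    return [[global_alignment_score(s, t) for t in seqs] for s in seqs]

def spectral_bipartition(sim, iters=300):
    """Split items in two by the sign of the second eigenvector of D^-1/2 W D^-1/2."""
    n = len(sim)
    low = min(min(row) for row in sim)
    w = [[sim[i][j] - low + 1e-9 for j in range(n)] for i in range(n)]  # non-negative affinities
    d = [sum(row) for row in w]
    # Adding the identity shifts all eigenvalues into [0, 2] for power iteration.
    m = [[w[i][j] / math.sqrt(d[i] * d[j]) + (i == j) for j in range(n)] for i in range(n)]
    norm = math.sqrt(sum(d))
    top = [math.sqrt(x) / norm for x in d]      # known leading eigenvector
    v = [math.sin(i + 1.0) for i in range(n)]   # deterministic starting vector
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]  # project out the leading eigenvector
        v = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v)) or 1.0
        v = [a / length for a in v]
    left = [i for i in range(n) if v[i] >= 0]
    right = [i for i in range(n) if v[i] < 0]
    return left, right

def hierarchical_partition(indices, sim, min_size=2):
    """Step 2: nested tuples of index lists; a plain list is a leaf cluster."""
    if len(indices) < 2 * min_size:
        return list(indices)
    sub = [[sim[i][j] for j in indices] for i in indices]
    left, right = spectral_bipartition(sub)
    if len(left) < min_size or len(right) < min_size:
        return list(indices)
    return (hierarchical_partition([indices[k] for k in left], sim, min_size),
            hierarchical_partition([indices[k] for k in right], sim, min_size))

# Toy data: two clearly separated sequence families.
seqs = ["AAAAAAAA", "AAAAAAAC", "AAAAACAA", "WWWWWWWW", "WWWWWWWY", "WWYWWWWW"]
tree = hierarchical_partition(list(range(len(seqs))), similarity_matrix(seqs))
print(tree)
```

On real data the toy scorer would be replaced by a BLOSUM62 lookup, and the leaves of the resulting tree would each receive their own classifier.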

3. Use the structure of the hierarchical partitioning to learn a Hierarchical Mixture of Experts model such that:

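Let $V_1^l, V_2^l, \ldots, V_M^l$ be the leaf nodes in the hierarchical partitioning.
Let $\theta_1, \theta_2, \ldots, \theta_M$ be the parameters of the trained Naïve Bayes models at the leaf nodes.
Let $x_{test} = \{f_1, f_2, \ldots, f_n\}$ be the input features for some residue in sequence $S_{test}$.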

Each leaf node computes the class probability for xtest according to:

Each non-leaf node combines the predictions from its children:

1. Performed 10-fold sequence-based cross-validation
2. Compared Naïve Bayes (NB) and Hierarchical Mixture of Naïve Bayes Experts (HME-NB)
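Sequence-based cross-validation is usually understood to mean that all residues of a given sequence fall in the same fold, so windows from one protein never appear in both training and test data. A minimal sketch, assuming instances are `(sequence_id, window, label)` tuples; the function names, the random seed, and the round-robin fold assignment are illustrative choices, not the authors' procedure:

```python
import random

def assign_folds(sequence_ids, k=10, seed=0):
    """Map each distinct sequence id to one of k folds."""
    ids = sorted(set(sequence_ids))
    random.Random(seed).shuffle(ids)
    return {sid: i % k for i, sid in enumerate(ids)}

def train_test_split(instances, fold_of, test_fold):
    """Split (sequence_id, window, label) tuples by the fold of their sequence."""
    train = [x for x in instances if fold_of[x[0]] != test_fold]
    test = [x for x in instances if fold_of[x[0]] == test_fold]
    return train, test

# Toy data: 20 sequences with 3 labeled windows each (made up for illustration).
instances = [(f"seq{s}", f"window{s}_{r}", +1 if r % 2 else -1)
             for s in range(20) for r in range(3)]
fold_of = assign_folds(x[0] for x in instances)
for fold in range(10):
    train, test = train_test_split(instances, fold_of, fold)
    # No sequence contributes to both sides of a split.
    assert not {x[0] for x in train} & {x[0] for x in test}
```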

              O-Glycosylation     Protein-RNA interactions   Protein-Protein interface
              NB      HME-NB      NB      HME-NB             NB      HME-NB
Accuracy      0.89    0.89        0.83    0.84               0.79    0.81
MCC           0.57    0.58        0.32    0.37               0.08    0.25
Sensitivity   0.61    0.65        0.24    0.31               0.06    0.18
Specificity   0.65    0.63        0.65    0.66               0.38    0.60
AUC           0.88    0.91        0.74    0.76               0.62    0.72
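For reference, accuracy, sensitivity, and MCC can be computed from confusion-matrix counts as sketched below. The standard textbook definitions are assumed, since the poster does not define its measures; specificity is left out because the poster does not say which definition it uses. The counts in the example are made up for illustration and are not the study's data.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and Matthews correlation coefficient from counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "mcc": mcc}

# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=6, fp=2, tn=10, fn=2)
print(metrics)
```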

Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
Label:    1111110011111110011111001011111100000001111101000000

Data points used for training and testing a classifier (window centered on the target residue, class label):

. . .
VKKFGGEVVKAGNIL,-1
KKFGGEVVKAGNILV,-1
KFGGEVVKAGNILVR,+1
FGGEVVKAGNILVRQ,+1
. . .
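The windowing shown above can be sketched in Python. The function name and the choice to skip residues too close to either terminus to have a full window are assumptions; the poster does not say how sequence ends are handled. The window size is a parameter, set here to 15 to match the excerpt.

```python
def extract_windows(sequence, labels, size=15):
    """Return (window, class_label) pairs, one per residue that has a full window.

    labels is a string of '0'/'1' characters, one per residue.
    """
    half = size // 2
    examples = []
    for center in range(half, len(sequence) - half):
        window = sequence[center - half:center + half + 1]
        examples.append((window, +1 if labels[center] == "1" else -1))
    return examples

sequence = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
labels = "1111110011111110011111001011111100000001111101000000"
examples = extract_windows(sequence, labels)
for window, label in examples[8:12]:
    print(f"{window},{label:+d}")
```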

[Figure with three panels: a) O-Glycosylation; b) Protein-RNA interaction sites; c) Protein-Protein interface sites]

Biological Motivation:

Datasets:

Dataset           Number of Sequences   Number of + Instances   Number of - Instances
O-GlycBase        216                   2168                    12147
Protein-RNA       147                   4336                    27988
Protein-Protein   42                    2350                    9204

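Naïve Bayes (NB):

Bayes rule:

$$P(C \mid f_1, f_2, \ldots, f_n) = \frac{P(f_1, f_2, \ldots, f_n \mid C)\, P(C)}{P(f_1, f_2, \ldots, f_n)}$$

Independence assumption:

$$P(f_1, f_2, \ldots, f_n \mid C) = \prod_{i=1}^{n} P(f_i \mid C)$$

Quantity maximized when assigning a class:

$$P(C) \prod_{i=1}^{n} P(f_i \mid C)$$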
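A minimal Python sketch of this decision rule for windows of residues, where feature f_i is the residue at window position i. Laplace smoothing (`alpha`), the 20-letter alphabet size, the function names, and the toy training data are assumptions for illustration; the poster does not describe its estimation details.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(windows, labels, alphabet_size=20, alpha=1.0):
    """Count class frequencies and per-position residue frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, position) -> residue counts
    for window, label in zip(windows, labels):
        for i, residue in enumerate(window):
            feature_counts[(label, i)][residue] += 1
    return class_counts, feature_counts, alphabet_size, alpha

def predict(model, window):
    """Return the class maximizing log P(C) + sum_i log P(f_i | C)."""
    class_counts, feature_counts, k, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, residue in enumerate(window):
            smoothed = feature_counts[(label, i)][residue] + alpha  # Laplace smoothing (assumption)
            score += math.log(smoothed / (count + alpha * k))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy 3-residue windows, made up for illustration.
model = train_naive_bayes(["AAK", "AAR", "GGD", "GGE"], [+1, +1, -1, -1])
print(predict(model, "AAQ"), predict(model, "GGK"))
```

Working in log space avoids numerical underflow when the window is long and many small probabilities are multiplied.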

Feature Representation:

• A window of 21 amino acids centered on the target residue

Results:

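Hierarchical Mixture of Naïve Bayes Experts (HME-NB):

Notation: leaf nodes $V_1^l, V_2^l, \ldots, V_M^l$ with Naïve Bayes parameters $\theta_1, \theta_2, \ldots, \theta_M$; test point $x_{test} = \{f_1, f_2, \ldots, f_n\}$ from test sequence $S_{test}$.

Leaf node $V_j^l$:

$$P_{V_j^l}(C \mid x_{test}, S_{test}) \propto P(C \mid \theta_j) \prod_{i=1}^{n} P(f_i \mid C, \theta_j)$$

Non-leaf node $V_j^g$:

$$P_{V_j^g}(C \mid x_{test}, S_{test}) = \sum_{V_i \in child(V_j^g)} P_{V_i}(C \mid x_{test}, S_{test})\, P(S_{test} \in V_i \mid S_{test} \in par(V_i))$$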
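The way a non-leaf node combines its children's predictions can be sketched in Python. This is an illustrative sketch only: the `Node` class, its field names, and the fixed `gate` values are hypothetical, and the poster does not state how the gating probability (the chance that the test sequence belongs to a child cluster given that it belongs to the parent) is estimated, so it is supplied here as a given number. The leaf experts are stand-in functions returning class posteriors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Posterior = Dict[int, float]

@dataclass
class Node:
    gate: float = 1.0  # assumed given: P(S_test in this node | S_test in its parent)
    expert: Optional[Callable[[str], Posterior]] = None  # leaf classifier, e.g. Naive Bayes
    children: List["Node"] = field(default_factory=list)

def predict(node: Node, x_test: str) -> Posterior:
    """Leaf: return the expert's posterior. Non-leaf: gate-weighted sum over children."""
    if not node.children:
        return node.expert(x_test)
    combined: Posterior = {}
    for child in node.children:
        for label, p in predict(child, x_test).items():
            combined[label] = combined.get(label, 0.0) + child.gate * p
    return combined

# Two hypothetical leaf experts with fixed posteriors, for illustration only.
leaf_a = Node(gate=0.25, expert=lambda x: {+1: 0.9, -1: 0.1})
leaf_b = Node(gate=0.75, expert=lambda x: {+1: 0.2, -1: 0.8})
root = Node(children=[leaf_a, leaf_b])
posterior = predict(root, "VKKFGGEVVKAGNIL")
print(posterior)
```

Because the gates of sibling nodes sum to one in this example, the combined values remain a valid probability distribution over the classes.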

A qualitative comparison of Naïve Bayes (NB) and Hierarchical Mixture of Naïve Bayes Experts (HME-NB) on the task of predicting protein-protein interface sites of the anionic trypsin-2 precursor of Rattus norvegicus (shown as spheres) in complex with the ecotin precursor of E. coli (in green). Each residue of the anionic trypsin-2 precursor is colored according to whether the prediction is a True Positive (red), True Negative (gray), False Positive (blue), or False Negative (yellow). For both methods, the False Positive Rate (FPR) is fixed at 0.1. At this FPR, HME-NB achieves a higher True Positive Rate (TPR = 0.88) than NB (TPR = 0.56).

Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science

Acknowledgements: This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs

Conclusion:

• Developed a classifier that improves the labeling of biological sequence data
