college of information technology united arab emirates university (uaeu) uae [email protected]


College of Information Technology, United Arab Emirates University (UAEU) [email protected]

Maad Shatnawi and Nazar Zaki

Prediction of Protein Inter-Domain Linkers Using Compositional Index and Simulated Annealing
Amsterdam, The Netherlands, July 06-10, 2013
Nazar Zaki

Outline
Introduction
Existing methods
Proposed solution
Method
  Compositional index
  SA optimization
Experimental results
Conclusion and future directions

Introduction

Proteins have two types of segments: domains and linkers

Predicting inter-domain linkers is very important:
Accurate identification of functional domains
Less computational cost
Useful for protein classification, PPI prediction, fold prediction, transmembrane prediction, etc.

Existing methods

Approach | Extracted features | Technique/tools | Weaknesses
DomCut (Suyama and Ohara 2003) | Linker index | Linker index profile | Information contained in the linker index is not sufficient; lack of biological knowledge input
Scooby-Domain (George et al. 2005) | Domain lengths and hydrophobicities | A* search | A* search suffers from exponential computational time complexity
FIEFDom (Bondugula et al. 2009) | PSSM | FMO | Did not address the prediction of domains with non-contiguous sequences, and therefore discarded such proteins
DROP (Ebina et al. 2011) | Secondary structures; PSSM elements of hydrophilic residues and prolines | Random forest; SVM | Random forest can possibly be trapped in local minima and suffers from over-prediction

These methods use hydrophobicity analysis alone and therefore cannot predict TMHs longer than 25 residues. Recent high-resolution structures of helical membrane proteins revealed that TMHs can have a wide length distribution, extending beyond 25 residues.

However, HMM-based methods are considered computationally expensive, since they involve multiple sequence alignments, calculation of the profile HMM topology and parameterization, and training via expectation maximization. Moreover, the HMM-based methods are unable to correctly predict TMHs shorter than 16 residues or longer than 35 residues. For distantly related protein sequences, a profile alignment may not be possible if, for example, the sequences contain shuffled domains.

First, the learning ability drops when the datasets are small. Second, the feature extraction step requires extensive computation, so a simple algorithm that does not require sequence alignments in the feature extraction step is desirable.

Proposed solution
Our approach consists of two main steps:
Calculation of the compositional index
Employing simulated annealing to refine the prediction

Compositional index
Calculate the averaged compositional index values

Compositional index (Illustration)
>1LGH:B
AERSLSGLTEEEAIAVHDQFKTTFSAFIILAAVAHVLVWVWKPWF
Window size: 5
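The windowed averaging illustrated on this slide can be sketched in Python as follows. The slide's formula did not survive extraction, so the per-residue index used here is an assumption: the negative log ratio of an amino acid's frequency in linker regions to its frequency in domain regions, which makes lower values more linker-like. The toy training strings, the pseudocount, and the function names are all hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def compositional_index(linker_seqs, domain_seqs, pseudocount=1.0):
    """Per-amino-acid index, ASSUMED to be -ln(f_linker / f_domain)."""
    linker_counts = Counter("".join(linker_seqs))
    domain_counts = Counter("".join(domain_seqs))
    # The pseudocount keeps frequencies non-zero for unseen amino acids.
    linker_total = sum(linker_counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    domain_total = sum(domain_counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    index = {}
    for a in AMINO_ACIDS:
        f_linker = (linker_counts[a] + pseudocount) / linker_total
        f_domain = (domain_counts[a] + pseudocount) / domain_total
        index[a] = -math.log(f_linker / f_domain)
    return index

def windowed_average(sequence, index, window=5):
    """Average the index over a centered sliding window, truncated at the ends."""
    half = window // 2
    values = [index.get(a, 0.0) for a in sequence]
    averaged = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        averaged.append(sum(values[lo:hi]) / (hi - lo))
    return averaged

# Hypothetical toy training data, not taken from the DomCut dataset.
toy_linkers = ["GSPGSEEP", "PTSGSPE"]
toy_domains = ["AVLIVWFAKLV", "LVAIFWKHV"]
toy_index = compositional_index(toy_linkers, toy_domains)

sequence = "AERSLSGLTEEEAIAVHDQFKTTFSAFIILAAVAHVLVWVWKPWF"  # 1LGH:B from the slide
profile = windowed_average(sequence, toy_index, window=5)
print([round(v, 2) for v in profile[:10]])
```

Truncating the window at the sequence ends is one possible design choice; padding or skipping the terminal residues would work equally well for this illustration.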

Compositional index (Illustration)
A dynamic threshold is needed

Why Simulated Annealing (SA)?
A protein sequence is seen as a set of sequence chunks.
Each chunk has its own dynamic threshold value.
This is a search problem over a set of dynamic threshold values. In other terms: partitioning a given set of positive real numbers into k subsets (k is unknown) so as to maximize an objective function.
SA is known to be well suited to partitioning problems.
An intuitive customization is straightforward.

SA Customization
SA is a probabilistic search method for the global optimization of a given function in a large search space.
It is inspired by annealing, the heating and controlled cooling of a metal to increase the size of its crystals and reduce their defects.
It is able to avoid being trapped in local optima.
SA algorithms are usually better than greedy algorithms on problems that have numerous locally optimal solutions.

SA Optimization
Divide each protein sequence into segments. The segment size was set to the average linker size in the dataset.
Start from a random threshold value for each segment (starting 0.1).
Calculate the AA compositional index of the input protein sequence.
Classify each AA as linker or domain according to its compositional index value with respect to the corresponding segment threshold.
Calculate recall and precision.
Randomly increase or decrease the threshold value of a segment.
SA accepts or rejects the transition in order to maximize both the recall and precision of the linker segment prediction.
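The steps above can be sketched as a small Python routine. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective is taken to be recall + precision, a residue is called a linker when its averaged index falls below its segment's threshold, initial thresholds are drawn uniformly from [-0.1, 0.1], and the cooling schedule is geometric. All function and parameter names are hypothetical.

```python
import math
import random

def classify(profile, thresholds, segment_size):
    """Call residue i a linker (True) if its value is below its segment's threshold (assumed direction)."""
    return [v < thresholds[i // segment_size] for i, v in enumerate(profile)]

def recall_precision(predicted, actual):
    """Per-residue recall and precision of the linker class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

def score(profile, thresholds, segment_size, actual):
    """ASSUMED objective: recall + precision."""
    recall, precision = recall_precision(classify(profile, thresholds, segment_size), actual)
    return recall + precision

def anneal_thresholds(profile, actual, segment_size, steps=5000,
                      t_start=1.0, cooling=0.999, step_size=0.1, seed=0):
    rng = random.Random(seed)
    n_segments = math.ceil(len(profile) / segment_size)
    # Random starting threshold per segment (range is an assumption).
    current = [rng.uniform(-0.1, 0.1) for _ in range(n_segments)]
    current_score = score(profile, current, segment_size, actual)
    best, best_score = current[:], current_score
    temperature = t_start
    for _ in range(steps):
        # Randomly increase or decrease the threshold of one segment.
        candidate = current[:]
        k = rng.randrange(n_segments)
        candidate[k] += rng.choice((-1, 1)) * rng.uniform(0, step_size)
        candidate_score = score(profile, candidate, segment_size, actual)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current[:], current_score
        temperature *= cooling
    return best, best_score

# Hypothetical toy profile: low values mark the linker residues.
toy_profile = [0.5, 0.5, 0.2, 0.2, 0.5, -0.5, -0.5, -0.8, -0.8, -0.5]
toy_actual = [False, False, True, True, False, False, False, True, True, False]
thresholds, achieved = anneal_thresholds(toy_profile, toy_actual, segment_size=5)
print(thresholds, achieved)
```

Accepting occasional worse moves is what lets the search leave a locally optimal set of thresholds; the best set seen so far is tracked separately so it is never lost.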

Optimal threshold values for the XYNA_THENE protein sequence in the DomCut dataset, which contains 133 AAs

Evaluation Measures
Recall is the proportion of correctly predicted linkers to all of the structure-derived linkers listed in the dataset.

Precision is the proportion of correctly predicted linkers to all of the predicted linkers.
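Written as formulas, the two definitions read as follows. The symbols are introduced here for illustration and are not from the slides: TP is the number of correctly predicted linkers, FN the number of structure-derived linkers that were missed, and FP the number of predicted linkers that are not structure-derived linkers.

```latex
\mathrm{Recall} = \frac{TP}{TP + FN},
\qquad
\mathrm{Precision} = \frac{TP}{TP + FP}
```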

Experimental Results

Datasets

Experimental Results
Applying the proposed method on Dataset (1)

Experimental Results
Applying the proposed method on Dataset (2)

Conclusion
We examined the amino acid compositional index to predict protein inter-domain linker segments from amino acid sequence information.
We employed simulated annealing to improve the prediction by finding the optimal set of threshold values that separate domains from linker segments.
Experimental results show that the proposed method outperformed the currently available approaches for inter-domain linker prediction in terms of recall and precision.

Conclusion
This work can be extended by examining different sliding window sizes in computing the AA compositional index.
Additional SA parameter tuning and the use of dynamic segment sizes can be explored.
The compositional index can be combined with other features such as PSSM, AA physicochemical properties, and hydrophobicity.

Thank you
