Protein structure prediction


Chapter 21. The Role of Protein Structure Prediction in Drug Discovery
David T. Jones, Mark B. Swindells and Richard Fagan
Department of Biological Sciences, Brunel University, Uxbridge, Middlesex, UB8 3PH, U.K.
+ Inpharmatica Ltd., 60 Charlotte Street, London, W1T 2NU, U.K.

Introduction - As we move into the post-sequencing phase of many genome projects, attention is becoming increasingly focussed on the correct identification of gene products. Assigning a possible function to a gene is an important first step to characterising its role in the various cellular processes, and without this information it is impossible to realise the true value of genome sequencing. Straightforward sequence comparison algorithms are by far the most widely used techniques for making an initial identification of a particular gene product. The identification of common ancestry between a new gene product and a gene of known function allows some inferences to be made regarding the function of the new gene. How reliably the function can be extrapolated to the new gene depends on a number of factors, but the principal factor is the degree of sequence similarity observed.

New developments in sensitive sequence comparison, particularly recent extensions of the BLAST algorithm to sequence profiles (PSI-BLAST) and techniques based on Hidden Markov Models, have resulted in the routine detection of ever more remote homologous relationships (1,2). As more and more remote relationships are considered, it becomes less clear how reliably one can map the function of one gene to another (3,4). Nevertheless, sensitive sequence comparison algorithms are still the most vital technology that we have for rapidly characterising new gene products. The importance of this development should not be underestimated. Until the release of PSI-BLAST, the ability to generate results comparable with those it now routinely produces was restricted to specialist researchers, who used ad hoc combinations of database search and profile generation software together with judicious hand editing (to remove over-dominant sequence clusters and to know when to stop iterating in the absence of easily interpreted statistics).

Because PSI-BLAST is so easy to use and so universally popular, it is important here to also provide a word of caution (5). The algorithms underpinning PSI-BLAST are complicated and rely on a variety of assumptions. One of these assumptions is that the sequences being compared are both globular proteins, which is often not the case. This assumption becomes an issue when two proteins share only a low level of sequence similarity, as unrelated non-globular proteins could achieve scores similar to those between related globular



212 Section V - Topics in Biology, Allen, Ed.

possible structure of the protein encoded by the gene in question. In this way, 3-D structural information can also be used as part of the process for identifying distant homologues (i.e. those which are not readily identified by PSI-BLAST).

METHODS FOR STRUCTURE PREDICTION

Obtaining a protein structure experimentally can be a time-consuming process, whether it is done by X-ray crystallography or by NMR spectroscopy. Given the evident importance of 3-D structure in providing insights into the function and mechanism of proteins, it is reasonable to consider the applicability and reliability of available structure prediction techniques. Is there a role for protein structure prediction in structurally characterising a protein? Clearly, a satisfactory theoretical approach to accurately modelling the structure of many proteins would have a great impact on genomics as a whole. However, if the use of prediction algorithms is going to be generally accepted by the biology community at large, then it is essential that the reliability of these methods be assessed in such a way as to convince this rather sceptical audience. Although individual authors of automatic prediction methods do attempt to benchmark their methods properly and to provide useful measures of confidence alongside their predictions, there still remains the possibility that the published results are somewhat better than might be expected in cases where the true structure is not known. The Fourth Critical Assessment in Structure Prediction (CASP4) Experiment was carried out in 2000, along the same lines as the three previous experiments, and it continues to give some indication of the reliability of truly blind predictions made using different approaches. Detailed results from the experiment will be published in a special issue of the journal Proteins, along the same lines as for CASP3 (7). The raw data from the CASP4 evaluation are also available across the Internet (URL: http://predictioncenter.llnl.gov).

COMPARATIVE MODELLING

At present, the most accurate method for predicting protein structure is to make use of comparative modelling techniques to infer the structure of a target protein based on the structure of a related template protein. The reliability and simplicity of this class of method stem from the fact that it is limited to predicting the structure of proteins which are closely related to the template protein of known structure. The comparative modelling process can be divided into five basic steps: alignment of the target sequence with the sequence of a protein of known 3-D structure; building of a framework structure based on the alignment; loop building; addition and optimization of side chains; and finally model refinement.
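Because comparative modelling depends on the target being closely related to the template, a common first check after the alignment step is the percentage of identical aligned residues. The following is a minimal sketch of that check, not part of any specific modelling package. The function name `percent_identity` and the choice of denominator (aligned positions with no gap in either sequence) are assumptions made for illustration; other definitions of percent identity are also in use.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent identity over aligned positions that have no gap in either sequence.

    Both inputs are rows of a pairwise alignment and must have equal length.
    The denominator used here is an illustrative assumption.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences contribute a residue.
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)
```

For example, `percent_identity("ACDE-G", "ACDFHG")` compares five gap-free columns, four of which match.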



Chap. 21 Protein Structure Prediction, Jones et al. 213

PROTEIN FOLD RECOGNITION AND THREADING

In the absence of suitable homologous template structures with which to build a model for a given sequence, and given the slow progress that is evident in the ab initio prediction field, fold recognition algorithms provide another option for constructing useful tertiary structural models. 3-D structure information can be used as part of the process for identifying distant homologues (i.e. those below about 25% sequence identity), as one can assess the fit of a sequence to a protein structure using empirically derived probabilities or propensities (also known as energy potentials). Many of these methods are referred to as threading, of which the first was THREADER (9). This algorithm has been continuously developed and refined (e.g. 10), along with related approaches developed in other laboratories (11), but these methods still have the following advantages and disadvantages. Threading methods try to predict the fold of a protein in the absence of any sequence similarity (i.e. down to 0%) using a large library of folds as their database (fold means the approximate main chain trace of a protein structure). Their problem is that there exist no robust statistical assessments of significance, and therefore there will always be a top prediction even if the fold of the query sequence is not in the fold library. This situation is very similar to the problems with profile-based methods before the advent of PSI-BLAST. Threading is also CPU intensive and may take many hours on a relatively powerful machine to complete a database search.
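The idea of scoring the fit of a sequence to each fold in a library, and the reason a top prediction always exists, can be illustrated with a deliberately simplified sketch. This is not the THREADER algorithm: the two-state environment strings ('B' for buried, 'E' for exposed), the hydrophobic residue set, the energy values of -1 and +1, and the ungapped placement are all assumptions chosen to keep the example short.

```python
# Residues treated as hydrophobic in this toy model (an illustrative assumption).
HYDROPHOBIC = set("AVLIMFWC")


def residue_energy(aa, env):
    """Toy energy: hydrophobic residues prefer buried ('B') sites and all
    other residues prefer exposed ('E') sites. Lower is more favourable."""
    favourable = (aa in HYDROPHOBIC) == (env == "B")
    return -1.0 if favourable else 1.0


def thread_score(sequence, environments):
    """Lowest ungapped energy of `sequence` over all offsets on a template."""
    n, m = len(sequence), len(environments)
    if n > m:
        return float("inf")  # sequence does not fit on this template
    best = float("inf")
    for offset in range(m - n + 1):
        energy = sum(
            residue_energy(aa, environments[offset + i])
            for i, aa in enumerate(sequence)
        )
        best = min(best, energy)
    return best


def rank_folds(sequence, fold_library):
    """Return (energy, fold name) pairs sorted from best to worst.

    Some fold is always ranked first, whether or not the true fold of the
    query is present in the library."""
    return sorted(
        (thread_score(sequence, env), name) for name, env in fold_library.items()
    )
```

With a hypothetical library such as `{"foldA": "BBEEBBEE", "foldB": "EEEEEEEE"}`, every query receives a best-scoring fold. Without a significance estimate, the raw ranking cannot say whether that first hit is meaningful.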

In addition to the problem of distinguishing true from false positives, there is also the problem of alignment accuracy. From the results of the CASP experiments, the conclusion we reach is that in order to produce reasonably accurate 3-D models with fold recognition methods, there should be an evolutionary relationship between the target protein and at least one template of known 3-D structure. Despite the fact that the sample sizes in the CASP experiments are small (30-40 target domains), it would appear that where there is at least some detectable sequence similarity, fold recognition methods based on sequence profiles are presently sufficient to build useful models. Beyond these cases, however, fold recognition methods not reliant on sequence alignment (i.e. true threading methods which ignore the sequence of the template proteins) are much more limited in their ability to recognize folds and in the accuracy of the models they can produce. Nevertheless, even these relatively poor models may be enough to gain some insight into the function of a new gene sequence. Even fold recognition algorithms which are able to correctly recognize folds but are entirely incapable of producing sensible alignments may offer some advantage in the narrowing-down of potential gene functions.

It has become clear that these algorithms are now beginning to converge, with many different groups all heavily relying on sensitive sequence comparison in



FOLD RECOGNITION METHODS FOR GENOME ANALYSIS

When mining large amounts of sequence data, fold recognition methods can contribute significantly to target prioritization. Depending on one's strategy, one might for instance wish to concentrate on protein folds that are likely to be druggable or those for which one already has an assay available. This will immediately remove most of the sequences from consideration and leave a more compact list for subsequent validation.

Given the potential benefits of assigning a correct structure to a newly discovered gene product, it is unsurprising that several groups have applied existing fold recognition algorithms to genome analysis. These techniques can be classified into roughly 3 classes: sequence profile methods (1-2,14), structural (3-D-1-D) profile methods (15-16) and threading algorithms (17-18).

The first attempt at assigning folds to genome sequences made use of a structural profile method. Fischer & Eisenberg (19) used a development of the original 3-D-1-D profile method (15) to assign folds to the ORFs found in M. genitalium, the smallest known bacterial genome. They found that approximately 16% of the ORFs could be assigned to a known fold by means of straightforward sequence comparison, and that an additional 6% could be assigned to a known fold at high confidence using their fold recognition method. Of course, as the structure databases are now much larger, it is very likely that these fractions would now be somewhat higher.

Although many different threading (purely pair potential based fold recognition) methods have been developed, only a single attempt at applying these methods to genome analysis has been described (20). The ProFit method has been applied to analyse the ORFs in M. pneumoniae, a slightly larger genome than M. genitalium (18). In this work, to save time, proteins which could be matched to known structures by straightforward sequence comparison were excluded from the analysis, along with proteins longer than 200 residues (which were assumed to be multidomain proteins). Of the 124 ORFs remaining, Grandori was able to recognize folds for 12, giving a recognition rate of 10%. Interestingly, a number of disagreements were reported when the results were compared with the results from Fischer & Eisenberg's work (by identifying M. pneumoniae homologues in M. genitalium). This is not surprising given the relatively low overall reliability of pure fold recognition algorithms, but it is more surprising in that in some cases both predictions were apparently very significant.

Despite the fact that both approaches made use of basic pair-wise sequence comparison methods to detect obvious homologues to known structures, it is clear that better sequence comparison algorithms could have been applied, and



In another study, attempts were made to exploit this asymmetry in profile comparisons by means of a comparison algorithm based on the alignment of one profile with another (25, 26). This technique, BASIC, requires profiles to be computed for each sequence in the 3-D structure library and also for each ORF. These two sets of profiles are then compared by means of a local dynamic programming method.
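Comparing one profile with another by local dynamic programming can be sketched as follows. This is a generic Smith-Waterman style recursion and is not the scoring scheme used by BASIC: the column score (dot product of residue frequencies minus a `baseline`), the linear `gap` penalty, and the representation of a profile as a list of frequency dictionaries are all assumptions made for illustration.

```python
def column_score(col_a, col_b, baseline=0.2):
    """Similarity of two profile columns: dot product of residue frequencies
    minus an assumed baseline, so that dissimilar columns score below zero."""
    return sum(p * col_b.get(aa, 0.0) for aa, p in col_a.items()) - baseline


def local_profile_alignment_score(profile_a, profile_b, gap=0.5):
    """Best local alignment score between two profiles.

    Each profile is a list of columns; each column maps residue -> frequency.
    Only the score is returned (no traceback), using a linear gap penalty."""
    cols = len(profile_b)
    prev = [0.0] * (cols + 1)
    best = 0.0
    for i in range(1, len(profile_a) + 1):
        curr = [0.0] * (cols + 1)
        for j in range(1, cols + 1):
            diag = prev[j - 1] + column_score(profile_a[i - 1], profile_b[j - 1])
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0.0, diag, prev[j] - gap, curr[j - 1] - gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

In a genome-scale setting this score would be computed for every pairing of an ORF profile with a structure-library profile, which is why both sets of profiles have to be built in advance.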

As already mentioned, Jones has developed a hybrid method for assigning folds to genome sequences, called GenTHREADER (12), which was used successfully to assign folds to the genome of Mycoplasma genitalium. Analysis of the results showed that as many as 46% of the proteins derived from the predicted protein coding regions had a significant relationship to a protein of known structure.

Teichmann et al. have compared the results from several attempts at assigning folds to the M. genitalium (MG) genome (27). Being the smallest bacterial genome, MG provides a useful benchmark for different approaches to fold assignment, as most groups have made predictions for this genome.

Although a high degree of agreement was found between the different algorithms, some results were not found by all techniques. This suggests that to maximise success in assigning folds to genomes, some kind of consensus of algorithms might be useful. At present, this is difficult as there are no agreed standards for how structural annotations should be represented.

A number of Web resources are available which provide access to precompiled fold assignments for different subsets of genomes. These resources are predominantly based on PSI-BLAST comparisons, e.g. the GTOP database at the National Institute of Genetics in Japan. The database contains fold assignments for 26 completed genomes based on PSI-BLAST similarity searches, and can be accessed from the following URL: http://spock.genes.nig.ac.jp/~genome/summary.html
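The suggestion that a consensus of algorithms might be useful can be made concrete with a small sketch. It assumes that each method's output has already been mapped onto a shared set of fold labels, which is exactly the step that the lack of agreed annotation standards makes difficult. The function name `consensus_fold`, the `min_votes` threshold and the tie handling are hypothetical choices, not a published protocol.

```python
from collections import Counter


def consensus_fold(assignments, min_votes=2):
    """Majority-vote consensus over per-method fold assignments for one ORF.

    `assignments` maps a method name to a fold label, or to None when the
    method made no confident assignment. Returns the fold backed by at least
    `min_votes` methods, or None if there is no such fold or the top is tied.
    """
    votes = Counter(fold for fold in assignments.values() if fold is not None)
    if not votes:
        return None
    ranked = votes.most_common()
    top_fold, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if top_count < min_votes or tied:
        return None
    return top_fold
```

For example, if a sequence profile method and a threading method both report the same fold while a structural profile method reports a different one, the shared fold is returned. If every method disagrees, no consensus is reported.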

To date, the only available set of comprehensive fold assignments using fold recognition techniques are those derived by GenTHREADER, which have been compiled for 25 complete genomes (plus the currently confirmed gene products from the draft human genome) and have been stored in a Web-accessible database at the following URL:

