protein structure prediction: model building and quality ...196554/fulltext01.pdf · use of a...

Protein Structure Prediction: ModelBuilding and Quality Assessment

BJORN WALLNER

Stockholm Bioinformatics CenterDepartment of Biochemistry and Biophysics

Stockholm UniversitySweden

Stockholm 2005

Bjorn WallnerStockholm UniversityAlbaNova University CenterStockholm Bioinformatics CenterS-106 91 StockholmSwedene-mail: [email protected]: +46-8-55 37 85 66Fax: +46-8-55 37 82 14

Akademisk avhandling som med tillstand av Stockholms Universitet iStockholm framlagges till offentlig granskning for avlaggande av filosofiedoktorsexamen fredagen den 30 september 2005 kl 10.00 i Svedbergssalen,AlbaNova Universitetscentrum, Roslagstullsbacken 21, Stockholm.

Faculty opponent: Michael Levitt, Stanford University School of Medicine.

Doctoral Thesisc© Bjorn Wallner, 2005ISBN 91-7155-121-2

Printed by Universitetsservice US AB, Stockholm 2005

To my family

Protein Structure Prediction: Model Building and Quality AssessmentBjorn Wallner, [email protected] Bioinformatics CenterStockholm University, SE-106 91 Stockholm, Sweden

Abstract

Proteins play a crucial roll in all biological processes. The wide rangeof protein functions is made possible through the many different con-formations that the protein chain can adopt. The structure of a proteinis extremely important for its function, but to determine the structureof protein experimentally is both difficult and time consuming. In factwith the current methods it is not possible to study all the billionsof proteins in the world by experiments. Hence, for the vast majorityof proteins the only way to get structural information is through theuse of a method that predicts the structure of a protein based on theamino acid sequence.

This thesis focuses on improving the current protein structure predic-tion methods by combining different prediction approaches togetherwith machine-learning techniques. This work has resulted in some ofthe best automatic servers in world – Pcons and Pmodeller. As apart of the improvement of our automatic servers, I have also developedone of the best methods for predicting the quality of a protein model –ProQ. In addition, I have also developed methods to predict the localquality of a protein, based on the structure – ProQres and basedon evolutionary information – ProQprof. Finally, I have also per-formed the first large-scale benchmark of publicly available homologymodeling programs.

Contents

1 Publications 7

1.1 Papers included in this thesis . . . . . . . . . . . . . . 7

1.2 Other publications . . . . . . . . . . . . . . . . . . . . 7

2 Introduction 9

3 Protein Structure Prediction 10

3.1 De Novo . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2 Comparative modeling . . . . . . . . . . . . . . . . . . 13

3.2.1 Finding related structures . . . . . . . . . . . . 14

3.2.2 Aligning related structures . . . . . . . . . . . . 16

3.2.3 Model building . . . . . . . . . . . . . . . . . . 17

3.2.4 Model evaluation . . . . . . . . . . . . . . . . . 19

3.3 Meta prediction . . . . . . . . . . . . . . . . . . . . . . 20

4 Techniques and databases 21

4.1 SCOP – Structural Classification Of Proteins . . . . . . 22

4.2 Neural networks . . . . . . . . . . . . . . . . . . . . . . 22

4.2.1 Input data encoding . . . . . . . . . . . . . . . 23

4.2.2 Network architecture . . . . . . . . . . . . . . . 24

4.2.3 Neural network training . . . . . . . . . . . . . 27

4.2.4 Cross-validation . . . . . . . . . . . . . . . . . . 29

4.2.5 Other issues when using neural networks . . . . 30

4.2.6 The neural networks in this study . . . . . . . . 30

4.3 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 31

4.3.1 SCOP based . . . . . . . . . . . . . . . . . . . . 31

4.3.2 CASP . . . . . . . . . . . . . . . . . . . . . . . 32

5

4.3.3 CAFASP . . . . . . . . . . . . . . . . . . . . . . 34

4.3.4 LiveBench . . . . . . . . . . . . . . . . . . . . . 35

5 Present investigation 35

5.1 Development of a method that identifiescorrect protein models (Paper I) . . . . . . . . . . . . . 36

5.2 Pcons, ProQ and Pmodeller at CASP5 / CAFASP3(Paper II) . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.3 Detecting correct residues with ProQres and ProQprof(Paper III) . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.4 Pcons5: the fifth generation of Pcons(Paper IV) . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.5 Are all homology modeling programs equal? (Paper V) 42

6 Conclusions and reflections 44

7 Acknowledgments 45

6

1 Publications

1.1 Papers included in this thesis

Paper IBjorn Wallner and Arne Elofsson. Can correct protein models be iden-tified?Protein Science 12(5):1073-1086, 2003.

Paper IIBjorn Wallner, Huisheng Fang and Arne Elofsson. Automatic consen-sus based fold recognition using Pcons, ProQ and Pmodeller.Proteins, 53 Suppl 6:534-541, 2003.

Paper IIIBjorn Wallner and Arne Elofsson. Identification of correct regions inprotein models using structural, evolutionary and consensus informa-tion.Submitted, 2005.

Paper IVBjorn Wallner and Arne Elofsson. Pcons5: combining consensus, struc-tural evaluation and fold recognition scores. Submitted, 2005.

Paper VBjorn Wallner and Arne Elofsson. All are not equal: A benchmark ofdifferent homology modeling programs.Protein Science, 14(5):1315-1327, 2005.

Papers are reproduced with permission.

1.2 Other publications

Bjorn Wallner, Huisheng Fang, Tomas Ohlson, Johannes Frey-Skottand Arne Elofsson. Using evolutionary information for the query andtarget improves fold recognition.Proteins, 54(2):342-350, 2004.

Tomas Ohlson, Bjorn Wallner and Arne Elofsson. Profile-profile meth-ods provide improved fold-recognition. A study of different profile-

7

profile alignment methods.Proteins, 57(1):188-197, 2004.

Ann-Charlotte Berglund, Bjorn Wallner, Arne Elofsson and David A.Liberles. Tertiary Windowing to Detect Positive Diversifying Selec-tion.J.Mol.Evol., 60(4):499-504, 2005.

Huisheng Fang, Bjorn Wallner, Jesper Lundstrom, Christer von Wow-ern and Arne Elofsson. Improved fold recognition by using the Pconsconsensus approach.Chapter 16 in Protein Structure Prediction:Bioinformatic Approach,IUL, 2002.

8

2 Introduction

This work is about the prediction of the structure of proteins fromthe amino acid sequence. Proteins play a crucial roll in all biologicalprocesses and constitute a central part of the cell machinery. They arethe active components that catalyze various processes, and in that waythey control and direct chemical pathways underlying all processes ofthe living cell. They can act as transporters, transporting oxygen inthe blood or controlling the flow of ions across the cell membrane. Theyalso provide mechanical support for the cell and the whole body. Thiswide range of functions is made possible through the many differentconformations that proteins can adopt. Thus, the structure of a proteinis extremely important for its function.

Proteins are build from 20 different amino acids forming a chain ofvariable length. The different amino acids have different properties andwhen put together they interact forming a three-dimensional structure.The order of the amino acids also called the sequence uniquely definesthis structure (Anfinsen, 1973). To obtain the sequence of a protein istoday relatively easy, however to determine the structure of a proteinexperimentally using X-ray crystallography or NMR spectroscopy isboth difficult and very time consuming. The high-throughput methodsdeveloped in the last few years have increased the rate of structuresdetermined experimentally, still it is unreasonable to believe that thestructure of more than a tiny fraction of all the billions of proteins inthe world will be studied by experimental methods in the foreseeablefuture. Hence, for the vast majority of proteins the only way to getstructural information is through the use of methods that predict thestructure of a protein based on the amino acid sequence.

The structure of a protein depends solely on the amino acid sequence.This is property is extremely important when predicting the structurebased on the sequence, because similar sequences will have similarstructure. Also similar structures most often also have similar func-tion, which means that by comparing sequences of proteins it is possi-ble to say that two proteins not only have the same structure but alsothe same function. This forms the basis for homology based structureand function prediction, which today is routinely used to infer func-tion from sequences with known function to sequences with unknown

9

function. However, how to compare sequences, is by no way trivial andmany different methods have been developed throughout the years.

This thesis focuses on improving the current protein structure predic-tion methods by combining different prediction approaches togetherwith machine-learning techniques. In particular we have developedmethods to predict the quality of a predicted structure using neuralnetworks. These predictions have successfully been combined withconsensus analysis of predicted structures from existing methods.

In the following, the protein structure prediction problem will be de-scribed and different approaches how to solve the problem will be ad-dressed (section Protein structure prediction). Also some techniquesand databases relevant for the development (section Techniques anddatabases) and benchmarks (section Benchmarks) of our predictionmethods will be introduced. Finally, my contribution to the proteinstructure prediction field will be summarized (section Present investi-gation) and also presented in full (Papers I-V).

3 Protein Structure Prediction

The three-dimensional (3D) structure of a protein is guided by twodistinct sets of principles operating on vastly different time scales:the laws of physics and the theory of evolution (Figure 1). From aphysical point of view the protein molecule is a consequence of a va-riety of forces, such as entropy, chemical bonds, Coulomb and vander Waals interactions. Under native conditions these forces fold aprotein into a stable well-defined 3D structure in less then a second.From an evolutionary point of view the protein molecule is a result ofevolution over millions of years. Proteins evolve through duplication,speciation, horizontal transfer and mutations. During the course ofevolution a protein changes gradually because its function is usuallyretained, which requires the conservation of its structure and thereforealso its sequence (Chothia & Lesk, 1986).

These two viewpoints give rise to two different approaches to the pro-tein structure prediction problem (Baker & Sali, 2001): de novo andcomparative modeling. The first approach predicts the structure from

10

EvolutionComparative Modeling

PhysicsDe Novo prediction

EQLTKCEVFQKLKDLKD....

foldinghum

an

baboon

mouse

cow

Figure 1: Proteins obey two sets of principles, the laws of physics and the theory ofevolution. These two principles give rise to two different approaches to the proteinstructure prediction problem, de novo prediction and Comparative modeling.

the sequence alone, while the second approach relies on finding a sim-ilar sequence with known structure. Both these approaches will bedescribed in more detail below. Let me just first mention that anyprotein structure prediction method can be seen as an optimizationof a protein structure model with respect to a certain objective func-tion (Sanchez & Sali, 1997). De novo methods attempt to find themost likely structure of a protein sequence given the physical forcesbetween the atoms in the protein, while comparative modeling meth-ods attempt to find the most likely structure of a protein sequencegiven its relationship to known similar protein structures.

11

3.1 De Novo

De novo methods predict the structure from the sequence alone, with-out relying on similarity between the modeled sequence and any knownstructure. These methods assume that the native structure corre-sponds to the global free energy minimum of the protein and attemptto find this minimum by an exploration of many possible protein con-formations. Two key steps in de novo methods are the procedure forefficiently carrying out the conformational search and the free energyfunction used for evaluating possible conformations.

This class of methods was earlier called ab initio, because they pre-dicted the protein structure from scratch without prior knowledge.However, because most of the so called ab initio methods in fact usedprior knowledge, nowadays the name de novo is used to denote thiscategory. De novo, referring to predicting structures with new folds.

The most successful method in the de novo category is the Rosettamethod (Simons et al., 1999; Bonneau et al., 2001) developed by DavidBaker and co-workers. It uses 3 and 9 residue fragments of knownstructures with local sequences similar to the sequence to be modeled.It assembles these fragments into complete tertiary structures using aMonte Carlo simulated annealing procedure. The scoring function usedin the simulated annealing procedure consists of sequence-dependentterms representing hydrophobic burial and specific pair interactionssuch as electrostatics and disulfide bonding and sequence-independentterms representing hard sphere packing, α-helix and β-strand packing,and the collection of β-strands in β-sheets.

The de novo methods have improved in the last few years, in CASP1and CASP2 they showed little success, however in CASP3 some denovo methods outperformed fold recognition methods for certain pro-teins in the fold recognition category (Bonneau & Baker, 2001), (fora description of CASP see the Benchmark section) . At CASP4 thede novo prediction were more consistent and more accurate than atprevious CASPs (Bonneau et al., 2001). However, at CASP5 therewas no significant increase in performance compared to CASP4 (Ven-clovas et al., 2003). This is probably because the main improvementin CASP3 was the introduction of the fragment approach, which wasthen further optimized and led to an improvement also at CASP4. At

12

CASP5 there were no really new ideas in the de novo field and conse-quently also no performance increase. Thus, despite the initial progressmany issues must be resolved if a consistent and reliable de novo pre-diction scheme is to be developed. For instance no method performsconsistently across all classes of proteins, most methods perform worseon all-β proteins, and all methods fail completely on sequences longerthan 150 residues.

3.2 Comparative modeling

The aim of all comparative modeling methods is to build a 3D modelfor a protein of unknown structure (the target) on the basis of sequencesimilarity to proteins of known structures (the templates). This is pos-sible because small changes in the protein sequence usually result insmall changes in its 3D structure (Chothia & Lesk, 1986). Despite theprogress in the de novo prediction field (see above), the comparativemodeling approach is by far much more accurate than de novo predic-tions. The overall accuracy of comparative models spans a wide rangefrom low resolution models with only a correct fold to more accuratemodels comparable to medium resolution structures determined by X-ray crystallography or NMR spectroscopy. Even low resolution modelscan be useful in biology as some aspects of function can sometimes bepredicted only from the coarse structural features of a model.

Today (Mar-2005), it is possible to find structures with similar se-quences for one third of all proteins (Ekman et al., 2005). Because thenumber of known protein sequences is approximately 2,000,000 (Bairochet al., 2005) (05-Jul-2005), comparative modeling could be applied tomore than 650,000 proteins. This number should be compared to ap-proximately 30,000 protein structures determined experimentally (Ber-man et al., 2000). The usefulness of comparative modeling is alsosteadily increasing because the number of different structural folds thatproteins adopt is limited and since the number of experimentally de-termined new structures is increasing rapidly (Holm & Sander, 1996).This trend is also accentuated by the recently initiated structural ge-nomics project that aims at determining at least one structure for mostprotein families (Vitkup et al., 2001). It is conceivable that structuralgenomics will achieve its aim in 5-10 years, making comparative mod-

13

Evaluate the model

Identify related structures(templates)

Align target sequence totemplate structures

Build a model for thetarget sequence

using information from template structures

START

modelOK?

NO

YES

END -0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0 20 40 60 80 100 120 140

QU

ALIT

Y

RESIDUE POSITION

protein model X

.......K

QFTKC

ELSQ

NLYD

ID

GYGRIALPELICTMFH

TSGYDTQAIVENDEST

EYG

LFQ

ISN

KQFTKCELSQNLYF...

TEMPLATE STRUCTURETARGET SEQUENCE

ALIGNMENTTARGET ...SCDKLLDDELDDDIACAKKILAIKGID...TEMPLATE ...SCDKFLDDDITDDIMCAKKILDIKGID...

TARGETMODEL

QUALITYPROFILE

Figure 2: The different steps in comparative modeling.

eling applicable to most protein sequences.

Most comparative modeling methods consists of four steps (Figure 2):finding known structures related to the target sequence to be modeled(i.e. templates); aligning the sequence to the templates; building a 3Dmodel; assessing the model (Marti-Renom et al., 2000). All four stepscan of course be repeated until a satisfactory model is obtained. Inthe following each of these steps will be described in more detail.

3.2.1 Finding related structures

The starting point in comparative modeling is the identification ofprotein structures related to the target sequence (homologs), and theselection of a suitable template structure. The knowledge that twoor more proteins are homologous is, besides being the basis for pro-tein structure modeling, the most commonly used method to transfer

14

functional knowledge from one protein to another and it can also beused for evolutionary studies. Besides the importance to detect ho-mologous proteins, many bioinformatical methods and tools are basedon these relationships. For instance, the reliable detection of relatedproteins might actually improve secondary structure prediction (Rost& Sander, 1993) and other prediction methods. This is also the rea-son why the identification of related sequences has been subjected tointensive research during last 10 years, resulting in a wide range ofdifferent methods and different approaches to the problem. Many ofthese methods are available as web servers on the Internet. An ex-tremely useful site is the Meta-server at http://bioinfo.pl/meta, whereit is possible to submit a sequence to many different servers using asimple interface.

The basic approach to find relationships between proteins is to comparetheir sequences. This can be performed in many different ways, rang-ing from simple pairwise comparisons to more elaborated comparisonsusing evolutionary information for either the target or the template orfor both. The output from all of these approaches is the same: a listof which amino acids in the two sequences that are equivalent, this listis also called the alignment. It has been demonstrated that methodsusing multiple sequences, i.e. evolutionary information, are superiorto methods that only use single sequences (Lindahl & Elofsson, 2000)and more recently that methods that use evolutionary information forboth the target and template sequences are even more efficient (Wall-ner et al., 2004). Probably because the description of which part of thesequences that are conserved and variable throughout the evolution arebetter described. However, simpler methods might have computationaladvantages.

The most commonly used program to find related structures is psi-blast(Altschul et al., 1997). This program uses evolutionary informationfor the target sequence by using the initial database hits to refine thesearch to find more homologs in an iterative way. For a given sequencethe first database hits are used to build a multiple sequence alignment,a position-specific scoring matrix is constructed from the alignment,and this matrix is then used to search the database for new homologs.These new homologs can then be used to construct a new matrix anddo the search again until no more homologs are detected. psi-blast

15

is quite fast making it possible to search a database with 2,000,000sequences in just a few minutes.

In most cases psi-blast detects a sufficient number of homologs, how-ever there are slower methods that are significantly better, especiallyfor the really hard targets. These methods use evolutionary informa-tion for both the target and template sequences. The evolutionaryinformation is encoded in a profile, which is constructed from a multi-ple sequence alignment (for instance using psi-blast). A profile is aset of vectors, where each vector contains the frequency of each typeof amino acid in a particular position of the multiple sequence align-ment. In this way it is possible to detect which parts of a sequence thatare conserved and variable. Methods using evolutionary informationfor both the target and template compare two profiles and are there-fore often called profile-profile methods (Rychlewski et al., 2000; vonOhsen & Zimmer, 2001; Yona & Levitt, 2000; Sadreyev & Grishin,2003; Ohlson et al., 2004). Profile-profile comparisons can be donein several different ways, including calculating the sum of pairs, thedot-product, or a correlation coefficient between the two vectors.

Once a list of related structures is obtained, the templates that are ap-propriate for the given modeling problem should be selected. Usually,a higher overall sequence similarity between the target and templatesequence yields a better template. However, there are also other fac-tors that should be taken into account in the final template selection,such as ligands, quaternary interactions, resolution and the R-factor ofa crystal structure. Ultimately the criteria for template selection de-pend on the purpose of the comparative model. For instance, to modela protein-ligand, selecting a template with similar ligand is probablymore important than the resolution of the template, while the modelingof an active site of an enzyme might require a high resolution template.In many cases, many different templates are used and the best modelsare selected using other criteria (see Model evaluation section).

3.2.2 Aligning related structures

Most methods for finding related structures also produce an alignmentbetween the target sequence and the sequence for the template struc-ture. For closely related protein sequences with identities over 40%,

16

the alignment is most often close to optimal. As the sequence similar-ity decreases the alignment becomes more difficult and will contain anincreasingly large number of gaps and alignment errors (Rost, 1999;Marti-Renom et al., 2000; Elofsson, 2002). It is extremely importantthat the alignment is correct since no modeling program can recoverfrom an incorrect alignment. Since the search methods are optimizedfor detection of remote relationships and not for optimal alignments,the alignment from the database search is often not the optimal target-template alignment for comparative modeling. Therefore, a specializedmethod could be used to align the target sequence with the templatestructures (Elofsson, 2002).

In difficult alignment cases evolutionary information together withstructure information of alternative templates should be used. Thestructural information could help avoid putting gaps in secondarystructure elements, in buried regions or between residues that are farin space. Secondary structure prediction for the target sequence couldalso be used to obtain a more accurate alignment (Elofsson, 2002).However, in many cases the evaluation of a model is more reliablethan an evaluation of an alignment. Thus, a good strategy is to build3D models for all alternative alignments, evaluate the correspondingmodels and select the best model according to the 3D model evaluationinstead of the alignment score (Jones, 1999a) (Paper I).

3.2.3 Model building

Given a target-template alignment there exists a variety of methodsthat can build a 3D model for the target protein. In principle they canbe grouped into three different groups: rigid-body assembly (Browneet al., 1969; Blundell et al., 1987; Greer, 1990; Koehl & Delarue, 1994a;Koehl & Delarue, 1995; Bates et al., 2001; Petrey et al., 2003; Schwedeet al., 2004), segment matching (Jones & Thirup, 1986; Claessens et al.,1989; Unger et al., 1989; Levitt, 1992) and modeling by satisfaction ofspatial restraints (Havel & Snow, 1991; Sali & Blundell, 1993; Srini-vasan et al., 1993; Aszodi & Taylor, 1996). The accuracy for the differ-ent modeling approaches is similar when used optimally (Marti-Renomet al., 2000). Other factors like template selection and alignment ac-curacy usually have a larger impact on model quality than the choice

17

of modeling method. Still there are differences between the modelsproduced by the different methods (see Paper V).

Modeling by rigid-body assembly. The first modeling programswere based on rigid-body assembly methods. This is still a widelyused approach, e.g. in the following programs: swiss-model (Schwedeet al., 2004), nest (Petrey et al., 2003), 3d-jigsaw (Bates et al., 2001)and Builder (Koehl & Delarue, 1994a; Koehl & Delarue, 1995). Themodel is assembled from a small number of rigid bodies obtained fromthe core of the aligned regions. The assembly starts with building aframework from the coordinates of the aligned Cα atoms. The assemblyinvolves fitting the rigid bodies onto the framework and rebuilding thenon-conserved parts, i.e. loops and side-chains. The main differencebetween the different rigid-body assembly programs lies in how side-chains and loops are built. nest uses a stepwise approach, changingone evolutionary event from the template at a time, while 3d-jigsawand Builder use mean-field minimization methods (Koehl & Delarue,1996).

Modeling by segment matching. The basis for modeling by seg-ment matching is the fact that most hexamers (oligopeptides of sixamino acid residues) occurring in protein structures can be clusteredinto 100 structural classes (Unger et al., 1989). This means that a3D model can be constructed by using a subset of atomic positions,derived from the target-template alignment as a guide to find match-ing segments in a representative database of all known protein struc-tures (Jones & Thirup, 1986; Claessens et al., 1989; Levitt, 1992). Thisapproach can construct both main-chain and side-chain atoms and canalso model insertions and deletions. This idea is implemented in theprogram SegMod (Levitt, 1992).

Modeling by satisfaction of spatial restraints. The methodsusing modeling by spatial restraints derives restraints from the target-template alignment. This is usually done by assuming that the corre-sponding distances and angles between aligned residues in the templateand the target structure are similar. These restraints are then sup-plemented by stereochemical restraints on bond lengths, bond angles,dihedral angels and non-bonded atom-atom contacts using a molecularmechanics force field. The model is then obtained by minimizing theviolations of all restraints. One of the most frequently used modeling

18

programs, Modeller (Sali & Blundell, 1993), uses this approach. Afurther benefit of this approach, is that it is easy to add restraintsderived from a number of different sources to the restraints derivedfrom the alignment. For instance, predicted secondary structure or re-straints obtained from NMR experiments or fluorescence spectroscopy.In this way, the 3D models could be improved by making use of avail-able experimental data.

3.2.4 Model evaluation

The quality of the predicted models determines the information thatcan be extracted from it. Thus, it is important to estimate the accu-racy of a model before it is used. The model can be evaluated as awhole or in individual regions. There are many different kinds of er-rors that can occur in a protein model, including mistakes in side-chainpacking, relatively small shifts and distortions in correctly aligned re-gions, errors in the unaligned regions (often loops), alignment errorsand fold assignment mistakes.

The first step in the evaluation process is usually to assess if the pro-tein has the correct fold. This will be true if the correct template ischosen and if the target-template alignment is approximately correct.The score from the database search is usually a good measure to de-cide whether the correct fold has been found or not, conserved keyfunctional or structural residues in the target sequence could also beused.

Once the fold has been established a more detailed analysis of themodel quality can be obtained by the similarity between the targetand template sequence. The sequence identity between the target andtemplate sequence is a fairly good measure of model quality if theidentity is above 30%. If the sequence identity is below 30% it becomesunreliable as a quality measure and in these cases model evaluationmethods are especially useful.

A basic requirement for a model is to have good stereochemistry. Thisis usually not a problem, not even for totally incorrect model, sincemost modeling programs use a force field to enforce proper stereo-chemistry.

19

Distribution of structural features from known protein structures canalso be used. Large deviations from the most likely values are usuallya strong indicator of errors in the model. Such features include pack-ing (Gregoret & Cohen, 1991), formation of a hydrophobic core (Bryant& Amzel, 1987), residue and atomic solvent accessibility (Koehl & De-larue, 1994b), spatial distribution of charged groups (Bryant & Lawrence,1991), distribution of atom-atom contacts (Colovos & Yeates, 1993)and main-chain hydrogen bonding (Laskowski et al., 1993). There arealso methods that combine the different criteria listed above. Pro-saII (Sippl, 1993) uses the probability to detect two residues at a spe-cific distance from each other. Verify3d (Bowie et al., 1991; Luthyet al., 1992) uses 3D profiles derived from the local environment aroundeach residue.

There has been some progress in the more physics-based approachesto model evaluation (Lazaridis & Karplus, 1999; Lazaridis & Karplus,2000). It has recently been shown that using the generalized Bornsolvation model (Still et al., 1990) as a description of solvent effects,physical energy functions can be used to identify the native-like con-formation (Felts et al., 2002; Dominy & Brooks, 2002). However, formodel evaluation purposes they have not reached the same accuracyas the methods based on statistics on known protein structures.

3.3 Meta prediction

The increasing variety of protein structure prediction methods has ledto a series of different meta-predictors or consensus predictors. Thesemethods does not make new predictions, instead they combine informa-tion from different methods in order to increase performance. The basicidea is that if several independent methods produce similar models, itis a strong indicator that the models are correct. The first consensuspredictor was Pcons (Lundstrom et al., 2001), since then a number ofother predictors has been developed, e.g. 3d-shotgun (Fischer, 2003)and 3d-jury (Ginalski et al., 2003). In a way the meta-predictors tryto mimic the work of an expert, that looks at many different alternativemodels before making the final selection, see Figure 3. By using thisapproach around 10% more correct predictions are generated, but per-haps even more important is the increased specificity. Meta-prediction

20

FFAS03

ORFEUS

PDB-BLAST

RANK1 RANK2 RANK3

EASY

FFAS03

ORFEUS

PDB-BLAST

RANK1 RANK2 RANK3

HARD

protein expert

Figure 3: Meta-predictors mimic the work of an expert that looks at many alter-native models before making the final selection. Here, two examples are illustrated,one easy and one hard, in both cases the meta-predictor makes the correct choice. Inthe easy case it does not really matter which model that is selected since all of themare more or less correct. In the hard case, on the other hand, most of the selectionswill be incorrect. It is also here the meta-servers have proven to be most useful.

is also an excellent way to evaluate different alternative models andin combination with structural evaluation the performance can be in-creased even further (Paper I and II).

4 Techniques and databases

In this section, the main techniques and resources that were used inthe actual development of the predictors in Paper I-IV will be de-scribed. This involve a description of the protein classification database– SCOP, a description of neural networks as well as a description ofdifferent types of benchmarks that can be used to assess the results.

21

4.1 SCOP – Structural Classification Of Proteins

The aim of the SCOP database (Murzin et al., 1995) is to provide adetailed and comprehensive description of the structural and evolu-tionary relationships between all proteins in the Protein Data Bank(PDB). The quality of the database is very high since the classifica-tion is constructed manually by visual inspection and comparison ofstructures, but with the assistance of tools to make the task manage-able and help provide generality. On the negative side, it is not sofrequently updated.

Each protein is classified on four hierarchical levels, family, superfam-ily, fold and class. Proteins within a family have a clear commonevolutionary relationship. They either have residue identities of 30%and greater, or lower sequence identity but very similar structure andfunction. That two families share the same superfamily means theproteins based on their structure and functional features most likelyhave a common evolutionary origin. Proteins with the same fold havethe same major secondary structures in the same arrangement andwith the same topological connections. Different folds are groupedinto seven pre-defined classes like, all-α-helices, all-β-sheet or α andβ.

The latest release of the database (1.69 release, July 2005) contains70 859 protein domains from 25 973 PDB entries grouped into 2 845families, 1 539 superfamilies and 945 folds.

4.2 Neural networks

Here, the machine learning technique will be described that was usedin the development of the predictors in Paper I-IV. Neural networksis a standard technique used in pattern recognition and classificationthat has been used for many years in a wide variety of applications,ranging from investment analysis to engine management and medicalphenomena. Neural networks are inspired by how the brain is built upby interconnected neurons, even though the number of neurons andconnections are much fewer than in the case of the brain. The mostcommon type of neural network is the feed-forward network with sig-

22

modal nodes and error back-propagation for training. The feed-forwardnetwork provides a general framework for representing non-linear func-tional mappings from a set of input variables to a set of output vari-ables. In biology, neural networks have been applied successfully tothe prediction of secondary structure (Qian & Sejnowski, 1988; Rostet al., 1994; Jones, 1999b), subcellular localization (Emanuelsson et al.,1999), the recognition of post translational modifications (Blom et al.,1999) and many more. Below the feed-forward networks will be in-troduced in more detail, however for a more thorough introductionsee (Bishop, 1995).

4.2.1 Input data encoding

A very important issue when using neural networks is how to encodethe input data. The input data should capture general features andshould also be normalized allowing for comparison between differentexamples. In our case we have performed prediction on protein struc-ture models. The models consist in principle of a list of 3D coordi-nates for all atoms in the protein structure. There are numerous waysto translate a list of coordinates for atoms (with different properties)to numbers that the neural networks can use. On top of this, sinceour goal is to decide the quality of a protein model, there will alwaysbe a question of what to consider as correct and what to consider asincorrect. A number of protein quality measures have been developedthroughout the years, for a review see (Cristobal et al., 2001). All ofthem compare the model to the correct structure and if the model andthe correct structure are similar the quality should be high. This is alsotrue for most measures, when the model and the correct structure aresimilar enough. However, most disagreements occur when the modeland the correct structure are not so similar, but still contain simi-lar “features”. However, large-scale benchmarks like LiveBench (Bu-jnicki et al., 2001a; Bujnicki et al., 2001b; Rychlewski et al., 2003)and CASP (Moult et al., 2001) (see Benchmark section) have pushedthe development of measures to a set that most researches in the fieldcan agree on. The focus of this thesis has not been the developmentprotein quality measures. Instead we have used established measuressuch MaxSub (Siew et al., 2000) and LGscore (Cristobal et al., 2001)

23

ARGTRP TYR

TYR

LYSresidue-residue

contacts

atom-atomcontacts

surface accessibility

correspondencewith predicted

secondary structurePSI-PRED

P(helix)=0.73P(sheet)=0.20P(coil)=0.07 LYS TYR ARG

LYS 0.10 0.07 0.03ARG 0.07 0.05 0.01TYR 0.03 0.01 0.02

C N O C 0.06 0.08 0.05 N 0.08 0.05 0.10 O 0.05 0.10 0.02

<25% >75%LYS 0.85 0.07ARG 0.07 0.70 TYR 0.65 0.15

exposure

Figure 4: Input parameters. The protein model is described by structural featuresthat can be calculated based on the positions of the atoms. These features includeatom-atom contacts, residue-residue contacts, surface accessibility

and focused on the encoding of the input parameters.

As stated above the encoding of the input parameters can be donemany different ways. We have focused on simple features such as con-tacts between different types of atoms or residues and where differenttypes of amino acids are situated in the protein, on the surface or inthe interior (Figure 4). For more details see Paper I and III.

4.2.2 Network architecture

All neural networks consist of nodes (neurons) that are connected toeach other through synapses with different weights (Figure 5). Theartificial neuron works in similar way as a neuron in the brain, it cal-culates a weighted sum from the inputs connected to it and outputsa value, zj, which depends on the sum. The different weights on the

24

y1

y2

ym

yk Σ

w1j

w2j

wmj

wkj

Node j

aj g(aj) zj

Figure 5: A neuron.

synapses, wij, correspond to the fact that also biological synapses havedifferent impact on the post-synaptic neuron (excitatory and inhibitorysynapses). The weighted sum is processed through an activation func-tion, g(v), before output. This function can be of many different typese.g. linear, threshold or sigmodal. The logistic function belonging tothe sigmodal type is perhaps the most common, it is defined as:

g(aj) = zj =1

1 + exp (−aj)(1)

where aj is the weighted sum

aj =∑

i

wijyi (2)

Note that the activation function lies in the range (0,1) and is bothcontinuous and differentiable. All these properties are useful in thederivation of the training algorithm. For the prediction of real values(in contrast to classification) it is quite common to use a linear acti-vation function at the output node, not to limit the range of possibleoutputs.

The neurons in a feed-forward network are organized in layers, thereare three types of layers: input, hidden and output (Figure 6). The

25

Input layerIn

put d

ata

Synapses

Ouput layerHidden layer

Output

Figure 6: A feed-forward neural network. The input data is presented to the inputnodes in the input layer, processed through the synapses and the hidden nodes inthe hidden layer to the output node in the output layer.

network input is presented to the input neuron layer, processed throughthe synapses and the activation function at the hidden layer before itreaches the output neuron(s). There are no feedback loops or connec-tions between neurons at the same level, there is only one directionof the data flow and this is also the reason why the network is calledfeed-forward. Further, this also ensures that the network outputs canbe calculated as explicit functions of the inputs and the weights. Un-fortunately, it is not possible to decide which network architecture thatis optimal for a given problem. Thus, to achieve optimal performancemany different networks with different architectures need to be trainedand optimized.

26

4.2.3 Neural network training

The training of a neural network involves an iterative procedure ofminimization of an error function, with adjustments of the weightsbeing made in a sequence of steps. The first of these steps is to initializeall weights to random values. Then, the input data of the trainingexamples are presented at the input nodes and feed forward in thenetwork to the output node, where the result is compared to the desiredoutput using an error function. Based on the difference between theactual and desired output the weights are adjusted to capture the mainfeatures in the data. The algorithm for adjusting the weights is calledthe error back-propagation algorithm (Rumelhart et al., 1986), thename comes from the fact that the errors are propagated backwards inthe network from the output enabling adjustment of the weights alsoat the input nodes.

The error signal, ej, at node j is

ej = zj − dj (3)

where dj is the desired and zj the actual output at the node. Thereexists a wide range of error functions, E, such as sum-of-squares, cityblocks and cross-entropy error function. The simplest and most com-monly used is the sum-of-squares defined as

E =1

2

∑j

e2j (4)

where j runs over all output nodes. The error function is a function ofall weights in the network, the idea is to impose a change ∆wij to theweights that is proportional and opposite to the gradient of the errorfunction:

∆wij = −η∂E

∂wij(5)

where η is the learning rate, monitoring how fast the weight adjustmentshould be. Using the chain rule

∂E

∂wij=

∂E

∂aj

∂aj

∂wij= δj

∂aj

∂wij(6)

27

where δj denotes the local gradient associated with neuron j, fromequation 2 we get

∂aj

∂wij= yi {in eq : 5} =⇒ ∆wij = −ηδjyi (7)

This function states that in order to get the weight adjustment for wij

(weight connecting node i with node j) δj should be multiplied by yi

(the value at the input end of the weight).

For the output nodes the evaluation of δj is straightforward

δoutputj = g′(aj)

∂E

∂zj= g′(aj)ej (8)

and if j is a neuron in the hidden layer

δhiddenj = g′(aj)

∑k

δkwkj (9)

where k runs over all output nodes.

A presentation of all the input patterns once is called a training cycle.The weight adjustment is either done after presentation of each pattern(on-line learning) or after first summing the derivatives over all thepatterns in the training (batch learning).

The weights are updated in order to give minimal errors when the inputtraining data is presented to the network. However, in order for thenetwork to work on real problems all different types of examples mustbe present in the training set, e.g. in the case of structure predictionboth good and bad models needs to be present in the training data, notonly good models or only native structures. The ultimate aim of theneural network training is to capture general features in the data thatcan be used to do predictions on unseen data, i.e. networks should beable to generalize. To achieve this goal it is important to not over-trainthe network. The over-training occurs when the network is trained tosuch extent that it performs very well on the training data, but ratherpoorly on a new test set not part of the training data, see Figure 7.To avoid over-training the training must be stopped at an appropriate

28

0

5000

10000

15000

20000

0 100 200 300 400 500 600 700 800 900 1000

erro

r

training cycle

test set error

training set error

over-trainingstarts here

Figure 7: Over-training. The error on the training set continues to decrease over all1,000 cycles, while the error on the test set starts to increase from 250 cycles. Thismeans that the generalization ability of the network deteriorates.

training cycle before it occurs. This is usually done by monitoring theerror on the test set during training and stop the training before theerror on the test starts to increase. If possible a separate test set shouldthen be used to assess the true generalization ability of the network.

4.2.4 Cross-validation

To check the performance and generalization ability of the network,cross-validation is usually performed. In cross-validation the totalamount of training data is split into n parts (typically n ∈ [5, 10]),the parts should not be too similar, thus two similar examples shouldbe in the same part. The training is then performed on n − 1 partsand the last part is used as a test set to check the performance. Thisis repeated n times using a different part as test set each time to mini-

29

mize the randomness. The cross-validation process allows efficient useof all training examples, while it at the same time gives some confi-dence that the neural network is able to generalize, provided that theperformances of the different cross-validation sets are similar.

4.2.5 Other issues when using neural networks

Neural networks is a powerful technique to provide a non-linear func-tional mapping from a set of input variables to a set of output vari-ables, which is illustrated by the large number of successful applica-tions. However, there are in particular two different kinds of problemsthat have to be dealt with when using neural networks. First, a neuralnetwork consists of a large number of parameters (sometimes on theorder of several thousands) and in order not to over-fit the parameterson the training data, the number of training examples should be of thesame order of magnitude as the number of parameters. Thus, sufficientamount of data is needed. Second, even tough the network architec-ture and parameter updating are well defined in a mathematical way,there is no obvious way of extracting knowledge about what featuresthat are important for the neural network prediction. This makes ithard to understand which features that are important and it also lim-its the biological usefulness of neural networks in the sense of findingexactly which structural features that are important for good models.However, it still useful for finding which features that are important ina more general sense and also for finding if certain features are orthog-onal to each other, i.e. if the prediction ability increases when differenttypes of features are used in training.

4.2.6 The neural networks in this study

In this study we have used Matlab and the neural network packagenetlab (Bishop, 1995) to perform all network training. In all casesfeed-forward neural networks were used together with five-fold cross-validation and the error back-propagation algorithm for training, lo-gistic hidden nodes and linear output node. Only one layer of hiddennodes was used. The most important part of neural network train-ing is the input data. The input data should capture general features

30

and should also allow for comparisons different examples. We haveused features that can be calculated from a protein structure mod-els, i.e. based on the 3D coordinates of all the different atoms in theprotein (Figure 4). We have also performed dimensionality reductionby classifying the 20 amino acids into six different groups (Mirny &Shakhnovich, 1999).

4.3 Benchmarks

Once a predictor has been developed it is important to benchmark itsperformance against other methods. It is extremely important thatthe data used to benchmark is not used for optimizing the parametersin the predictor. In this section benchmark techniques will be de-scribed. This involves a description of benchmarks based on the struc-tural database SCOP as well as a description of CASP and CAFASP,and finally the continuous benchmark done through the LiveBenchproject.

4.3.1 SCOP based

Benchmarks based on SCOP is used frequently for the evaluation ofthe recognition of correct structures based on sequence similarity. Asdescribed above SCOP is a hierarchical database, with three mainlevels fold, superfamily and family. Two structures related on thefamily level are more closely related (more similar) than two structuresrelated on the fold level. This also means that two sequences related onthe family level are more easily recognized than two sequences relatedon the fold level.

The first step in this type of benchmark is to do an all-against-all SCOPsearch with the method to benchmark. For each sequence all hits arecollected and the result is presented typically in specificity-sensitivityplots, where the specificity and sensitivity are plotted against eachother for all possible cutoffs. The specificity is defined as:

spec =TP

TP + FP(10)

31

where TP is the number of True Positives and FP the number of FalsePositives, i.e. the specificity is the fraction of positive hits defined bythe method that also are correct. The sensitivity is defined as:

sens =TP

TP + FN(11)

where TP is the number of True Positives and FN the number of FalseNegatives, i.e. the sensitivity is the fraction of all possible positive hitsthat are found. Since the specificity and sensitivity are two competingproperties, i.e. it is always possible to find all TP (100% sensitivity)but the specificity might be low (high number of FP ).

An example of a specificity-sensitivity plot is seen in Figure 8. Thecloser the method is to the upper right hand corner (1,1) the better.The figure contains three sets of curves, one for each level of the SCOPdatabase. For the family curves all hits are used, for the superfamilycurves hits to the same family are ignored and for the fold curves hits tothe same superfamily and family are ignored. In this way the detectionability for different difficult levels can be assessed. As expected, it isalso clearly seen, that detecting sequences related on the fold level ismuch more difficult than detecting sequence related on the family level.

4.3.2 CASP

CASP (Critical Assessment of Structure Prediction) is a large-scalecommunity experiment, conducted every two years since 1994 (Moultet al., 2003). The key feature is that the participants make blindpredictions of structures. Since the predictions are blind no one cancheat (hopefully) and also methods cannot be optimized to do good inCASP for the same reason.

The participants get the sequence for protein structures that are aboutto be determined by X-ray crystallography and NMR spectroscopy.The experiment is conducted over the summer and the result is col-lected and the most successful groups present their work at a confer-ence in December the same year. Since the structures are not knownbefore the predictions are made the predictions are totally blind. Theexperiment has been a success in the predictor community, each time

32

Figure 8: Example of a specificity-sensitivity plot from (Wallner et al., 2004).

attracting more and more participants, generating more and more pre-dictions (Table 1). In contrast the target collection relying on thegenerosity of experimentalists has been fairly constant. Even thoughthe number of participating groups has increased over the years thelargest increase is in the number of submitted models. This is mostlydue to the development of automatic methods that routinely submitspredictions for every target (see CAFASP).

Predictors fall into two categories: teams of participants who devoteconsiderable time and effort to modeling each target, usually having aperiod of several weeks to complete their work; and automatic servers,which must return a model within 48 hours, without human inter-vention. The predictions are evaluated using numerical criteria andmanual independent assessors.

There has been progress in the protein structure prediction field overthe CASP years, for a recent review see (Moult, 2005). A couple of

33

CASP number Year No. targets No. predictors Total no. predictions

CASP1 1994 33 35 135CASP2 1996 42 72 947CASP3 1998 43 98 3,807CASP4 2000 43 163 11,136CASP5 2002 67 215 28,728CASP6 2004 64 208 41,283

Table 1: The CASP experiments in numbers. Note the rapid increase in the totalnumber of predictions.

highlights include the successful use of PSI-BLAST (Altschul et al.,1997) in CASP3, the performance of Baker in the ab initio predictionin CASP4 (Bonneau et al., 2001) and the meta-servers (Paper II) inCASP5. However one major bottleneck is still the refinement problem,i.e. improving the models over a simple backbone copy of the template.At CASP6, for the first time, there was a report of an initial model ofa small protein (target 281) refined from a backbone RMSD of about2.2 A to about 1.6 A, with many of the core side-chains correctlyoriented. Still this is for a small protein and it remains to be seen if itcan be scaled to larger structure in the future.

4.3.3 CAFASP

CAFASP (Critical Assessment of Fully Automated Structure Predic-tion) is the automatic counterpart of CASP (Fischer et al., 2003). Theprediction in CAFASP follows the same scheme as CASP and is alsorun in parallel, with one important different - no human interventionis allowed. The goal is to evaluate the performance of fully automaticstructure prediction servers. As the predictions, the assessment is alsofully automated allowing fast evaluation of many servers. Anotherimportant difference from CASP is that all predictions have a score(from the server). This makes it possible to assess the reliability of thedifferent methods.

34

4.3.4 LiveBench

The LiveBench project is a continuous benchmarking program (Bu-jnicki et al., 2001a; Bujnicki et al., 2001b; Rychlewski et al., 2003).The basic scheme is as follows, every week new PDB proteins are sub-mitted to participating proteins structure prediction servers and theresults are collected and evaluated using automated model assessmentprograms. This provides a potential user with an excellent tool forevaluating the confidence of obtained models. And from the serverperspective, the server gets an independent unbiased benchmark.

The main advantages of LiveBench compared to CASP are the fastevaluation cycle and the larger number of targets. The assessmentis conducted a week after releasing the targets. Each week providesaround 5 new targets, which can be used to check differences in theperformance of the servers.

The LiveBench project has been run for almost six years with a steadilyincreasing number of participating servers with the exception of set 9(Figure 9). However, it should also be stated that many of the “new”servers are new versions of an existing server with the previous versionkept as reference.

5 Present investigation

The work presented in this thesis aims at the development of methodsthat improve the current protein structure prediction methods. In thefollowing, the methods and the results from the five papers includedin this thesis (Paper I-V) will be summarized. The presentation willfollow closely the development of methods in relation to the bi-annualCASP meetings (see above).

It has been clear since the first CASPs that human experts do bet-ter than automatic methods. However at CASP4 the best serversstarted to compete with the best humans. Only 11 human groupswere ranked before the best server (3D-PSSM). Furthermore, at rank7 was the semi-automated method named CAFASP-CONSENSUS thatused the results for other servers to do the prediction (Fischer et al.,

35

0

10

20

30

40

50

0 1 2 3 4 5 6 7 8 9 10

No. S

erve

r

LiveBench set

Number of participating servers in LiveBench

Figure 9: The number of participating servers in the LiveBench project.

2001). The CAFASP-CONSENSUS idea was then fully automatedin Pcons (Lundstrom et al., 2001), which was the first consensus ormeta predictor. By using the result from six different servers, Pconsgenerated approximately 8%-10% more correct predictions with a sig-nificantly higher specificity. Thus, at that time Pcons was the bestautomatic method for doing protein structure prediction. And this iswhere our story begins.

5.1 Development of a method that identifies

correct protein models (Paper I)

The main strength of the consensus analysis, as used in Pcons, is cou-pled to the structural similarity between the different models returnedfrom the servers. However, there are of course other factors that canbe used in the selection process as well, e.g. the score from the server

36

or an evaluation of the protein model using an energy function. Mostenergy function up to date seemed to be pretty good at picking the na-tive structure from a set of misfolded decoys. However, protein modelsare usually relatively far away from the native and even a quite goodmodel can contain extended regions and unfavorable interactions thatan energy function is too sensitive to find.

The aim of this project was to develop an energy function, ProQ,that would be able to handle the difficulties involved in finding thebest possible protein model. We decided to use a neural network ap-proach, as it would allow us to test many different types of input ina fairly straightforward manner. An important difference from earlierstudies is, that we used a more relaxed and in our minds more realis-tic definition of native-like models. We defined “correct” models in asimilar way as used in CASP (Moult et al., 2001; Sippl et al., 2001),CAFASP (Fischer et al., 2001) and LiveBench (Bujnicki et al., 2001a;Bujnicki et al., 2001b), i.e. by finding similar fragments between thenative structure and a model. The training set consisted of all-atommodels generated from a large set of different alignments methods, intotal 11,108 protein models were used. From these models a set ofstructural features, such as atom-atom contacts, residue-residue con-tacts, surface accessibility and secondary structure information werecalculated and neural networks were trained to predict the correctnessof the model using these features.

By combining all the structural features it was possible to increasethe correlation coefficient between the predicted model quality andthe actual correct model quality from 0.4 to 0.75. It was quite clearthat the combination of different types of structural features was themain reason for the improvement. This can be explained by the factthat incorrect protein models might have many good, i.e. native-likefeatures, whereas correct protein models might have many bad, i.e.native-unlike, features. The best way to distinguish between thesemodels is to use a combination of measures.

The performance of ProQ was compared to the performance of ex-isting methods for evaluating protein model quality. We found thatProQ was at least as good as other methods when identifying thenative structure and better at the detection of correct models. Thisperformance was also maintained over several different test sets and

37

quality measures.

Finally we combined ProQ with Pcons, in a predictor called Pmod-eller. This predictor built all-atom models using Modeller (Sali &Blundell, 1993) for all Pcons alignments. ProQ was then applied toassess the quality of the models; and all models were given a combinedscore based both on the consensus analysis using Pcons and on thestructural evaluation using ProQ. This combined Pmodeller scorewas shown to perform better than Pcons alone. Later studies (e.g.LiveBench and CASP) have confirmed this result, indicating the struc-tural evaluation using ProQ indeed is helpful in the selection of goodmodels.

5.2 Pcons, ProQ and Pmodeller at

CASP5 / CAFASP3 (Paper II)

The great exam was coming up - CASP season! According to ourbenchmarks Pmodeller should be doing pretty well in CASP. How-ever, being good in your own benchmark is one thing, being good inCASP is another story. CASP is the ultimate test. If you do well,no one can say that you have optimized your method to the test set.If you do bad that is exactly what you will hear. Another nice thingabout CASP, is that it provides a unique opportunity to compare theperformance of automatic servers (like Pmodeller and Pcons) withthe performance of manual experts who might even use these methods.

We participated in CASP with four servers: Pcons2, Pcons3, Pmod-eller and Pmodeller3, where Pmodeller uses Pcons2 and Pmod-eller3 uses Pcons3. The difference between the two Pcons versionis the servers they are using. Pcons2 uses seven servers, while Pcons3uses only three (but better) servers (Table I, Paper I). So the questionswere: Would Pmodeller do better than Pcons, i.e does the qualityassessment with ProQ really help? and how good are we in compari-son to other servers and other manual groups?

It turned out that Pmodeller did extremely well and was one of thetop ranked servers (Table IV, Paper II), it actually performed so wellthat we were selected to give a talk at the CASP5 meeting. The per-formance of Pmodeller was similar to all but a few manual experts.

38

Although a small group of experts still performed better, most of theexperts participating in CASP5 actually performed worse even thoughthey had full access to all automatic predictions and Pmodeller inparticular.

Pmodeller was consistently ranked higher than the correspondingPcons version in CASP5 and in CAFASP3 (Fischer et al., 2003); andlater also in LiveBench (Rychlewski et al., 2003). Further, it could alsobe shown that the homology modeling step in Pmodeller was not thereason for this improvement. Since no examples of significantly bettermodels were observed when Pmodeller and Pcons used the samealignment. Thus, the main reason for the difference in performancebetween Pmodeller and Pcons was the ProQ structural evalua-tion of the quality of the protein models. This procedure increasedboth the total number of correct predictions and the specificity of thepredictions, by reducing the number of false positives.

The main lessons from CASP5 were that by using a consensus server,like Pmodeller, it was possible to compete with most manual ex-perts. However, some manual experts did significantly better predic-tions than the best consensus servers. One noticeable difference be-tween Pmodeller and these best manual groups was that the totalnumber of residues predicted was about 5% less for Pmodeller, i.e.the models from Pmodeller were shorter. A possible reasons forthis is that Pmodeller was restricted to one alignment per model.During the development of Pmodeller we tried using multiple tem-plates alignments, to increase the target coverage, but without success.Another related problem is the division of the target sequence into do-mains, where most servers will only return the hit to the best domain,resulting in a model with only one domain. An iterative approachwhere regions with no hits are resubmitted to the servers has provento be useful (Chivian et al., 2003).

5.3 Detecting correct residues with ProQres and

ProQprof (Paper III)

After CASP5 it was clear that the best groups had used meta-serversin combination with additional measures to assess the quality of the

39

final models. In particular many groups used Verify3d (Bowie et al.,1991; Luthy et al., 1992) to assess the quality of different parts themodels (Kosinski et al., 2003; von Grotthuss et al., 2003). The con-sensus idea has also been applied and used to find the parts of a modelthat are most reliable (Fischer, 2003). Both these approaches werethen used as a basis to build a hybrid model using multiple templates.

Given the success of the Pmodeller server and in particular theProQ module, we thought that it would be possible to use similarideas to develop a local version of ProQ, ProQres. This versionwould, instead of predicting the quality of the whole model, predict thequality of single residues. Since the combination of different structuralfeatures made ProQ perform significantly better than Verify3d atdetecting correct protein models. It should be possible to use differentlocal structural features and perform better at the residue level aswell. In addition we also developed a local quality predictor that useda window of profile scores to assess the quality of protein models builtfrom an alignment, ProQprof.

As for ProQ we used neural nets. These were trained on models builtfrom structural alignments between proteins related on the superfamilylevel according to SCOP using five-fold cross-validation. The finalpredictors were then benchmarked using two additional test sets, onederived from LiveBench and one derived from SCOP.

The structural features used in ProQres were identical to those usedin ProQ, i.e. atom-atom contacts, residue-residue contacts, solventaccessibility surfaces and secondary structure information. However,in order get a local quality prediction the environment around eachresidue was described by calculating the different structural featuresfor a sliding window and predict the correctness of the central residue.Hereby, each residue in the sequence is described by all the contactsmade by all residues in the window.

ProQprof used a window of profile scores calculated from the align-ment as the basis for training. This score were calculated using the theLog Aver (von Ohsen & Zimmer, 2001) scoring function on equivalentpositions (vectors) in two profiles corresponding to the two sequencesin the alignment.

Both ProQres and ProQprof were shown to be more accurate than

40

existing methods. ProQprof performs better than ProQres formodels based on distant relationships, while ProQres performs bestfor models based on closer relationship. We also showed that a com-bination of ProQres and ProQprof performs better than any ofthem alone. A possible explanation for this difference in performancedepending on model quality, is that poor models lack many nativelong-range interactions, which make a structural evaluation less reli-able, while a profile-profile comparison is able to find these positions.On the other hand models of higher quality have many more nativelong-range interactions, i.e. improving the structural evaluation.

5.4 Pcons5: the fifth generation of Pcons

(Paper IV)

In this paper we describe the development of Pcons5, which is sig-nificantly different from the previous Pcons versions. Pcons5 is alsothe first available stand alone version of Pcons, and it should be idealfor those who would like to set up a local meta-server.

Pcons5 was developed with the goal to be easy to up update with newservers. It consists of three modules, a consensus module, a qualityassessment module and score module. Each module returns two scoresand these score are combined in a final score using a weighted sum.Pcons5 is not restricted to a fixed number of specific servers, it ispossible to run both the consensus and the quality assessment moduleon any set of models. It is only the score module that is methodspecific.

The consensus module returns the average similarity to all models,and the average similarity to the first ranked models from each server.The quality module assess the quality of the models using a backboneversion of ProQ. The score module returns two scores that measurethe reliability of the server score, e.g BLAST E-value below 10−10 wouldbe given a high score and a E-value above 10−1 a low score etc.

In addition we also developed a Pmodeller5 version based on Pcons5that built all-atom models using Modeller (Sali & Blundell, 1993)and re-rank the models using the original version of ProQ, which issignificantly better than the backbone ProQ module used in Pcons5.

41

Pcons5 and Pmodeller5 participated in CASP6. For technical rea-sons we had to start the server manually, but they were run withouthuman intervention. Thus, in the official CASP rankings we are listedas a manual group, while in fact it was an automatic server. Thiswas also the reason why we were not selected to talk this time, sincePmodeller was indeed one of the best servers. It also performed bet-ter than Pcons5 and the previous versions of Pmodeller and Pcons(according to the sum of the GDT TS for the first ranked model).

5.5 Are all homology modeling programs equal?

(Paper V)

In this study we presented the first large-scale benchmark of publiclyavailable homology modeling programs, i.e. programs that constructthe 3D coordinates of a protein model based on the alignment to aknown protein structure. The following six programs were tested:Modeller, SegMod/encad, swiss-model, 3d-jigsaw, nest andBuilder. The idea was simple: give each program the same input(alignment and template) and analyze the model it produced withrespect to physiochemical criteria and structural similarity to the cor-rect structure. However, it turned out that making the comparisonwas not as simple and straightforward as we had expected. The reasonwas that some programs crashed for certain inputs and sometimes theprograms did not return coordinates for all residues in the sequence.To not return any coordinates is probably worse than to return wrongcoordinates, but not everyone will agree on this assumption. In anycase, we chose to ignore these cases and only compared models, orrather residues that existed in all models from all programs. In thefollowing, the most important results from this study will be presented.perspective.

From our study it was clear that the different modeling approaches giverise to differences in the final model. It is true that the alignment is cru-cial for the quality of the final model and no model building procedurecan recover from an incorrect alignment. The effect of an alignmenterror though, can be very dependent on the modeling procedure, e.g.a rigid-body assembly method will be much more sensitive to align-

42

ment errors than a method that derives distance restraints from thealignment (Figure 1, Paper V). It was also clear that no single mod-eling program performs best in all different test. All programs hadits pros and cons (Table 4, Paper V), but three modeling programs,Modeller, nest and SegMod/encad performed better than theother programs. These programs were reliable and most often theyproduced chemically correct models.

SegMod/encad perform very well in all tests except for backboneconformation, nest very rarely makes the models worse than the tem-plate, but the chemistry is not as good as for SegMod/encad, whileModeller is in general good, except for a few examples of poor con-vergence and suboptimal side-chain positioning.

swiss-model failed to produce models for 10% of all alignments (allothers failed < 1%). However, it performed quite well when it produceda model. 3d-jigsaw and Builder had problems with modeling allresidues, resulting in a decreased overall performance.

Also none of the programs build side-chains as well as a program spe-cialized for this task such as SCWRL, using the same information. Itshould therefore be room for improvement within this area.

43

6 Conclusions and reflections

By using a machine-learning approach we have shown that it is possibleto predict the quality of protein models based on structural features(Paper I-III). We also showed that different structural features containorthogonal information and that the combination of different structuralfeatures are important to distinguish between correct and incorrectprotein models (Paper I). Further, we have improved the best meta-predictor, Pcons, by the including a quality assessment step using ourquality predictor, ProQ, in the Pmodeller series. The improvementusing ProQ has been confirmed in both in LiveBench and in CASP,where Pmodeller has been one of the top performing servers.

Our latest predictors, ProQres and ProQprof, predicts the localmodel quality based on local structural features and on evolutionaryinformation respectively. Using evolutionary information is clearlybeneficial to detect good regions in protein models, while structuralinformation is more useful to detect poor regions. These predictorsmight prove to be useful in the construction of hybrid models usingmore than one template.

We have also conducted the first large-scale benchmark of publiclyavailable homology modeling programs. Here we showed that there aredifferences between the different modeling programs. Three modelingprograms, Modeller, nest and SegMod/encad performed betterthan the other programs.

Clearly, one of the major bottlenecks in the protein structure predic-tion field today is refinement, e.g. taking a 2A model down to 1A.There have been some progress in this area, by using protein evolutionto restricted the refinement to the parts of the model that are morelikely to undergo a change (Qian et al., 2004). However, even the re-stricted sampling requires accurate energy functions to select the finalconformation. Most likely these energy functions will be a combinationof different features taken both from physics and evolution.

44

7 Acknowledgments

This is probably the most important part. I want to acknowledge allthe people contributing in one way or the other in the production ofthis thesis.

First of all, Arne Elofsson, my supervisor, for many encouraging dis-cussions, great ideas and for being the best supervisor you could everhave.

My roommates at SBC Erik and Asa for always arriving early and forproviding a perfect excuse to discuss the important things in life.

Johannes for introducing me to the world of Hattrick and for his un-believable stories that actually are true.

Tomas for explaining how to order “macchiato due” in Rome.

All other members of the constantly growing AE group: Diana Ekman,Oliva Eriksson, Kristoffer Illergard, Per Larsson and Hakan Viklundfor being such lovely colleagues.

All other PhD students, post-docs, projects students, sys admins, as-sistant, associate and full professors at SBC for creating a wonderfulworking environment.

The Swedish government for setting up the Research School in Ge-nomics and Bioinformatics to pay for my salary.

Helge Ax:son Johnsons Foundation and Jan-Arthur Ekstroms Minness-tiftelse for providing additional financial support.

Olov, Jakob and Magnus for your friendship - see you on the beach inHerta in October!

Magnus for our endless discussions at the gym, I even explained theproblem of angles in three dimensions once.

My parents, for encouraging me to walk my own paths and pursue myown goals and for always helping out.

Susanna, the love of my life, thanks for always being by my side.

Stockholm, August 2005.

45

References

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z.,Miller, W., and Lipman, D.J.. 1997. Gapped BLAST and PSI–BLAST: a new generation of protein database search programs.Nucleic Acids Res 25 (17):3389–3402.

Anfinsen, C. B.. 1973. Principles that govern the folding of proteinchains. Science 181 (96):223–230.

Aszodi, A., and Taylor, W. R.. 1996. Homology modelling by distancegeometry. Fold Des 1 (5):325–334.

Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann,B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M.,Martin, M. J., Natale, D. A., O’Donovan, C., Redaschi, N., andYeh, L. S.. 2005. The universal protein resource (uniprot). NucleicAcids Res 33 (Database issue):D154–9.

Baker, D., and Sali, A.. 2001. Protein structure prediction and struc-tural genomics. Science 294 (5540):93–96.

Bates, P. A., Kelley, L. A., MacCallum, R. M., and Sternberg, M. J..2001. Enhancement of protein modeling by human interventionin applying the automatic programs 3D-JIGSAW and 3D-PSSM.Proteins Suppl 5:39–46.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N.,Weissig, H., Shindyalov, I. N., and Bourne, P. E.. 2000. Theprotein data bank. Nucleic Acids Res 28 (1):235–242.

Bishop, Christopher M.. 1995. Neural Networks for Pattern Recogni-tion. Great Clarendon St, Oxford OX2 6DP, UK: Oxford Univer-sity Press.

Blom, N., Gammeltoft, S., and Brunak, S.. 1999. Sequence andstructure-based prediction of eukaryotic protein phosphorylationsites. J Mol Biol 294 (5):1351–1362.

46

Blundell, T. L., Sibanda, B. L., Sternberg, M. J., and Thornton, J. M..1987. Knowledge-based prediction of protein structures and thedesign of novel molecules. Nature 326 (6111):347–352.

Bonneau, R., and Baker, D.. 2001. Ab initio protein structure predic-tion: progress and prospects. Annu Rev Biophys Biomol Struct30:173–189.

Bonneau, R., Tsai, J., Ruczinski, I., Chivian, D., Rohl, C., Strauss,C.E., and Baker, D.. 2001. Rosetta in CASP4: progress in ab initioprotein structure prediction. Proteins 45 (Suppl 5):119–126.

Bowie, J.U., Luthy, R., and Eisenberg, D.. 1991. A method to identifyprotein sequence that fold into a known three-dimensional struc-ture. Science 253 (5016):164–170.

Browne, W. J., North, A. C., Phillips, D. C., Brew, K., Vanaman,T. C., and Hill, R. L.. 1969. A possible three–dimensional struc-ture of bovine alpha–lactalbumin based on that of hen’s egg–whitelysozyme. J Mol Biol 42 (1):65–86.

Bryant, S.H., and Amzel, L.M.. 1987. Correctly folded proteins maketwice as many hydrophobic contacts. Int J Pept Prot Res 29:46–52.

Bryant, S. H., and Lawrence, C. E.. 1991. The frequency of ion-pairsubstructures in proteins is quantitatively related to electrostaticpotential: a statistical model for nonbonded interactions. Proteins9 (2):108–119.

Bujnicki, J.M., Elofsson, A., Fischer, D., and Rychlewski, L.. 2001a.Livebench–1: continous benchmarking of protein structure pre-diction servers. Protein Sci. 10 (2):352–361.

Bujnicki, J.M., Elofsson, A., Fischer, D., and Rychlewski, L.. 2001b.Livebench–2: large-scale automated evaluation of protein struc-ture prediction servers. Proteins 45 (Suppl 5):184–191.

Chivian, D., Kim, D. E., Malmstrom, L., Bradley, P., Robertson, T.,Murphy, P., Strauss, C. E., Bonneau, R., Rohl, C. A., and Baker,

47

D.. 2003. Automated prediction of CASP-5 structures using therobetta server. Proteins 53 Suppl 6:524–533.

Chothia, C., and Lesk, A. M.. 1986. The relation between the diver-gence of sequence and structure in proteins. EMBO J 5 (4):823–826.

Claessens, M., Cutsem, E. Van, Lasters, I., and Wodak, S.. 1989. Mod-elling the polypeptide backbone with ’spare parts’ from knownprotein structures. Protein Eng 2 (5):335–345.

Colovos, C., and Yeates, T.O.. 1993. Verification of protein struc-tures: patterns of nonbonded atomic interactions. Protein Sci. 2(9):1511–1519.

Cristobal, S., Zemla, A., Fischer, D., Rychlewski, L., and Elofsson, A..2001. A study of quality measures for protein threading models.BMC Bioinformatics 2 (5).

Dominy, B.N., and Brooks, C.L.. 2002. Identifying native-like proteinstructures using physics-based potentials. J Comput Chem. 23(1):147–160.

Ekman, D., Bjorklund, A. K., Frey-Skott, J., and Elofsson, A.. 2005.Multi-domain proteins in the three kingdoms of life: orphan do-mains and other unassigned regions. J Mol Biol 348 (1):231–243.

Elofsson, A.. 2002. A study on how to best align protein sequences.Proteins 15 (3):330–339.

Emanuelsson, O., Nielsen, H., and von Heijne, G.. 1999. ChloroP,a neural network-based method for predicting chloroplast transitpeptides and their cleavage sites. Protein Sci 8 (5):978–984.

Felts, A.K., Gallicchio, E., Wallqvist, A., and Levy, R.M.. 2002. Dis-tinguishing native conformations of proteins from decoys with aneffective free energy estimator based on the OPLS all–atom forcefield and the Surface Generalized Born solvent model. Proteins48 (2):404–422.

Fischer, D.. 2003. 3D-SHOTGUN: a novel, cooperative, fold-recognition meta-predictor. Proteins 51 (3):434–441.

48

Fischer, D., Elofsson, A., Rychlewski, L., Pazos, F., Valencia, A., Rost,B., Ortiz, A. R., and r. Dunbrack RL, J.. 2001. CAFASP2: thesecond critical assessment of fully automated structure predictionmethods. Proteins Suppl 5:171–183.

Fischer, D., Rychlewski, L., r. Dunbrack RL, J., Ortiz, A. R., andElofsson, A.. 2003. CAFASP3: the third critical assessment offully automated structure prediction methods. Proteins 53 Suppl6:503–516.

Ginalski, K., Elofsson, A., Fischer, D., and Rychlewski, L.. 2003. 3d-jury: a simple approach to improve protein structure predictions.Bioinformatics 19 (8):1015–1018.

Greer, J.. 1990. Comparative modeling methods: application to thefamily of the mammalian serine proteases. Proteins 7 (4):317–334.

Gregoret, L. M., and Cohen, F. E.. 1991. Protein folding. effect ofpacking density on chain conformation. J Mol Biol 219 (1):109–122.

Havel, T. F., and Snow, M. E.. 1991. A new method for buildingprotein conformations from sequence alignments with homologuesof known structure. J Mol Biol 217 (1):1–7.

Holm, L., and Sander, C.. 1996. Mapping the protein universe. Science273:595–603.

Jones, D.T.. 1999a. GenTHREADER: an efficient and reliable proteinfold recognition method for genomic sequences. J Mol Biol 287(4):797–815.

Jones, D.T.. 1999b. Protein secondary structure prediction based onposition–specific scoring matrices. J Mol Biol 292 (2):195–202.

Jones, T. A., and Thirup, S.. 1986. Using known substructures inprotein model building and crystallography. EMBO J 5 (4):819–822.

Koehl, P., and Delarue, M.. 1994a. Application of a self-consistentmean field theory to predict protein side-chains conformation and

49

estimate their conformational entropy. J Mol Biol 239 (2):249–275.

Koehl, P., and Delarue, M.. 1994b. Polar and nonpolar atomic environ-ments in the protein core: implications for folding and binding.Proteins 20 (3):264–278.

Koehl, P., and Delarue, M.. 1995. A self consistent mean field ap-proach to simultaneous gap closure and side-chain positioning inhomology modelling. Nat Struct Biol 2 (2):163–170.

Koehl, P., and Delarue, M.. 1996. Mean-field minimization methods forbiological macromolecules. Curr Opin Struct Biol 6 (2):222–226.

Kosinski, J., Cymerman, I. A., Feder, M., Kurowski, M. A., Sasin,J. M., and Bujnicki, J. M.. 2003. A ”FRankenstein’s monster”approach to comparative modeling: merging the finest fragmentsof fold-recognition models and iterative model refinement aidedby 3D structure evaluation. Proteins 53 Suppl 6:369–379.

Laskowski, R.A., MacArthur, M.W., Moss, D.S, and Thornton, J.M..1993. Procheck: a program to check the stereochemical quality ofprotein structures. J Appl Cryst 26 (2):283–291.

Lazaridis, T., and Karplus, M.. 1999. Discrimination of the nativefrom misfolded protein models with an energy function includingimplicit solvation. J Mol Biol 288 (3):477–487.

Lazaridis, T., and Karplus, M.. 2000. Effective energy functions forprotein structure prediction. Curr. Opin. Struct. Biol. 10 (2):139–145.

Levitt, M.. 1992. Accurate modeling of protein conformation by auto-matic segment matching. J Mol Biol 226 (2):507–533.

Lindahl, E., and Elofsson, A.. 2000. Identification of related proteinson family, superfamily and fold level. J Mol Biol 295 (3):613–625.

Lundstrom, J., Rychlewski, L., Bujnicki, J., and Elofsson, A.. 2001.Pcons: a neural-network-based consensus predictor that improvesfold recognition. Protein Sci. 10 (11):2354–2362.

50

Luthy, R., Bowie, J.U., and Eisenberg, D.. 1992. Assessment of proteinmodels with three–dimensional profiles. Nature 356 (6364):283–85.

Marti-Renom, M.A., Stuart, A.C., Fiser, A., Sanchez, R., Melo, F., andSali, A.. 2000. Comparative protein structure modeling of genesand genomes. Annu Rev Biophys Biomol Struct 29:291–325.

Mirny, L.A., and Shakhnovich, E.I.. 1999. Universally conserved posi-tions in protein folds: reading evolutionary signals about stability,folding kinetics and function. J Mol Biol 291 (1):177–198.

Moult, J.. 2005. A decade of CASP: progress, bottlenecks and prog-nosis in protein structure prediction. Curr Opin Struct Biol 15(3):285–289.

Moult, J., Fidelis, K., Zemla, A., and Hubbard, T.. 2001. Criticalassessment of methods of protein structure prediction (CASP):round IV. Proteins 43 (Suppl 5):2–7.

Moult, J., Fidelis, K., Zemla, A., and Hubbard, T.. 2003. Criticalassessment of methods of protein structure prediction (CASP)-round V. Proteins 53 Suppl 6:334–339.

Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C.. 1995.SCOP: a structural classification of proteins database for the in-vestigation of sequences and structures. J Mol Biol 247 (4):536–540.

Ohlson, T., Wallner, B., and Elofsson, A.. 2004. Profile-profile methodsprovide improved fold-recognition: a study of different profile-profile alignment methods. Proteins 57 (1):188–197.

Petrey, D., Xiang, Z., Tang, C. L., Xie, L., Gimpelev, M., Mitros, T.,Soto, C. S., Goldsmith-Fischman, S., Kernytsky, A., Schlessinger,A., Koh, I. Y., Alexov, E., and Honig, B.. 2003. Using multiplestructure alignments, fast model building, and energetic analysisin fold recognition and homology modeling. Proteins 53 Suppl6:430–435.

51

Qian, B., Ortiz, A. R., and Baker, D.. 2004. Improvement of compar-ative model accuracy by free-energy optimization along principalcomponents of natural structural variation. Proc Natl Acad Sci US A 101 (43):15346–15351.

Qian, N., and Sejnowski, T. J.. 1988. Predicting the secondary struc-ture of globular proteins using neural network models. J Mol Biol202 (4):865–884.

Rost, B.. 1999. Twilight zone of protein sequence alignments. ProteinEng 12 (2):85–94.

Rost, B., and Sander, C.. 1993. Prediction of protein secondary struc-ture structure at better than 70% accuracy. J Mol Biol 232(2):584–599.

Rost, B., Sander, C., and Schneider, R.. 1994. PHD–an automaticmail server for protein secondary structure prediction. ComputAppl Biosci 10 (1):53–60.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J.. 1986. Learninginternal representations by error propagation. Cambridge, MA,USA: MIT Press.

Rychlewski, L., Fischer, D., and Elofsson, A.. 2003. Livebench-6: large-scale automated evaluation of protein structure prediction servers.Proteins 53 Suppl 6:542–547.

Rychlewski, L., Jaroszewski, L., Li, W., and Godzik, A.. 2000. Com-parison of sequence profiles. strategies for structural predictionsusing sequence information. Protein Sci. 9 (2):232–241.

Sadreyev, R., and Grishin, N.. 2003. COMPASS: a tool for compar-ison of multiple protein alignments with assessment of statisticalsignificance. J Mol Biol 326 (1):317–336.

Sali, A., and Blundell, T.L.. 1993. Comparative modelling by statis-faction of spatial restraints. J Mol Biol 234 (3):779–815.

Schwede, T., Kopp, J., Guex, N., and Peitsch, M.C.. 2004. SWISS–MODEL: an automated protein homology-modeling server. Nu-cleic Acids Res 31 (7):3381–3385.

52

Siew, N., Elofsson, A., Rychlewski, L., and Fischer, D.. 2000. Maxsub:an automated measure to assess the quality of protein structurepredictions. Bionformatics 16 (9):776–785.

Simons, K.T., Bonneau, R., Ruczinski, I., and Baker, D.. 1999. Abinitio protein structure predictions of CASP III targets usingROSETTA,. Proteins Suppl 3:171–176.

Sippl, M.J.. 1993. Recognition of errors in three–dimensional structuresof proteins. Proteins 17 (4):355–362.

Sippl, M.J., Lackner, P., Domingues, F.S., Prlic, A., Malik, R., An-dreeva, A., and Wiederstein, M.. 2001. Assessment of the CASP4fold recognition category. Proteins 45 (Suppl 5):55–67.

Srinivasan, S., March, C. J., and Sudarsanam, S.. 1993. An automatedmethod for modeling proteins on known templates using distancegeometry. Protein Sci 2 (2):277–289.

Still, W.C., Tempczyk, A., Hawley, R.C., and Hendrickson, T.J.. 1990.Semianalytical treatment of solvation for molecular mechanics anddynamics. J Am Chem Soc 112 (16):6127–6129.

Sanchez, R., and Sali, A.. 1997. Comparative protein modeling as anoptimization problem. J. Mol. Struct. 398 (3):489–496.

Unger, R., Harel, D., Wherland, S., and Sussman, J. L.. 1989. A 3dbuilding blocks approach to analyzing and predicting structure ofproteins. Proteins 5 (4):355–373.

Venclovas, C., Zemla, A., Fidelis, K., and Moult, J.. 2003. Assessmentof progress over the CASP experiments. Proteins 53 Suppl 6:585–595.

Vitkup, D., Melamud, E., Moult, J., and Sander, C.. 2001. Complete-ness in structural genomics. Nat Struct Biol 8 (6):559–566.

von Grotthuss, M., Pas, J., Wyrwicz, L., Ginalski, K., and Rychlewski,L.. 2003. Application of 3D-jury, GRDB, and Verify3D in foldrecognition. Proteins 53 Suppl 6:418–423.

53

von Ohsen, N., and Zimmer, R.. 2001. Improving profile-profile align-ments via log average scoring. In WABI ’01: Proceedings of theFirst International Workshop on Algorithms in Bioinformatics pp.11–26, Springer-Verlag, London, UK.

Wallner, B., Fang, H., Ohlson, T., Frey-Skott, J., and Elofsson, A..2004. Using evolutionary information for the query and targetimproves fold recognition. Proteins 54 (2):342–350.

Yona, G., and Levitt, M.. 2000. Towards a complete map of the proteinspace based on a unified sequence and structure analysis of allknown proteins. Proc Int Conf Intell Syst Mol Biol 8:395–406.

54

protein structure prediction: model building and quality ...196554/fulltext01.pdf · use of a...

Documents