Upload: sabina-anderson

Post on 29-Dec-2015

Protein Structure Prediction

• Sequence database searching
• Domain assignment
• Multiple sequence alignment
• Comparative or homology modeling
• Secondary structure prediction


Homologous Proteins

• The term homology, as used in a biological context, is defined as similarity of structure, physiology, development and evolution of organisms based upon common genetic factors.

• The statement that two proteins are homologous implies that their genes have evolved from a common ancestral gene. They often have similar functions.

• As a working rule, two proteins are considered to be homologous when they have identical amino acid residues in a significant fraction of aligned positions along the polypeptide chains (> 30 %).

• Homologous proteins have conserved structural cores and variable loop regions.
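
The 30 % criterion above can be checked with a few lines of code. The following is a minimal sketch, assuming the two sequences have already been aligned with "-" as the gap character; the function name and the toy sequences are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned, non-gap positions that carry identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # ignore columns where either sequence has a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy aligned pair, invented for illustration only
identity = percent_identity("MKT-AYIAKQR", "MKSLAYLAK-R")
print(f"{identity:.1f} % identity; above 30 % threshold: {identity > 30.0}")
```

Note that definitions of percent identity differ in what they use as the denominator (aligned positions, shorter sequence, alignment length); this sketch uses aligned non-gap positions.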


The Divergence of Amino-acid Sequence and 3D Structure for the Core Region of Homologous Proteins

• Known structures of 32 pairs of homologous proteins such as globins, serine proteinases, and immunoglobulin domains have been compared. The root mean square deviation of the main-chain atoms of the core regions is plotted as a function of amino acid homology. The curve represents the best fit of the dots to an exponential function. Pairs with high sequence homology are almost identical in three-dimensional structure, whereas deviations in atomic positions for pairs of low homology are on the order of 2 Å.
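
For reference, the root mean square deviation used on this slide can be computed as follows. This is a minimal sketch, assuming the two sets of main-chain atom coordinates are already optimally superposed and listed in equivalent order; it does not perform the superposition itself, and the coordinates shown are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords_a, coords_b):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords_a))


# Two made-up atom sets displaced by 2 Å along z
print(rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
           [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]))
```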


A Generalized Approach to Predicting Protein Structure

• Relevant experimental data
• Sequence data/preliminary analysis
• Sequence database searching
• Domain assignment
• Multiple sequence alignment
• Comparative or homology modeling
• Secondary structure prediction
• Fold recognition
• Analysis of folds and alignment of secondary structures
• Sequence to structure alignment


Flow Chart

• This flowchart assumes that the protein is soluble, likely comprises a single domain, and does not contain non-globular regions.


Experimental Data

Much experimental data can aid the structure prediction process. Some of these are listed below:

• Disulphide bonds, which provide tight restraints on the location of cysteines in space
• Spectroscopic data, which can give ideas as to the secondary structure content of the protein
• Site-directed mutagenesis studies, which can give insights as to residues involved in active or binding sites
• Knowledge of proteolytic cleavage sites and post-translational modifications, such as phosphorylation or glycosylation, which can suggest residues that must be accessible

• Remember to keep all of the available data in mind when doing predictive work. Always ask whether a prediction agrees with the results of experiments. If not, then it may be necessary to modify what has been completed.


Protein Sequence Data

• There is some value in doing some initial analysis on the protein sequence. If a protein has come (for example) directly from a gene prediction, it may consist of multiple domains. More seriously, it may contain regions that are unlikely to be globular, or soluble.

• Is the protein a transmembrane protein, or does it contain transmembrane segments? There are many methods for predicting these segments, including:

• TMAP (EMBL) http://www.mbb.ki.se/tmap/index.html

• PredictProtein (EMBL/Columbia) http://dodo.cpmc.columbia.edu/predictprotein/

• TMHMM (CBS, Denmark)

• TMpred (Baylor College)

• DAS (Stockholm)
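
Before submitting a sequence to the servers above, a quick hydropathy scan can flag candidate transmembrane segments. The sketch below is not the algorithm used by any of the listed servers; it is a simple sliding-window average on the Kyte-Doolittle hydropathy scale, with a 19-residue window and a cutoff of 1.6 chosen as commonly quoted values. The function name and test sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_windows(seq, window=19, threshold=1.6):
    """Return (start, end, mean hydropathy) for windows at or above threshold.

    Positions are 1-based and inclusive. Unknown residues score 0.0.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        mean = sum(KD.get(aa, 0.0) for aa in segment) / window
        if mean >= threshold:
            hits.append((start + 1, start + window, round(mean, 2)))
    return hits


# Artificial sequence: a 19-residue poly-leucine stretch between acidic flanks
toy = "D" * 10 + "L" * 19 + "D" * 10
for start, end, mean in hydrophobic_windows(toy):
    print(start, end, mean)
```

Overlapping hits would normally be merged into a single candidate segment; that step is left out to keep the sketch short.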


COILS - Prediction of Coiled Coil Regions in Proteins

• Does the protein contain coiled-coils? Prediction of coiled coils can be completed at the COILS server or by downloading the COILS program. http://www.ch.embnet.org/software/COILS_form.html

• COILS is a program that compares a sequence to a database of known parallel two-stranded coiled-coils and derives a similarity score. By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation.

COILS was described in:

Lupas, A., Van Dyke, M., and Stock, J. (1991) Predicting Coiled Coils from Protein Sequences, Science 252:1162-1164.


Does the Protein Contain Regions of Low Complexity?

• Proteins frequently contain runs of poly-glutamine or poly-serine, which do not predict well. To check for this the program SEG (a version of SEG is also contained within the GCG suite of programs) can be employed. ftp://ftp.ncbi.nlm.nih.gov/pub/seg/seg/

• If the answer to any of the above questions is yes, then it is worthwhile trying to break the sequence into pieces or ignore particular sections of the sequence, etc. This is related to the problem of locating domains.
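
To make the idea of low complexity concrete, the sketch below flags windows whose amino acid composition has low Shannon entropy. This is a simplified stand-in and not the SEG algorithm; the window length of 12 and the entropy cutoff of 2.2 bits are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    n = len(segment)
    counts = Counter(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, max_entropy=2.2):
    """Return 1-based start positions of windows with entropy <= max_entropy."""
    seq = seq.upper()
    starts = []
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= max_entropy:
            starts.append(start + 1)
    return starts


# Made-up sequence: a diverse stretch followed by a poly-glutamine run
print(low_complexity_windows("ACDEFGHIKLMN" + "Q" * 12))
```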


Multiple Sequence Alignment

Alignments can provide:

• Information on protein domain structure
• The location of residues likely to be involved in protein function
• Information on residues likely to be buried in the protein core or exposed to solvent
• More information than a single sequence for applications like homology modeling and secondary structure prediction
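
As a rough illustration of how an alignment highlights residues likely to matter for function, the sketch below scores each alignment column by the fraction of sequences that share the most common residue. The scoring scheme, function name and toy alignment are assumptions made for this example; real conservation measures usually also account for residue similarity.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Per-column fraction of sequences carrying the most common residue."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != gap]
        if not residues:
            scores.append(0.0)  # column of gaps only
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy alignment of three short segments (invented)
toy_alignment = ["GSGKT", "GTGKT", "GSGKS"]
for position, score in enumerate(column_conservation(toy_alignment), start=1):
    print(position, round(score, 2))
```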


Sequence Database Searching

• The most obvious first stage in the analysis of any new sequence is to perform comparisons with sequence databases to find homologues. These searches can now be performed just about anywhere and on just about any computer. In addition, there are numerous web servers for doing searches, where one can post or paste a sequence into the server and receive the results interactively.


Sequence Database Searching

• There are many methods for sequence searching. By far the best known is the BLAST suite of programs. One can easily obtain versions to run locally (from either NCBI or Washington University), and there are many web pages that permit one to compare a protein or DNA sequence against a multitude of gene and protein sequence databases. To name just a few:

• National Center for Biotechnology Information (USA) Searches

– http://www.ncbi.nlm.nih.gov/BLAST/

• European Bioinformatics Institute (UK) Searches

– http://www2.ebi.ac.uk/

• BLAST search through SBASE (domain database; ICGEB, Trieste)


BLAST

• One of the most important recent advances in sequence comparison has been the development of both gapped BLAST and PSI-BLAST (position-specific iterated BLAST).

• Both of these have made BLAST much more sensitive, and the latter is able to detect very remote homologues by taking the results of one search, constructing a profile and then using this to search the database again to find other homologues (the process can be repeated until no new sequences are found).

• It is essential that one compares any new protein sequence to the database with PSI-BLAST to see if known structures can be found prior to doing any of the other methods discussed in the next sections.


Sequence Database Searching

Other methods for comparing a single sequence to a database include:

• The FASTA suite (William Pearson, University of Virginia, USA) – http://alpha10.bioch.virginia.edu/fasta/

• SCANPS (Geoff Barton, European Bioinformatics Institute, UK) – http://barton.ebi.ac.uk/new/software.html

• BLITZ (Compugen's fast Smith Waterman search) – http://www2.ebi.ac.uk/bic_sw/


Multiple Sequence Database Searching

• It is also possible to use multiple sequence information to perform more sensitive searches. Essentially this involves building a profile from some kind of multiple sequence alignment. A profile gives a score for each type of amino acid at each position in the sequence, and generally makes searches more sensitive.

Tools for doing this include:

• PSI-BLAST (NCBI, Washington)

• ProfileScan Server (ISREC, Geneva) – http://www.isrec.isb-sib.ch/software/PFSCAN_form.html

• HMMER Hidden Markov Model searching (Sean Eddy, Washington University) – http://hmmer.wustl.edu/

• Wise package (Ewan Birney, Sanger Centre; this is for protein versus DNA comparisons) and several others. – http://www.sanger.ac.uk/Software/Wise2/
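
The notion of a profile, a score for each amino acid type at each alignment position, can be sketched in a few lines. The code below is an illustrative toy and not how PSI-BLAST, ProfileScan or HMMER build their models: it assumes an ungapped alignment, a uniform background frequency, and a fixed pseudocount of 1, and all names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Log-odds score (bits) for every amino acid at every alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(profile, sequence):
    """Slide the profile along a sequence; return (best score, 1-based start)."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i].get(sequence[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start + 1)
    return best  # None if the sequence is shorter than the profile


profile = build_profile(["GSGKT", "GTGKT", "GSGKS"])  # toy alignment
print(best_window(profile, "AAAGSGKTAAA"))
```

Because positions that vary in the alignment (here the second and fifth columns) receive lower scores than invariant ones, a profile tolerates substitutions exactly where the family tolerates them.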


Multiple Sequence Searching Using a Motif

• A different approach for incorporating multiple sequence information into a database search is to use a motif. Instead of giving every amino acid some kind of score at every position in an alignment, a motif ignores all but the most invariant positions in an alignment, and just describes the key residues that are conserved and define the family. Sometimes this is called a "signature".

• For example, "H-[FW]-x-[LIVM]-x-G-x(5)-[LV]-H-x(3)-[DE]" describes a family of DNA binding proteins. It can be translated as "histidine, followed by either phenylalanine or tryptophan, followed by any amino acid (x), followed by leucine, isoleucine, valine or methionine, followed by any amino acid (x), followed by glycine, . . . [etc.]".
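
A pattern written in this notation maps almost directly onto a regular expression. The following is a minimal sketch of such a conversion, covering only the elements of the notation that are commonly used (x, bracketed alternatives, braces for excluded residues, repeat counts, and the < and > anchors); the function name and the test sequence are made up for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # {ABC} means "any residue except A, B or C"
        element = element.replace("{", "[^").replace("}", "]")
        # x(5) or x(2,4) are repeat counts
        element = element.replace("(", "{").replace(")", "}")
        # lowercase x is the wildcard
        element = element.replace("x", ".")
        # < and > anchor the pattern to the sequence termini
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("H-[FW]-x-[LIVM]-x-G-x(5)-[LV]-H-x(3)-[DE]")
print(regex)
# Invented sequence that satisfies the pattern
print(bool(re.search(regex, "MKHFALAGAAAAALHAAADKK")))
```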


Multiple Sequence Searching Using a Motif

• PROSITE (ExPASy Geneva) contains a huge number of such patterns, and several sites allow you to search these data:

– ExPASy http://www.expasy.ch/tools/scnpsite.html
– EBI http://www2.ebi.ac.uk/ppsearch/

• It is best to search a few different databases in order to find as many homologues as possible. A very important thing to do, and one which is sometimes overlooked, is to compare any new sequence to a database of sequences for which 3D structure information is available. Whether or not the sequence is homologous to a protein of known 3D structure is not obvious in the output from many searches of large sequence databases. Moreover, if the homology is weak, the similarity may not be apparent at all during the search through a larger database.

• One can save a lot of time by making use of pre-prepared protein alignments.


Web sites for Performing Multiple Alignment

• EBI (UK) Clustalw Server – http://www2.ebi.ac.uk/clustalw/

• IBCP (France) Multalin Server – http://www.ibcp.fr/multalin.html

• IBCP (France) Clustalw Server

• IBCP (France) Combined Multalin/Clustalw

• MSA (USA) Server – http://www.ibc.wustl.edu/ibc/msa.html

• BCM Multiple Sequence Alignment ClustalW Server – http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html


Some Tips for Sequence Alignment

• Don't just take everything found in the searches and feed it directly into the alignment program. Searches will almost always return matches that do not indicate a significant sequence similarity. Look through the output carefully and throw things out if they don't appear to be a member of the sequence family. Inclusion of non-members in the alignment will confuse things and likely lead to errors later.

• Remember that the programs for aligning sequences aren't perfect, and do not always provide the best alignment. This is particularly so for large families of proteins with low sequence identities. If a better way of aligning the sequences is discovered, then by all means edit the alignment manually.


Locating Domains

• If the sequence has more than about 500 amino acids, it is almost certain that it will be divided into discrete functional domains. If possible, it is preferable to split such large proteins up and consider each domain separately. One can predict the location of domains in a few different ways. The methods below are given (approximately) from the most to the least confident.

• If homology to other sequences occurs only over a portion of the probe sequence and the other sequences are whole (i.e. not partial sequences), then this provides the strongest evidence for domain structure. Either run database searches yourself or make use of pre-defined databases of protein domains. Searches of these databases (see links below) will often assign domains easily.


Locating Domains

• Regions of low complexity often separate domains in multi-domain proteins. Long stretches of repeated residues, particularly proline, glutamine, serine or threonine, often indicate linker sequences and are usually a good place to split proteins into domains.

• Low complexity regions can be defined using the program SEG which is generally available in most BLAST distributions or web servers.

• Transmembrane segments are also very good dividing points, since they can easily separate extracellular from intracellular domains.


Locating Domains

• Something else to consider is the presence of coiled-coils. These unusual structural features sometimes (but not always) indicate where proteins can be divided into domains.

• Secondary structure prediction methods will often predict regions of proteins to have different protein structural classes. For example, one region of a sequence may be predicted to contain only helices and another to contain only sheets. These can often, though not always, suggest likely domain structure.

• If a sequence has been separated into domains, then it is very important to repeat all the database searches and alignments using the domains separately. Searches with sequences containing several domains may not find all sub-homologies, particularly if the domains are abundant in the database (e.g. kinases, SH2 domains, etc.).


Domain Assignment


Locating Domains by Web Sites

• SMART (Oxford/EMBL) – http://smart.embl-heidelberg.de/

• PFAM (Sanger Centre/Wash-U/Karolinska Institutet) – http://www.sanger.ac.uk/Software/Pfam/search.shtml

• COGS (NCBI)

• PRINTS (UCL/Manchester)

• BLOCKS (Fred Hutchinson Cancer Research Center, Seattle) – http://blocks.fhcrc.org/blocks/blocks_search.html

• SBASE (ICGEB, Trieste)

• Domain descriptions can also be located in the annotations in SWISSPROT.


P68 RNA Helicase

• ssyssdrdr grdrgfgapr fggsrtgpls gkkfgnpgek

lvkkkwnlde lpkfeknfyq ehpdlarrta qevdtyrrsk eitvrghncp kpvlnfyean fpanvmdvia rhnfteptai

qaqgwpvals gldmvgvaqt gsgktlsyll paivhinhhp flergdgpic lvlaptrela qqvqqvaaey cracrlkstc iyggapkgpq irdlergvei ciatpgrlid flecgktnlr rttylvldea drmldmgfep qirkivdqir pdrqtlmwsa twpkevrqla edflkdyihi nigalelsan hnilqivdvc hdvekdekli rlmeeimsek enktivfvet krrcdeltrk mrrdgwpamg ihgdksqqer dwvlnefkhg kapiliatdv asrgldvedv kfvinydypn ssedyihrig rtarstktgt aytfftpnni kqvsdlisvl reanqainpk llqlvedrgs

grsrgrggmk ddrrdrysag krggfntfrd renydrgysn llkrdfgakt qngvysaany tngsfgsnfv sagiqtsfrt gnptgtyqng ydstqqygsn vanmhngmnq qayaypvpqp

apmigypmpt gysq 614 aa

f015812 (GenBank)


Sequence Alignment of p68 to DEAD Proteins

• AXTGSGKT: Walker A motif for ATP binding
• DEAD: ATP binding, ATP hydrolysis
• SAT: transmission of energy from ATP to unwind RNA
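
The three motifs named on this slide can be located in the p68 sequence with a simple pattern search. The sketch below uses an excerpt of the sequence shown earlier, so the reported positions are relative to the excerpt and not to the full 614-residue protein; the regular expressions and function name are assumptions made for illustration.

```python
import re

# Excerpt of the p68 sequence from the earlier slide (spaces removed below)
P68_EXCERPT = "".join("""
qaqgwpvals gldmvgvaqt gsgktlsyll paivhinhhp flergdgpic lvlaptrela
qqvqqvaaey cracrlkstc iyggapkgpq irdlergvei ciatpgrlid flecgktnlr
rttylvldea drmldmgfep qirkivdqir pdrqtlmwsa twpkevrqla
""".split()).upper()

# Motif name -> regular expression ("." stands for the X wildcard)
MOTIFS = {
    "Walker A (AXTGSGKT)": r"A.TGSGKT",
    "DEAD": r"DEAD",
    "SAT": r"SAT",
}


def find_motifs(sequence, motifs=MOTIFS):
    """Return {motif name: (1-based start in sequence, matched text) or None}."""
    found = {}
    for name, pattern in motifs.items():
        match = re.search(pattern, sequence)
        found[name] = (match.start() + 1, match.group()) if match else None
    return found


for name, hit in find_motifs(P68_EXCERPT).items():
    print(name, hit)
```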


P68 RNA Helicase

Figure 1. Top panel is the domain structure of p68 RNA helicase. The numbers represent amino acid residue number. Bottom panels are the RNA helicase-core region of DEAD-box and DEXH-box proteins. The conserved sequence motifs and their putative functions are indicated.

[Figure labels. Top panel (p68): N-terminal, Helicase-core and C-terminal regions with residue numbers 130, 435 and 614; IQ motif, RGG repeats and putative HTH with residue numbers 30, 80, 477, 504, 554 and 560. Bottom panels: motif labels I, Ia, Ib, II, III, IV, V, VI; DEAD-box motifs AXXGXGKT, PTRELA, GG, TPGR, DEAD, SAT, RGXD, HRIGRXXR; DEXH-box motifs GXGKT, PXRXXA, TXGX, DEXH, S/TAT, XRXGRXXR; functional regions ATPase (ATPase A, ATPase B), Helicase, RNA binding.]


Comparative or Homology Modeling

• If the protein sequence shows significant homology to another protein of known three-dimensional structure, then a fairly accurate model of the protein 3D structure can be obtained via homology modeling.

• It is also possible to build models if one has found a suitable fold via fold recognition and is satisfied with the alignment of sequence to structure (Note that the accuracy of models constructed in this manner has not been assessed properly, so treat with caution).


Comparative or Homology Modeling

• It is now possible to generate models automatically using the very useful SWISSMODEL server. Sending in a protein sequence alone is possible only when the degree of sequence homology is high (50% or greater). It is best, particularly if one has edited an alignment, to send an alignment directly to the server.

– http://www.expasy.ch/swissmod/SWISS-MODEL.html

Some other sites useful for homology modeling include:

• WHAT IF (G. Vriend, EMBL, Heidelberg) – http://www.cmbi.kun.nl/whatif/

• MODELLER (A. Sali, Rockefeller University) – http://guitar.rockefeller.edu/modeller/modeller.html

• MODELLER Mirror FTP site


Swiss-Model of P68 Based on EIF-4A

• EIF-4A is the initiation factor (1QAV) with 1.8 Å resolution.

[Figure labels: Walker A (AQSGTGKT), DEAD, SAT]


Methods for Single Sequences

• Secondary structure prediction has been around for almost a quarter of a century. The early methods suffered from a lack of data. Predictions were performed on single sequences rather than families of homologous sequences, and there were relatively few known 3D structures from which to derive parameters. Probably the most famous early methods are those of Chou & Fasman, Garnier, Osguthorpe & Robson (GOR) and Lim.

• Although the authors originally claimed quite high accuracies (70 - 80 %), under careful examination, the methods were shown to be only between 56 and 60 % accurate (Kabsch & Sander, 1983). An early problem in secondary structure prediction had been the inclusion of structures used to derive parameters in the set of structures used to assess the accuracy of the method.
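
Accuracy figures such as these are per-residue percentages. A common way to compute them is the three-state accuracy, often called Q3: the percentage of residues whose predicted state (helix H, strand E, or coil C) matches the observed state. The sketch below assumes both assignments are given as equally long strings over H, E and C; the example strings are invented.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Invented 10-residue example: two residues are mispredicted
print(q3_accuracy("HHHHCCEEEE", "HHHCCCEEEC"))
```

To avoid the assessment problem described above, the proteins used for such an evaluation should not overlap with those used to derive the method's parameters.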


Methods for Single Sequences

Early methods on single sequences

• Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
• Lim, V.I. (1974). Journal of Molecular Biology, 88, 857-872.
• Garnier, J., Osguthorpe, D.J. & Robson, B. (1978). Journal of Molecular Biology, 120, 97-120.
• Kabsch, W. & Sander, C. (1983). FEBS Letters, 155, 179-182. (An assessment of the above methods)

Later methods on single sequences

• Deleage, G. & Roux, B. (1987). Protein Engineering, 1, 289-294. (DPM)
• Presnell, S.R., Cohen, B.I. & Cohen, F.E. (1992). Biochemistry, 31, 983-993.
• Holley, H.L. & Karplus, M. (1989). Proceedings of the National Academy of Sciences, 86, 152-156.
• King, R. & Sternberg, M.J.E. (1990). Journal of Molecular Biology, 216, 441-457.
• Kneller, D.G., Cohen, F.E. & Langridge, R. (1990). Improvements in Protein Secondary Structure Prediction by an Enhanced Neural Network. Journal of Molecular Biology, 214, 171-182. (NNPRED)


Assignment of Amino Acids


Frequency of Occurrence of Amino Acids in the Turns


Secondary Structure Prediction Methods & Links

There are now many web servers for structure prediction. Here is a quick summary:

• PSI-pred (PSI-BLAST profiles used for prediction; David Jones, Warwick)
• JPRED consensus prediction (Cuff & Barton, EBI) – http://barton.ebi.ac.uk/servers/jpred.html
• PREDATOR (Frishman & Argos, EMBL) – http://www.embl-heidelberg.de/cgi/predator_serv.pl
• PHD home page (Rost & Sander, EMBL, Germany) – http://www.embl-heidelberg.de/predictprotein/predictprotein.html
• ZPRED server (Zvelebil et al., Ludwig, U.K.) – http://kestrel.ludwig.ucl.ac.uk/zpred.html (GOR)
• nnPredict (Cohen et al., UCSF, USA) – http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
• BMERC PSA Server (Boston University, USA) – http://bmerc-www.bu.edu/psa/
• SSP (nearest-neighbor; Solovyev and Salamov, Baylor College, USA) – http://dot.imgen.bcm.tmc.edu:9331/pssprediction/pssp.html


Recent Improvements

• The availability of large families of homologous sequences revolutionized secondary structure prediction.

• Traditional methods, when applied to a family of proteins rather than a single sequence, proved much more accurate at identifying core secondary structure elements. The combination of sequence data with sophisticated computing techniques such as neural networks has led to accuracies well in excess of 70 %. Though this seems a small percentage increase, these predictions are actually much more useful than those for single sequences, since they tend to predict the core accurately.

• Moreover, the limit of 70 – 80 % may be a function of secondary structure variation within homologous proteins.


Automated Methods

There are numerous automated methods for predicting secondary structure from multiply aligned protein sequences. Some good references are:

• Zvelebil, M.J.J.M., Barton, G.J., Taylor, W.R. & Sternberg, M.J.E. (1987). Prediction of Protein Secondary Structure and Active Sites Using the Alignment of Homologous Sequences. Journal of Molecular Biology, 195, 957-961. (ZPRED)

• Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70 % accuracy. Journal of Molecular Biology, 232, 584-599. (PHD)

• Salamov, A.A. & Solovyev, V.V. (1995). Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. Journal of Molecular Biology, 247,1 (NNSSP)

• Geourjon, C. & Deleage, G. (1994). SOPM: a self optimised prediction method for protein secondary structure prediction. Protein Engineering, 7, 157-16. (SOPMA)

• Solovyev, V.V. & Salamov, A.A. (1994). Predicting alpha-helix and beta-strand segments of globular proteins. Computer Applications in the Biosciences, 10, 661-669. (SSP)

• Wako, H. & Blundell, T.L. (1994). Use of amino-acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. 2. Secondary Structures. Journal of Molecular Biology, 238, 693-708.

• Mehta, P., Heringa, J. & Argos, P. (1995). A simple and fast approach to prediction of protein secondary structure from multiple aligned sequences with accuracy above 70 %. Protein Science, 4, 2517-2525. (SSPRED)

• King, R.D. & Sternberg, M.J.E. (1996). Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Science, 5, 2298-2310. (DSC)


PHD Prediction of rCD2


Comparison Between Prediction & X-ray


Manual Intervention

• It has long been recognized that patterns of residue conservation are indicative of particular secondary structure types.

• Alpha helices have a periodicity of 3.6 residues per turn, which means that for helices with one face buried in the protein core and the other exposed to solvent, the residues at positions i, i+3, i+4 and i+7 (where i is a residue in a helix) will lie on one face of the helix. Many alpha helices in proteins are amphipathic, meaning that one face points towards the hydrophobic core and the other towards the solvent. Thus patterns of hydrophobic residue conservation showing the i, i+3, i+4, i+7 pattern are highly indicative of an alpha helix.
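
The i, i+3, i+4, i+7 rule follows from simple arithmetic: 3.6 residues per turn means each residue is rotated 100° around the helix axis relative to the previous one. The sketch below computes these angles and reports which offsets fall on the same face as residue i; the 60° tolerance for "same face" is an assumption chosen for illustration.

```python
DEGREES_PER_RESIDUE = 100.0  # 360 degrees / 3.6 residues per turn


def wheel_angle(offset: int) -> float:
    """Angular position (degrees) of residue i+offset relative to residue i."""
    return (offset * DEGREES_PER_RESIDUE) % 360.0


def on_same_face(offset: int, tolerance: float = 60.0) -> bool:
    """True if residue i+offset lies within `tolerance` degrees of residue i."""
    angle = wheel_angle(offset)
    return min(angle, 360.0 - angle) <= tolerance


for k in range(8):
    print(f"i+{k}: {wheel_angle(k):5.1f} degrees, same face: {on_same_face(k)}")
```

Offsets 3, 4 and 7 land at 300°, 40° and 340°, all close to residue i at 0°, while offsets 1, 2, 5 and 6 point elsewhere around the helix.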


Pattern in Amphipathic Helix

• For example, this helix in myoglobin has a classic pattern of hydrophobic and polar residue conservation (i = 1).


Pattern in Amphipathic Beta Strand

• The geometry of beta strands means that adjacent residues have their side chains pointing in opposite directions.

• Beta strands that are half buried in the protein core will tend to have hydrophobic residues at positions i, i+2, i+4, i+6, etc., and polar residues at positions i+1, i+3, i+5, etc.
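
The alternating pattern of a half-buried strand can be searched for in the same spirit. This is a rough sketch under stated assumptions: the hydrophobic set, the minimum length of 5 and the function name alternating_runs are all illustrative choices, not part of the original slides.

```python
# Illustrative hydrophobic set (an assumption, as is the default minimum length).
HYDROPHOBIC = set("AVLIMFWC")


def alternating_runs(seq, min_len=5, hydrophobic=HYDROPHOBIC):
    """Return (start, end) 1-based inclusive spans in which hydrophobic and
    non-hydrophobic residues strictly alternate for at least min_len residues."""
    flags = [aa in hydrophobic for aa in seq.upper()]
    spans = []
    start = 0
    for i in range(1, len(flags) + 1):
        # A span ends at the sequence end or where two neighbours share a class.
        if i == len(flags) or flags[i] == flags[i - 1]:
            if i - start >= min_len:
                spans.append((start + 1, i))
            start = i
    return spans


if __name__ == "__main__":
    # Made-up peptide with an alternating stretch in the middle.
    print(alternating_runs("KKVTVTVKK"))
```

The sketch does not care which phase (odd or even positions) carries the hydrophobic residues, since either face of the strand may be the buried one.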


Pattern in Buried Beta Strand

• Beta strands that are completely buried (as is often the case in proteins containing both alpha helices and beta strands) usually contain a run of hydrophobic residues, since both faces are buried in the protein core.
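
A run of hydrophobic residues is the simplest of the three patterns to find. The sketch below uses a regular expression; the hydrophobic alphabet, the default minimum run length of 4 and the function name hydrophobic_runs are assumptions for illustration only.

```python
import re


def hydrophobic_runs(seq, min_len=4, hydrophobic="AVLIMFWC"):
    """Return (start, end, residues) for each run of at least min_len
    consecutive hydrophobic residues; positions are 1-based and inclusive."""
    pattern = re.compile("[%s]{%d,}" % (hydrophobic, min_len))
    return [(m.start() + 1, m.end(), m.group())
            for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    # Made-up peptide with one buried-strand-like stretch.
    print(hydrophobic_runs("KDVLIVAKE"))
```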


Secondary Structure Prediction of CD2

x-structure strands in this block: A, B, C, C', C"

         1       10        20        30        40        50
Rat CD2  RDSGTVWGALGHGINLNIPNFQMTDDIDEVRWERGSTLVAEFKRKMKPFLK
PHD      CCCCSSSSCCCCCSSSCCCCCCCCCCHHHHHHHHCCHHHHHHHHHCCCCSS
GOR      CCCCSSSSSSSCCCSCCCCCCCCCCCHCHSSHHHCCHHHHHHHHHHHHHHH
SOPMA    CCCCSSHCCCCCCSSSCCCCCCCCCCCCHSSHHCCCSHHHHHHHHHHHHHC

x-structure strands in this block: D, E, F, G

                60        70        80        90
Rat CD2  SGAFEILANGDLKIKNLTRDDSGTYNVTVYSTNGTRILNKALDLRILE
PHD      CCCSSSSSCCCSSSCCCCCCCCCCSSSSSSCCCHHHHHHHHCCCCCCC
GOR      HHHHHHHHHHHHHHHSSSSCCCCSSSSSSSSCCCCSSHHHHHHHHHHH
SOPMA    CCCSSSSCCCCSSSSSSCCCCCCCSSSSSSSCCCCSSSSHHHHHSSHC

H = α-helix, S = β-sheet, C = coil
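
The three predictions can be compared position by position. The sketch below is a minimal illustration that uses only the first ten residues of each prediction from the table above; the function names agreement and consensus are hypothetical, and falling back to coil when there is no majority is an assumption, not a rule from the slides.

```python
from collections import Counter

# First ten residues of each prediction on the slide (Rat CD2, residues 1-10).
PREDICTIONS = {
    "PHD": "CCCCSSSSCC",
    "GOR": "CCCCSSSSSS",
    "SOPMA": "CCCCSSHCCC",
}


def agreement(a, b):
    """Fraction of positions with the same state in two equal-length predictions."""
    if not a or len(a) != len(b):
        raise ValueError("predictions must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def consensus(predictions, tie="C"):
    """Per-position majority vote; positions without a strict majority get `tie`
    (coil by default, which is an assumption)."""
    predictions = list(predictions)
    out = []
    for column in zip(*predictions):
        state, count = Counter(column).most_common(1)[0]
        out.append(state if count > len(column) / 2 else tie)
    return "".join(out)


if __name__ == "__main__":
    print(agreement(PREDICTIONS["PHD"], PREDICTIONS["GOR"]))
    print(consensus(PREDICTIONS.values()))
```

Agreement between methods is not the same as accuracy: as the CD2 example shows, several methods can agree on a helix where the X-ray structure has a strand.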


CD2 vs. Helical Propensity

• Residues on strands C, C’, C” and G have strong helical propensity

[Figure: CD2 structure. Labels: N and C termini; strands A, B, C, C', C", D, E, F, G; residues L16, W32, V39, F42, Y76, V78.]


Three automated secondary structure predictions (PHD, SOPMA and SSPRED) appear below the alignment of 12 glutamyl tRNA reductase sequences. Positions within the alignment showing a conservation of hydrophobic side-chain character are shown in yellow, and those showing near total conservation of non-hydrophobic residues (often indicative of active sites) are colored green.


• Predictions of accessibility performed by PHD (PHD Acc. Pred.) are also shown (b = buried, e = exposed).

• For example, positions (within the alignment) 38 - 45 exhibit the classical amphipathic helix pattern of hydrophobic residue conservation, with positions i, i+3, i+4 and i+7 showing a conservation of hydrophobicity, with intervening positions being mostly polar.

• Positions 13 - 16 comprise a short stretch of conserved hydrophobic residues, indicative of a buried beta-strand.


Alignment of Sequence to Tertiary Structure

• Remember that the sequence-to-tertiary-structure alignments produced by fold recognition methods may be inaccurate. Where one has identified a remote homologue, fold recognition methods can sometimes give a very accurate alignment, though it is still sometimes fruitful to edit the alignment around variable regions.

• In other cases, it may be wise to create an alignment by starting with the alignment from the fold recognition method, and considering the alignment of secondary structures.



There is one suggested method by Dr. Robert B. Russell:

• Ensure that residues predicted to be buried/exposed align to those known to be buried or exposed in the template structure. Note that conserved hydrophobic/polar residues are more likely to be buried/exposed than non-conserved residues, which could simply be anomalies. One can predict residue accessibility manually, or by use of an automated server like PHD.

• Ensure that critical hydrogen bonding patterns are not disrupted in beta-sheet structures.

• Attempt to conserve residue properties (i.e. size, polarity, hydrophobicity) as best as possible across known and unknown structure.
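
The first of these checks can be turned into a simple score for comparing candidate alignments. This is only a sketch under stated assumptions: the function name burial_agreement is hypothetical, the burial strings are assumed to be already aligned (with "-" for gaps), and treating a half-buried template position as compatible with either prediction is an illustrative choice.

```python
def burial_agreement(predicted, observed, gap="-"):
    """Score an alignment by matching burial states.

    predicted: aligned burial string for the target (b = buried, e = exposed).
    observed:  aligned burial string for the template
               (b = buried, h = half-buried, e = exposed).
    Returns (matches, compared) over columns where neither string has a gap.
    """
    if len(predicted) != len(observed):
        raise ValueError("aligned strings must have the same length")
    matches = compared = 0
    for p, o in zip(predicted, observed):
        if gap in (p, o):
            continue  # nothing to compare in a gapped column
        compared += 1
        # Assumption: a half-buried template position fits either prediction.
        if o == "h" or p == o:
            matches += 1
    return matches, compared


if __name__ == "__main__":
    # Made-up burial strings for one candidate alignment.
    print(burial_agreement("bbee-bb", "bhebbbe"))
```

Running the function on two alternative alignments of the same region gives a quick way to see which one keeps more predicted-buried residues on buried template positions; it does not replace the hydrogen-bonding and residue-property checks.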


Things That Need to Be Considered

In the construction of an alignment, several things need to be considered:

• The observed residue burial or exposure

• The predicted residue burial or exposure

• The conservation of residue properties in known and unknown structures

• Whether the side chains on the core beta-strands point in towards the barrel or out towards the helices

• The hydrogen bonding pattern of the beta-strands comprising the core beta-barrel.


Alignment of the Prediction of the Glutamyl tRNA Reductases (hemA) with an Alpha/beta Barrel Structure (2acs)



• Sec. = known secondary structure from PDB code 2ACS (E = extended, H = alpha helix, G = 3-10 helix, B = beta-bridge);

• Bur. = known residue exposure for 2ACS (b = buried, h = half-buried, e = exposed); in/out = positioning of residues in the beta-barrel (i = pointing inwards, o = pointing outwards);

• Res. cons = conservation of residues (totally conserved = UPPER CASE, h = hydrophobic, p = polar, c = charged, a = aromatic, s = small, - = negative, + = positive); Pred = predicted burial and secondary structure for the glutamyl tRNA reductase family;

• Boxed positions are those with the same known/predicted burial. Shaded positions show a conservation of hydrophobic character in BOTH families of proteins, and positions in inverse text show a conservation of polar character in BOTH families.
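
A "Res. cons" line of this kind can be derived from an alignment column by column. The sketch below is illustrative only: the membership of each property class, the order in which classes are tried, the use of "." for unconserved or gapped columns and the function name conservation_code are all assumptions, not the scheme used to produce the figure.

```python
# Illustrative property classes, tried from most to least specific.
# Class membership is an assumption, not taken from the slides.
CLASSES = [
    ("-", set("DE")),          # negative
    ("+", set("KRH")),         # positive
    ("c", set("DEKRH")),       # charged
    ("a", set("FWYH")),        # aromatic
    ("s", set("GASCT")),       # small
    ("h", set("AVLIMFWYC")),   # hydrophobic
    ("p", set("STNQDEKRHY")),  # polar
]


def conservation_code(column, gap="-"):
    """Return a one-character conservation code for an alignment column."""
    residues = set(column.upper())
    if gap in residues:
        return "."  # columns with gaps are left uncoded in this sketch
    if len(residues) == 1:
        return residues.pop()  # totally conserved: the residue in upper case
    for code, members in CLASSES:
        if residues <= members:
            return code
    return "."  # no shared property


if __name__ == "__main__":
    # Made-up columns from a four-sequence alignment.
    print("".join(conservation_code(c) for c in ["LLLL", "DEDE", "VLIM", "GKWD"]))
```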