Protein Fold recognition Morten Nielsen, CBS, Department of Systems Biology, DTU

Post on 20-Dec-2015

Objectives

• Understand the basic concepts of fold recognition

• Learn why even sequences with very low sequence similarity can be modeled
  – Understand why %id is such a terrible measure of reliability

• See the beauty of sequence profiles
  – Position specific scoring matrices (PSSMs)

Background. Why protein modeling?

• Because it works!
  – Close to 50% of all new sequences can be homology modeled

• Experimental effort to determine protein structure is very large and costly

• The gap between the size of the protein sequence data and protein structure data is large and increasing

Growth of databases

Homology modeling and the human genome

How can we do it?

• Identify template(s) – initial alignment
  • Can give you protein function
• Improve alignment
  • Can give you active site
• Backbone generation
• Loop modeling
  • Most difficult part
• Side chains
• Refinement
• Validation

Identification of fold

If sequence similarity is high, proteins share structure (safe zone)

If sequence similarity is low, proteins may share structure (twilight zone)

Most proteins do not have a partner with high sequence similarity

Rajesh Nair & Burkhard Rost, Protein Science, 2002, 11, 2836-47

Structural Genomics in North America

• 10-year, $600 million project initiated in 2000, funded largely by NIH

• Aim: structural information on 10,000 unique proteins (now 4,000-6,000); so far 1,000 have been determined

• Improve current techniques to reduce time (from months to days) and cost (from $100,000 to $20,000 per structure)

• 9 research centers currently funded (2005); targets are from model and disease-causing organisms (a separate project on TB proteins)

Homology modeling for structural genomics

Roberto Sánchez et al. Nature Structural Biology 7, 986 - 990 (2000)

What a new fold can give

Example.

A postdoc in our group did her PhD obtaining the structure of the sequence below

>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL

• What is the function?
• Where is the active site?

What would you do?

• Function
  • Run Blast against PDB
    • No significant hits
  • Run Blast against NR (sequence database)
    • Function is acetylesterase?
• Where is the active site?

Example. Where is the active site?

1WAB Acetylhydrolase

1G66 Acetylxylan esterase

1USW Hydrolase

Example. Where is the active site?

• Align sequence against structures of known acetylesterases, like
  • 1WAB, 1FXW, …
• Cannot be aligned. Too low sequence similarity

1K7C.A 1WAB._ RMSD 11.2397
QAL 1K7C.A  71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF
DAL 1WAB._ 160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY

Is it really impossible?

Protein homology modeling is only possible if %id is greater than 30-50%

WRONG!!!

Why %id is so bad!!

1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)

Identification of correct fold

• % ID is a poor measure
  – Many evolutionarily related proteins share low sequence homology
  – A short alignment of 5 amino acids can share 100% id; what does this mean?
• Alignment score is even worse
  – Many sequences will score high against everything (hydrophobic stretches)
• P-value or E-value is more reliable
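To make the point about % ID concrete, here is a minimal Python sketch that computes percent identity over aligned columns. The function name `percent_identity` and both toy alignments are illustrative assumptions, not taken from the slides, and the definition used (identical columns divided by columns where both sequences have a residue) is only one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# A 5-residue match gives 100% id but says very little about the fold.
short_hit = percent_identity("AGDSL", "AGDSL")

# A longer, hypothetical alignment: 5 identities in 9 compared columns.
longer_hit = percent_identity("AGDSLVQ-LM", "AGESIVK-IM")

print(short_hit, longer_hit)
```

The number alone carries no information about alignment length or chance expectation, which is why a P-value or E-value is the better guide.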

What are P and E values?

• E-value
  – Number of expected hits in database with score higher than match
  – Depends on database size

• P-value
  – Probability that a random hit will have score higher than match
  – Database size independent

[Figure: distribution of P(Score) versus Score]

Score 150: 10 hits with higher score (E = 10); 10000 hits in database => P = 10/10000 = 0.001
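The worked example on this slide can be written as a short Python sketch. It uses only the slide's simplified relation (P is the number of hits with a higher score divided by the number of hits in the database); the function names are hypothetical, and real BLAST statistics are derived differently.

```python
def p_value_from_hits(hits_with_higher_score: int, database_size: int) -> float:
    """Slide's simplification: P = E / N."""
    if database_size <= 0:
        raise ValueError("database_size must be positive")
    return hits_with_higher_score / database_size


def e_value_from_p(p_value: float, database_size: int) -> float:
    """Expected number of higher-scoring hits in a database of the given size."""
    return p_value * database_size


# Numbers from the slide: 10 higher-scoring hits among 10000 database hits.
p = p_value_from_hits(10, 10000)
print(p)

# The same P-value in a database twice as large doubles the E-value,
# which is the sense in which E depends on database size and P does not.
print(e_value_from_p(p, 20000))
```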

What goes wrong when Blast fails?

• Conventional sequence alignment uses a (Blosum) scoring matrix to identify amino acid matches in the two protein sequences
• This scoring matrix is identical at all positions in the protein sequence!

[Figure: EVVFIGDSLVQLMHQC compared against AGDS.GGGDS, with X marks between the two sequences]

Blosum scoring matrix

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

Alignment accuracy. Scoring functions

• Blosum62 score matrix. Fg=1. Ng=0?

• Score = 2+6+6+4-1 = 17
• Alignment

   L  A  G  D  S  D
F  0 -2 -3 -3 -2 -3
I  2 -1 -4 -3 -2 -3
G -4  0  6 -1  0 -1
D -4 -2 -1  6  0  6
S -2  1  0  0  4  0
L  4 -1 -4 -4 -2 -4

LAGDS
I-GDS
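The score of 17 can be reproduced with a small Python sketch that scores the alignment LAGDS over I-GDS column by column. It assumes that Fg is the penalty for the first position of a gap and Ng the penalty for each following gap position; that reading, the function name `score_alignment`, and the reduced score table (only the four residue pairs this alignment needs) are assumptions made for illustration.

```python
# Blosum62 scores for the residue pairs that occur in this alignment only.
BLOSUM62_SUBSET = {
    ("L", "I"): 2,
    ("G", "G"): 6,
    ("D", "D"): 6,
    ("S", "S"): 4,
}


def score_alignment(top: str, bottom: str, first_gap: int = 1, next_gap: int = 0) -> int:
    """Sum substitution scores and subtract gap penalties, column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # First gap position costs first_gap, later positions next_gap.
            score -= next_gap if in_gap else first_gap
            in_gap = True
        else:
            key = (a, b) if (a, b) in BLOSUM62_SUBSET else (b, a)
            score += BLOSUM62_SUBSET[key]
            in_gap = False
    return score


print(score_alignment("LAGDS", "I-GDS"))  # 2 - 1 + 6 + 6 + 4 = 17
```

This only scores a given alignment; finding the best-scoring alignment would need a dynamic programming search over the full matrix.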

When Blast works!

[Figure: 1PLC._ against 1PLB._]

When Blast fails!

[Figure: 1PLC._ against 1PMY._]

Sequence profiles

• In reality not all positions in a protein are equally likely to mutate

• Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high

• Other amino acids can mutate almost for free, and the penalty for a mismatch should be lower than the BLOSUM penalty

• Sequence profiles can capture these differences

Protein structure hierarchy

[Figure labels: Protein world, Protein fold, Protein superfamily, Protein family, New Fold]

Sequence profiles

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI
TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDGMERNTAGVP

Conserved: matching anything but G => large negative score

Non-conserved: anything can match
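As a rough illustration of what "conserved" means here, the Python sketch below counts residues per column of the eight-row alignment shown on this slide and lists the columns where every row carries the same residue. The split of the alignment into eight rows of 60 columns, and the names `column_counts` and `fully_conserved_columns`, are assumptions made for this example.

```python
from collections import Counter

# The eight aligned rows from the slide, assumed to be 60 columns each.
ALIGNMENT = [
    "ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN",
    "TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I",
    "-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I",
    "IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---",
    "-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V",
    "ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----",
    "TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI",
    "TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDGMERNTAGVP",
]


def column_counts(alignment):
    """Return one Counter of residues (and gaps) per alignment column."""
    return [Counter(column) for column in zip(*alignment)]


def fully_conserved_columns(alignment):
    """0-based indices of gap-free columns where all rows share one residue."""
    return [
        index
        for index, counts in enumerate(column_counts(alignment))
        if len(counts) == 1 and "-" not in counts
    ]


print(fully_conserved_columns(ALIGNMENT))
print(column_counts(ALIGNMENT)[19])  # column 20 holds G in every row
```

A profile turns such counts into position-specific scores: a column like the all-G one gets a high score for G and strongly negative scores for everything else.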

How to make sequence profiles

1. Align (BLAST) sequence against large sequence database (Swiss-Prot)

2. Select significant alignments and make profile (weight matrix) using techniques for sequence weighting and pseudo counts

3. Use weight matrix to align against sequence database to find new significant hits

4. Repeat 2 and 3 (normally 3 times!)
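The profile-building step can be sketched in Python as a log-odds weight matrix computed from aligned hits. This is a deliberately simplified sketch: it uses a flat pseudocount and a uniform background frequency, and it skips sequence weighting, whereas real Psi-Blast profiles use more refined schemes. The function name `build_pssm` and the toy hits are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        # Count residues only; gaps and unknown symbols are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


# Hypothetical aligned hits: column 2 is always G, column 1 varies freely.
toy_hits = ["AGDS", "VGDT", "IGES", "LGDS"]
pssm = build_pssm(toy_hits)

# The conserved G scores well above zero; a residue never seen there scores below zero.
print(round(pssm[1]["G"], 2), round(pssm[1]["A"], 2))
```

In an iterative search the matrix would then replace the Blosum matrix when scoring database sequences, and the new significant hits would be fed back into the next round.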

Protein world

[Figure labels: Blast iterations, Protein]

Blast2logo

Blast2logo

Blast2logo

Last position-specific scoring matrix computed

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
 2 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
 3 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
 4 A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
 5 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
 6 L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
 7 Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
 8 I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
 9 P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
10 E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
..

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

Blast2logo

Blast2logo

Blast2logo

Last position-specific scoring matrix computed

       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
 1 V  -2 -4 -4 -5 -2 -4 -4 -5 -4  5  2 -4  0 -1 -4 -3 -2 -4 -2  4
 2 A   5  0 -3 -3 -3 -2  1 -2 -3  0 -3 -2 -2 -4  0  0 -2 -4 -3  0
 3 L  -4 -5 -6 -6 -4 -5 -5 -6 -5  5  4 -5  1 -2 -5 -5 -3 -4  0  1
 4 A   1 -4 -1 -1  3 -1  2 -4 -3  0 -1 -2 -3  1 -4  0  0 -4  2  2
 5 E  -2  0 -2  6 -6  0  4 -4  2 -5 -5 -2 -5 -6 -4 -2  0 -6 -4 -5
 6 L  -1 -2 -4 -4 -4 -2 -1  2  3  3  2 -1  0 -2 -5 -1 -1 -5 -3  1
 7 Y  -4 -5 -5 -6 -4 -5 -5 -4  0  1  4 -5 -1  3 -5 -5 -4 -3  5  3
 8 I  -1 -2 -5 -5 -4 -5 -2 -6 -5  4  3 -5 -1  3 -5 -4 -2 -4 -1  3
 9 P   3 -4 -4 -3 -4  1  1 -4 -2 -2 -3 -2 -4 -5  6 -1  0 -5 -5 -2
10 E   2 -2 -3 -2 -3  0  1 -1 -3 -4 -3 -1 -1 -4  6 -2 -2 -4 -4 -3
...

Example.

>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL

• What is the function?
• Where is the active site?

When Blast fails!

[Figure: 1K7C.A against 1WAB._]

Profile-profile scoring matrix

[Figure: 1K7C.A against 1WAB._]

Example. (SGNH active site)

Example. Where is the active site?

• Sequence profiles might show you where to look!
• The active site could be around
  • S9, G42, N74, and H195
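A quick sanity check, sketched in Python, confirms that the residues named above sit at those 1-based positions in the 1K7C.A sequence from the example slide. The helper name `residue_at` is hypothetical; the sequence is copied from the slide and split into lines of 60 residues for readability.

```python
# 1K7C.A sequence from the example slide, 60 residues per line.
SEQ_1K7C_A = (
    "TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADV"
    "VTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKL"
    "FTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETL"
    "GNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL"
)


def residue_at(sequence: str, position: int) -> str:
    """Return the residue at a 1-based sequence position."""
    return sequence[position - 1]


# Candidate active-site residues named on the slide.
candidates = [("S", 9), ("G", 42), ("N", 74), ("H", 195)]
for expected, position in candidates:
    print(expected, position, residue_at(SEQ_1K7C_A, position) == expected)
```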

Example. Where is the active site?

Align using sequence profiles

ALN 1K7C.A 1WAB._ RMSD = 5.29522. 14% ID

1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
              S                                G                               N
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------

1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP

1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
                                       H
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L

Structural superposition

Blue: 1K7C.A
Red: 1WAB._

Where was the active site?

Rhamnogalacturonan acetylesterase (1k7c)

Using Iterative Blast

Using Iterative Blast

Using Iterative Blast (1st iteration)

Using Iterative Blast (3rd iteration)

Including structure

• Sequences within a protein superfamily share only remote sequence homology, but they share high structural homology
• Structure is known for template
• Predict structural properties for query
  – Secondary structure
  – Surface exposure
• Position specific gap penalties derived from secondary structure and surface exposure

Using structure

Sequence & structure profile-profile based alignments
– Template
  • Sequence based profiles
  • Annotated secondary structure
  • Predicted secondary structure
– Query
  • Sequence based profile
  • Predicted secondary structure
– Position specific gap penalties derived from secondary structure

Handout exercise

Using Psi-Blast Profiles

How good are we?

CpHModels-3.0
www.cbs.dtu.dk/services/CPHmodels-3.0/

CASP8 - Ranked as 15-20 best server

[Figure: plot with axis ticks 0 to 0.9 and 0 to 35; labels: Blast, Z-score, F4]

Why did we not win?

• Multiple template modeling
• First hit is not always the best
• Loop modeling
• ...

What are the different methods?

• Simple sequence based methods
  – Align (BLAST) sequence against sequence of proteins with known structure (PDB database)
• Sequence profile based methods
  – Align sequence profile (Psi-BLAST) against sequence of proteins with known structure (PDB, FUGUE)
  – Align sequence profile against profile of proteins with known structure (FFAS)
• Sequence and structure based methods
  – Align profile and predicted secondary structure against proteins with known structure (3D-PSSM, Phyre)
• Sequence profiles and structure based methods
  – HHpred, CpHModels
• Multiple template methods
  – Modeller (via HHpred, 3D-Jury)

Take home message

• Identifying the correct fold is only a small step towards successful homology modeling

• Do not trust % ID or alignment score to identify the fold. Use P-values

• You can do reliable fold recognition AND homology modeling even for low sequence homology

• Use sequence profiles and local protein structure to align sequences

CASP. Which are the best methods

• Critical Assessment of Structure Predictions
• Every second year
• Sequences from about-to-be-solved structures are given to groups who submit their predictions before the structure is published
• Modelers make predictions
• Meeting in December where correct answers are revealed

CASP6 results

The top 4 homology modeling groups in CASP6

• All winners use consensus predictions
  – The wisdom of the crowd
• Same approach as in earlier CASPs

The Wisdom of the Crowds

The Wisdom of Crowds: Why the Many Are Smarter Than the Few. James Surowiecki

One day in the fall of 1906, the British scientist Francis Galton left his home and headed for a country fair… He believed that only a very few people had the characteristics necessary to keep societies healthy. He had devoted much of his career to measuring those characteristics, in fact, in order to prove that the vast majority of people did not have them. … Galton came across a weight-judging competition… Eight hundred people tried their luck. They were a diverse lot, butchers, farmers, clerks and many other non-experts… The crowd had guessed … 1,197 pounds; the ox weighed 1,198.

The wisdom of the crowd!

– The highest scoring hit will often be wrong
  • Not one single prediction method is consistently best
– Many prediction methods will have the correct fold among the top 10-20 hits
– If many different prediction methods all have a common fold among the top hits, this fold is probably correct

3D-Jury

Inspired by ab initio modeling methods
– Average of frequently obtained low energy structures is often closer to the native structure than the lowest energy structure

Find most abundant high scoring model in a list of predictions from several predictors
– Use output from a set of servers
– Superimpose all pairs of structures
– Similarity score $S_{ij}$ = # of Cα pairs within 3.5 Å (if # > 40; else $S_{ij} = 0$)
– 3D-Jury score = $\sum_j S_{ij}/(N+1)$

Similar methods developed by A. Elofsson (Pcons) and D. Fischer (3D shotgun)
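The scoring rule on this slide can be sketched in Python as follows. The sketch assumes that N is the number of models being compared and that the sum runs over all other models j; both are readings of the slide's formula rather than confirmed details of the 3D-Jury server. The function name `jury_scores` and the pair counts are hypothetical, and the structural superposition that would produce those counts is not shown.

```python
def jury_scores(pair_counts, min_pairs=40):
    """Score each model by its summed similarity to all other models.

    pair_counts[i][j] is the number of C-alpha pairs within 3.5 Angstrom
    after superimposing models i and j; counts of min_pairs or fewer are
    treated as zero similarity.
    """
    n_models = len(pair_counts)
    scores = []
    for i in range(n_models):
        total = 0
        for j in range(n_models):
            if i == j:
                continue  # a model is not compared with itself
            count = pair_counts[i][j]
            total += count if count > min_pairs else 0
        scores.append(total / (n_models + 1))
    return scores


# Hypothetical pair counts for three models from different servers.
counts = [
    [0, 50, 30],
    [50, 0, 60],
    [30, 60, 0],
]
scores = jury_scores(counts)
best_model = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_model)  # [12.5, 27.5, 15.0] 1
```

The model that resembles most of the others wins, even if no single server ranked it first.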

How to do it? Where is the crowd

• Meta prediction server
  – Web interface to a list of public protein structure prediction servers
  – Submit query sequence to all selected servers in one go

http://bioinfo.pl/meta/

Meta Server

Evaluating the crowd.

Meta Server

Evaluating the crowd. 3D Jury

Take home message

• Identifying the correct fold is only a small step towards successful homology modeling

• Do not trust % ID or alignment score to identify the fold. Use p-values

• Use sequence profiles and local protein structure to align sequences

• Do not trust one single prediction method, use consensus methods (3D Jury)