Protein structure and homology modeling Morten Nielsen, CBS, BioCentrum, DTU

Post on 09-Jan-2016


Page 1: Protein structure and homology modeling

Protein structure and homology modeling

Morten Nielsen, CBS, BioCentrum, DTU

Page 2: Protein structure and homology modeling

Objectives

• Understand the basic concepts of homology modeling

• Learn why even sequences with very low sequence similarity can be modeled

– Understand why %id is such a terrible measure of reliability

• See the beauty of sequence profiles?

• Learn where to find the best public methods

Page 3: Protein structure and homology modeling

Outline

• Why homology modeling?

• How is it done?

• How to decide when to use homology modeling

– Why is %id such a terrible measure?

• What are the best methods?

• Models in immunology

Page 4: Protein structure and homology modeling

Why protein modeling?

• Because it works!

– Close to 50% of all new sequences can be homology modeled

• Experimental effort to determine protein structure is very large and costly

• The gap between the size of the protein sequence data and protein structure data is large and increasing

Page 5: Protein structure and homology modeling

Homology modeling and the human genome

Page 6: Protein structure and homology modeling

Swiss-Prot database

~200,000 sequences in Swiss-Prot

~2,000,000 if TrEMBL is included

Page 7: Protein structure and homology modeling

PDB New Fold Growth

• The number of unique folds in nature is fairly small (possibly a few thousand)

• 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

[Figure: new PDB structures, divided into new folds and old folds]

Page 8: Protein structure and homology modeling

Identification of fold

If sequence similarity is high, proteins share structure (Safe zone)

If sequence similarity is low, proteins may still share structure (Twilight zone)

Most proteins do not have a homologous partner with high sequence similarity

Rajesh Nair & Burkhard Rost, Protein Science, 2002, 11, 2836-47

Page 9: Protein structure and homology modeling

Why %id is so bad!!

1200 models sharing 25-95% sequence identity with the submitted sequences (www.expasy.ch/swissmod)

Page 10: Protein structure and homology modeling

Identification of correct fold

• % ID is a poor measure

– Many evolutionarily related proteins share low sequence identity

• Alignment score is even worse

– Many sequences will score high against everything (hydrophobic stretches)

• P-value or E-value is more reliable
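A minimal Python sketch of how % ID is commonly computed from a pairwise alignment, to make concrete how little the number captures. The function name percent_identity and the choice to count only columns without gaps are assumptions made for illustration; other conventions (for example dividing by the full alignment length) give different values. The two aligned fragments are the 1K7C.A and 1WAB._ fragments that appear later in this transcript.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Aligned fragments of 1K7C.A (query) and 1WAB._ (template) from the slides.
query = "GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF"
template = "GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY"
print(f"%ID over ungapped columns: {percent_identity(query, template):.1f}")
```

The value says nothing about which positions match, so a few conserved active-site residues and a few chance matches look the same.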

Page 11: Protein structure and homology modeling

What are P and E values?

• E-value

– Expected number of hits in the database with a score higher than the match

– Depends on database size

• P-value

– Probability that a random hit will have a score higher than the match

– Database size independent

[Figure: score distribution, P(Score) plotted against Score]

Score = 150

10 hits with higher score (E = 10)

10000 hits in database => P = 10/10000 = 0.001
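The arithmetic on this slide can be written as a short Python sketch. It follows the slide's counting definitions only: the function name e_and_p_value and the toy score list are assumptions for illustration, and real search tools such as BLAST estimate E-values from a statistical model of the score distribution rather than by counting hits.

```python
def e_and_p_value(match_score, database_scores):
    """Empirical E- and P-value, following the counting used on the slide."""
    # E: number of database hits scoring higher than the match.
    e_value = sum(1 for s in database_scores if s > match_score)
    # P: the same count divided by the database size.
    p_value = e_value / len(database_scores)
    return e_value, p_value


# Toy database (assumed): 10000 scores, 10 of them above the match score 150.
scores = [200] * 10 + [50] * 9990
e, p = e_and_p_value(150, scores)
print(e, p)
```

Doubling the database with the same score distribution doubles E but leaves P unchanged, which is the sense in which the P-value is independent of database size.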

Page 12: Protein structure and homology modeling

How to do it

Identify fold (template) for modeling

– Find the structure in the PDB database that resembles your new protein the most

– Can be used to predict function

Align protein sequence to template

– Simple alignment methods

– Sequence profiles

– Threading methods

– Pseudo force fields

Model side chains and loops

Page 13: Protein structure and homology modeling

Template identification

Simple sequence-based methods

– Align (BLAST) sequence against sequences of proteins with known structure (PDB database)

Sequence profile-based methods

– Align sequence profile (PSI-BLAST) against sequences of proteins with known structure (PDB)

– Align sequence profile against profiles of proteins with known structure (FFAS)

Sequence and structure-based methods

– Align profile and predicted secondary structure against proteins with known structure (3D-PSSM)

Page 14: Protein structure and homology modeling

Sequence profiles

In conventional alignment, a scoring matrix (BLOSUM62) gives the score for matching two amino acids

– In reality, not all positions in a protein are equally likely to mutate

– Some amino acids (active sites) are highly conserved, and the penalty for a mismatch must be very high

– Other amino acids can mutate almost for free, and the penalty for a mismatch is lower than the BLOSUM score suggests

Sequence profiles can capture these differences

Page 15: Protein structure and homology modeling

Protein world

Protein fold

Protein structure classification

Protein superfamily

Protein family

New Fold

Page 16: Protein structure and homology modeling

ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN
TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I
-TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I
IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD---
-TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V
ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE----
TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

Sequence profiles

Conserved

Non-conserved

Matching anything but G => large negative score

Anything can match

TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP

TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI

Page 17: Protein structure and homology modeling

Sequence profiles

1. Align (BLAST) sequence against a large sequence database (Swiss-Prot)

2. Select significant alignments and make a profile (weight matrix) using techniques for sequence weighting and pseudo counts

3. Use the weight matrix to align against the sequence database to find new significant hits

4. Repeat steps 2 and 3 (normally 3 times!)
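The profile-building step can be sketched in Python as follows. This is only an illustration under stated assumptions: a uniform background frequency, a fixed pseudo count of 1 per amino acid, no sequence weighting, and a made-up four-sequence alignment. The names build_profile and score_sequence are hypothetical, and the scheme used by PSI-BLAST itself is more elaborate.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (amino acid -> score) per column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for col in range(len(alignment[0])):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column_scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column_scores[aa] = math.log2(freq / background)
        profile.append(column_scores)
    return profile


def score_sequence(profile, sequence):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(col[aa] for col, aa in zip(profile, sequence))


# Toy alignment (assumed): column 0 is fully conserved, the rest vary.
alignment = ["GAVK", "GLIR", "GAVE", "GSLD"]
profile = build_profile(alignment)
print(round(profile[0]["G"], 2), round(profile[0]["A"], 2))
print(score_sequence(profile, "GAVK") > score_sequence(profile, "WAVK"))
```

In the conserved column G scores positive and every other residue scores negative, while the variable columns give smaller differences between residues, which is the position-specific behaviour a single BLOSUM matrix cannot express.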

Page 18: Protein structure and homology modeling

Example.

>1K7C.A
TTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL

• What is the function?

• Where is the active site?

Page 19: Protein structure and homology modeling

Example.

• Function

– Run BLAST against PDB: no significant hits

– Run BLAST against NR (sequence database): function is acetylesterase?

• Where is the active site?

Page 20: Protein structure and homology modeling

Example. Where is the active site?

1WAB Acetylhydrolase

1G66 Acetylxylan esterase

1USW Hydrolase

Page 21: Protein structure and homology modeling

Example. Where is the active site?

• Align sequence against structures of known acetylesterases, like 1WAB, 1FXW, …

• Cannot be aligned. Too low sequence similarity

1K7C.A 1WAB._ RMSD 11.2397
QAL 1K7C.A  71 GHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTF
DAL 1WAB._ 160 GHPRAHFLDADPGFVHSDGTISH--HDMYDYLHLSRLGY

Page 22: Protein structure and homology modeling

Example. Where is the active site?

• Sequence profiles might show you where to look!

• The active site could be around S9, G42, N74, and H195

Page 23: Protein structure and homology modeling

Example. Where is the active site?

Align using sequence profiles

ALN 1K7C.A 1WAB._ RMSD = 5.29522

1K7C.A TVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDN
              S                                G                               N
1WAB._ EVVFIGDSLVQLMHQCE---IWRELFS---PLHALNFGIGGDSTQHVLW--RLENGELEHIRPKIVVVWVGTNNHG------

1K7C.A GRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAK--GAKVILSSQTPNNPWETGTFVNSPTRFVEYAEL-AAEVA
1WAB._ ---------------------HTAEQVTGGIKAIVQLVNERQPQARVVVLGLLPRGQ-HPNPLREKNRRVNELVRAALAGHP

1K7C.A GVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSL
                                       H
1WAB._ RAHFLDADPG---FVHSDG--TISHHDMYDYLHLSRLGYTPVCRALHSLLLRL---L

Page 24: Protein structure and homology modeling

Structural superposition

Blue: 1K7C.A

Red: 1WAB._

Page 25: Protein structure and homology modeling

Where was the active site?

Rhamnogalacturonan acetylesterase (1k7c)

Page 26: Protein structure and homology modeling

Including structure

• Sequences within a protein superfamily share only remote sequence homology, but they share high structural homology

• Structure is known for template

• Predict structural properties for query

– Secondary structure

– Surface exposure

• Position-specific gap penalties derived from secondary structure and surface exposure

Page 27: Protein structure and homology modeling

Structure biased alignment (3D-PSSM)

http://www.sbg.bio.ic.ac.uk/~3dpssm/

Page 28: Protein structure and homology modeling

CASP. Which are the best methods?

• Critical Assessment of protein Structure Prediction

• Every second year

• Sequences from about-to-be-solved structures are given to groups who submit their predictions before the structure is published

• Modelers make predictions

• Meeting in December where correct answers are revealed

Page 29: Protein structure and homology modeling

CASP6 results

Page 30: Protein structure and homology modeling

The top 4 homology modeling groups in CASP6

• All winners use consensus predictions

– The wisdom of the crowd

• Same approach as in CASP5!

• Nothing has happened in 2 years!

Page 31: Protein structure and homology modeling

The Wisdom of the Crowds

The Wisdom of Crowds. Why the Many are Smarter than the Few. James Surowiecki

One day in the fall of 1906, the British scientist Francis Galton left his home and headed for a country fair… He believed that only a very few people had the characteristics necessary to keep societies healthy. He had devoted much of his career to measuring those characteristics, in fact, in order to prove that the vast majority of people did not have them. … Galton came across a weight-judging competition… Eight hundred people tried their luck. They were a diverse lot, butchers, farmers, clerks and many other non-experts… The crowd had guessed … 1,197 pounds; the ox weighed 1,198.

Page 32: Protein structure and homology modeling

The wisdom of the crowd!

– The highest scoring hit will often be wrong

• Not one single prediction method is consistently best

– Many prediction methods will have the correct fold among the top 10-20 hits

– If many different prediction methods all have a common fold among the top hits, this fold is probably correct
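The voting idea can be sketched in a few lines of Python. Everything named here is hypothetical: the method names, the fold labels and the function consensus_fold are placeholders, and each method is assumed to return a ranked list of fold identifiers.

```python
from collections import Counter


def consensus_fold(top_hits_by_method, top_n=10):
    """Rank folds by how many methods list them among their top_n hits."""
    votes = Counter()
    for hits in top_hits_by_method.values():
        # Each method votes at most once per fold.
        for fold in set(hits[:top_n]):
            votes[fold] += 1
    return votes.most_common()


# Toy ranked hit lists (assumed). Note that the top hit differs per method.
top_hits = {
    "method_1": ["fold_A", "fold_B", "fold_C"],
    "method_2": ["fold_D", "fold_B", "fold_E"],
    "method_3": ["fold_B", "fold_F", "fold_A"],
}
ranked = consensus_fold(top_hits)
print(ranked[0])
```

Here only one method ranks fold_B first, yet it is the only fold that all three methods place among their top hits, so the consensus picks it.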

Page 33: Protein structure and homology modeling

3D-Jury (Best group)

Inspired by ab initio modeling methods

– The average of frequently obtained low-energy structures is often closer to the native structure than the lowest-energy structure

Find the most abundant high-scoring model in a list of predictions from several predictors

– Use output from a set of servers

– Superimpose all pairs of structures

– Similarity score Sij = number of Cα pairs within 3.5 Å (if number > 40; else Sij = 0)

– 3D-Jury score of model i = Σj Sij / (N+1)

Similar methods were developed by A. Elofsson (Pcons) and D. Fischer (3D-SHOTGUN)
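A small Python sketch of the scoring step, under several assumptions: the structural superposition is not performed here, so the matrix of Cα pair counts is taken as given; N is taken to be the number of models; and the score of a model is read as the sum of its similarities to all other models. The function name jury_scores and the toy numbers are hypothetical.

```python
def jury_scores(pair_counts, min_pairs=40):
    """Score each model by its thresholded similarity to all other models.

    pair_counts[i][j] is the number of C-alpha pairs within 3.5 A after
    superimposing models i and j (assumed to be computed elsewhere).
    """
    n = len(pair_counts)
    scores = []
    for i in range(n):
        total = 0
        for j in range(n):
            if i == j:
                continue
            # Similarities at or below the threshold count as zero.
            if pair_counts[i][j] > min_pairs:
                total += pair_counts[i][j]
        scores.append(total / (n + 1))
    return scores


# Toy symmetric matrix (assumed): models 0-2 resemble each other, model 3 is an outlier.
pair_counts = [
    [0, 90, 85, 20],
    [90, 0, 80, 30],
    [85, 80, 0, 10],
    [20, 30, 10, 0],
]
scores = jury_scores(pair_counts)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

The outlier model scores zero however confident its own server was, because no other prediction supports it.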

Page 34: Protein structure and homology modeling

How to do it? Where is the crowd?

• Meta prediction server

– Web interface to a list of public protein structure prediction servers

– Submit query sequence to all selected servers in one go

http://bioinfo.pl/meta/

Page 35: Protein structure and homology modeling
Page 36: Protein structure and homology modeling

Meta Server

Evaluating the crowd.

Page 37: Protein structure and homology modeling

Meta Server

Evaluating the crowd. 3D Jury

Page 38: Protein structure and homology modeling

From fold to structure

Flying to the moon has not made man conquer space

Finding the right fold does not allow you to make accurate protein models

– Can allow prediction of protein function

Alignment is still a very hard problem

– Most protein interactions are determined by the loops, and they are the least conserved parts of a protein structure

Page 39: Protein structure and homology modeling

Ab initio protein modeling

Modeling of new-fold proteins

• Only when everything else fails

• Challenge

• Close to impossible to model Nature's folding potential

Page 40: Protein structure and homology modeling

Challenge. Folding potential

• New folds are in general constructed from a set of subunits, where each subunit is part of a known fold.

• The subunits are small compared to the overall fold of the protein. No objective function exists to guide the global packing of the subunits.

[Figure labels: Objective function; dij = 6 Å; sij = 120 aa]

Page 41: Protein structure and homology modeling

A way to a solution

• Glue the structure together piecewise from fragments.

• Guide the process by an empirical/statistical potential

[Figure labels: Fragments with correct local structure; Nature's potential; Empirical potential]

Page 42: Protein structure and homology modeling

Example (Rosetta web server)

Rosetta prediction

Structure

www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php

Page 43: Protein structure and homology modeling

Take home message

• Identifying the correct fold is only a small step towards successful homology modeling

• Do not trust % ID or alignment score to identify the fold. Use p-values

• Use sequence profiles and local protein structure to align sequences

• Do not trust one single prediction method, use consensus methods (3D Jury)

• Only if everything else fails, use ab initio methods