Prediction of Protein Structure in 1D


Upload: arnaud

Post on 23-Feb-2016


Page 1: Prediction of Protein Structure in 1D

2° structure, TM regions, and solvent accessibility

Topic 13. Chapter 29, Gu and Bourne, “Structural Bioinformatics”

Page 2: Prediction of Protein Structure in 1D

The Truth (Information) is Out (In) There

Page 3: Prediction of Protein Structure in 1D

The Truth (Information) is Out (In) There

But we’re still having a tough time finding it.

Page 4: Prediction of Protein Structure in 1D

Given a protein sequence (primary structure), predict its secondary structures

GHWIATRGQLIREAYEDYRHFSSECPFIP

CEEEEECCCEEEEECCCHHHHHHCCCCCC

E: β-strand
H: α-helix
C: coil

Assumption: short stretches of residues have a propensity to adopt a certain conformation ⇒ the conformation of the central residue in a sequence fragment depends only on its flanking residues (sliding window).

Protein Secondary Structure Prediction

H: (H: α-helix, G: 3₁₀ helix, I: π-helix)
E: (E: β-strand, B: bridge)
C: (T: β-turn, S: bend, C: coil)
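The grouping of the eight assignment states into three classes can be sketched as a simple lookup. This is a minimal illustration only: the function name `reduce_states` is hypothetical, and mapping unknown symbols to coil is an assumption, not something stated on the slide.

```python
# Eight-state to three-state reduction, following the grouping on this slide.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helices
    "E": "E", "B": "E",             # strand and bridge
    "T": "C", "S": "C", "C": "C",   # turn, bend, coil
}


def reduce_states(ss8: str) -> str:
    """Map an eight-state string to the three classes H, E, C.

    Assumption: any symbol not in the table is treated as coil.
    """
    return "".join(EIGHT_TO_THREE.get(state, "C") for state in ss8)


if __name__ == "__main__":
    print(reduce_states("HGIEBTSC"))
```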

Page 5: Prediction of Protein Structure in 1D

-- Because we can (kind of).
-- Because it could be a first step towards prediction of protein tertiary structure.

Why secondary structure prediction?

“Have solution, need problem.” Nearly every imaginable algorithm has been applied to secondary structure prediction.

Page 6: Prediction of Protein Structure in 1D

1. First generation: single amino acid propensities
   Chou-Fasman method (1974), GOR I-IV
   ~56-60% accuracy

2. Second generation: segments of 3-51 adjacent residues
   NNSSP, SSPAL
   ~65% accuracy

Third generation methods, using evolutionary information (~76% accuracy):

3. Neural network: PHD, Psi-Pred, J-Pred

4. Support vector machine (SVM)

5. Hidden Markov Models (HMM)

Secondary Structure Prediction Methods

Page 7: Prediction of Protein Structure in 1D

1. Three-state per-residue prediction accuracy:

$$Q_3 = \frac{100}{N_{\mathrm{obs}}} \sum_{i=1}^{3} M_{ii}$$

$M_{ii}$: number of residues observed in state $i$ and predicted in state $i$
$N_{\mathrm{obs}}$: the total number of residues observed in the 3 states

Secondary Structure Prediction Accuracy

2. Per-segment prediction accuracy (SOV, Segment of OVerlap)
Per-state segment overlap:

S1: observed SS segment
S2: predicted SS segment
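The three-state per-residue accuracy on this slide (the percentage of residues whose predicted state equals the observed state) can be sketched as follows. The function name `q3` and the decision to count only residues observed as H, E, or C are illustrative assumptions.

```python
def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted state matches the observed state.

    Only residues observed in one of the three states H, E, C are counted,
    so the denominator is the number of residues observed in those states.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    pairs = [(o, p) for o, p in zip(observed, predicted) if o in "HEC"]
    if not pairs:
        raise ValueError("no residues observed in states H, E, or C")
    correct = sum(1 for o, p in pairs if o == p)   # sum of the diagonal M_ii
    return 100.0 * correct / len(pairs)


if __name__ == "__main__":
    print(q3("HHHHEEEECC", "HHHCEEEECC"))
```

Per-residue accuracy says nothing about whether whole segments are recovered, which is why the slide lists the segment-overlap measure separately.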

Page 8: Prediction of Protein Structure in 1D

Calculate the propensity for a given amino acid to adopt a certain ss-type

$$P_{\alpha i} = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$$

$i$: amino acid; $\alpha$: secondary structure state

Example: from a data set with 30 proteins
#Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 580
$p(\alpha, aa) = 580/20{,}000$, $p(\alpha) = 4{,}000/20{,}000$, $p(aa) = 2{,}000/20{,}000$

$P = 580 / (4{,}000/10) = 1.45$

Single Residue Propensity Methods
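The propensity calculation can be reproduced from raw counts. The sketch below uses the Ala/helix counts from the example on this slide; the function and parameter names are hypothetical.

```python
def propensity(n_aa_in_state: int, n_state: int, n_aa: int, n_total: int) -> float:
    """Propensity of an amino acid for a state: p(state, aa) / (p(state) * p(aa))."""
    p_joint = n_aa_in_state / n_total   # p(state, aa)
    p_state = n_state / n_total         # p(state)
    p_aa = n_aa / n_total               # p(aa)
    return p_joint / (p_state * p_aa)


if __name__ == "__main__":
    # Counts from the slide: 580 Ala in helix, 4,000 helix residues,
    # 2,000 Ala, 20,000 residues in total.
    print(round(propensity(580, 4000, 2000, 20000), 2))
```

A value above 1 means the residue is found in that state more often than expected by chance, and a value below 1 means less often.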

Page 9: Prediction of Protein Structure in 1D

Amino Acid Propensities to Secondary Structures

Chou-Fasman method

Residue:   T    S    P    T    A    E    L    M    R    S    T    G
P(H):     69   77   57   69  142  151  121  145   98   77   69   57
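As a rough sketch of how such propensities are used, the code below scans the example peptide for helix nucleation sites. The P(H) values are the ones shown on this slide (scaled by 100). The nucleation rule, at least 4 of 6 consecutive residues with P(H) above 100, is the commonly cited form of the Chou-Fasman rule and is an assumption here, as are the function and constant names; the full method also includes extension and strand rules that are not shown.

```python
# Helix propensities (x100) for the residues that appear on this slide.
P_HELIX = {"T": 69, "S": 77, "P": 57, "A": 142, "E": 151,
           "L": 121, "M": 145, "R": 98, "G": 57}

WINDOW = 6        # assumed nucleation window
MIN_FORMERS = 4   # assumed minimum number of helix formers per window
THRESHOLD = 100   # assumed cutoff for a residue to count as a helix former


def helix_nuclei(sequence: str) -> list[int]:
    """Return 0-based start indices of windows that satisfy the nucleation rule."""
    starts = []
    for start in range(len(sequence) - WINDOW + 1):
        window = sequence[start:start + WINDOW]
        formers = sum(1 for residue in window if P_HELIX[residue] > THRESHOLD)
        if formers >= MIN_FORMERS:
            starts.append(start)
    return starts


if __name__ == "__main__":
    print(helix_nuclei("TSPTAELMRSTG"))
```

For the peptide on the slide, the qualifying windows are the ones that cover the A-E-L-M stretch.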

Page 10: Prediction of Protein Structure in 1D

* The idea is simple: predict SS of the central residue of a given segment from homologous segments (neighbors).

For example, from a database, find some number of the closest sequences to a subsequence defined by a window around the central residue, then use max(Nα, Nβ, Nc) to assign the SS.

Nearest Neighbor Methods

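[Figure: a window around the central residue of RSTEVRASRQLAKEKVN is compared with homologous sequences; the neighbors' states (E, C, C, H, H, C, C) give the assignment C.]

Key parameters:
1. How to define similarity?
2. What size window of sequence should be examined?
3. How many close sequences should be selected?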
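The nearest-neighbor idea can be sketched in a few lines. Everything specific here is an assumption chosen for illustration: similarity is the number of identical positions, the database is a toy list of (segment, central-residue state) pairs, and `k` is the number of neighbors that vote. Real methods use substitution or profile-based scores and much larger databases.

```python
from collections import Counter


def similarity(a: str, b: str) -> int:
    """Toy similarity: number of positions with identical residues."""
    return sum(x == y for x, y in zip(a, b))


def predict_center(window: str, database: list[tuple[str, str]], k: int = 3) -> str:
    """Predict the state of the central residue of `window` by majority vote
    among the k most similar database segments."""
    ranked = sorted(database, key=lambda entry: similarity(window, entry[0]),
                    reverse=True)
    votes = Counter(state for _, state in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical database of 5-residue segments with known central states.
    toy_db = [("AAAAA", "H"), ("AAAAG", "H"), ("GGGGG", "C"), ("VVVVV", "E")]
    print(predict_center("AAAAA", toy_db, k=3))
```

The three key parameters from the slide map directly onto `similarity`, the window length, and `k`.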

Page 11: Prediction of Protein Structure in 1D

The Devil is in the details…

Page 12: Prediction of Protein Structure in 1D

D. Jones, J. Mol. Biol. 292, 195 (1999).

Method: neural network
Input data: PSSM generated by PSI-BLAST

Bigger and better sequence database
-- Combining several databases and data filtering

Training and test set preparation
-- SS prediction only makes sense for proteins with no homologous structure.
-- No sequence or structural homologues between training and test sets, checked by CATH and PSI-BLAST (mimicking a realistic situation).

Psi-Pred Method

Page 13: Prediction of Protein Structure in 1D

Window size = 15
Two networks

First network (sequence-to-structure):
-- 315 = (20 + 1) × 15 inputs
-- Extra unit to indicate where the window spans either the N or C terminus
-- Data are scaled to the [0, 1] range by using 1/[1 + exp(-x)]
-- 75 hidden units
-- 3 outputs (H, E, L)

Second network (structure-to-structure):
-- Structural correlation between adjacent sequences
-- 60 = (3 + 1) × 15 inputs
-- 60 hidden units
-- 3 outputs

Accuracy ~76%

Psi-Pred Method--Neural Network
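The input layer of the first network can be illustrated by building the 315-element vector for one window. This is a sketch under stated assumptions, not the PSIPRED source code: the PSSM is taken to be a list of 20-score rows, window positions beyond either terminus are encoded as 20 zeros plus a flag of 1.0, and the function names are hypothetical.

```python
import math

WINDOW = 15          # window size from the slide
HALF = WINDOW // 2   # 7 residues on each side of the central residue
N_SCORES = 20        # one PSSM score per amino acid type


def logistic(x: float) -> float:
    """Scale a PSSM score to the range 0-1 using 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def encode_window(pssm: list[list[float]], center: int) -> list[float]:
    """Build the (20 + 1) x 15 = 315 inputs for the residue at index `center`."""
    features: list[float] = []
    for pos in range(center - HALF, center + HALF + 1):
        if 0 <= pos < len(pssm):
            features.extend(logistic(score) for score in pssm[pos])
            features.append(0.0)   # inside the chain
        else:
            features.extend([0.0] * N_SCORES)
            features.append(1.0)   # window extends past the N or C terminus
    return features


if __name__ == "__main__":
    toy_pssm = [[0.0] * N_SCORES for _ in range(30)]   # hypothetical all-zero PSSM
    print(len(encode_window(toy_pssm, 0)))
```

The second network's 60 inputs follow the same pattern, with the three output values of the first network replacing the 20 PSSM scores at each window position.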

Page 14: Prediction of Protein Structure in 1D

Conf: Confidence (0=low, 9=high) ---very important!!!!
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
  AA: Target sequence

# PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)

Conf: 966899999997542002357777557999999716898188034435788873356776
Pred: CCHHHHHHHHHHHHHHHCCCCCCCHHHHHHHHHHHCCCCCEEECCCCEEEEEEECCCCCC
  AA: MMWEQFKKEKLRGYLEAKNQRKVDFDIVELLDLINSFDDFVTLSSCSGRIAVVDLEKPGD
              10        20        30        40        50        60

Conf: 777179998337888888988751235636899718261220179868899999998557
Pred: CCCCEEEEEECCCCCHHHHHHHHHCCCCCEEEEECCCEEEEECCCHHHHHHHHHHHHHCC
  AA: KASSLFLGKWHEGVEVSEVAEAALRSRKVAWLIQYPPIIHVACRNIGAAKLLMNAANTAG
              70        80        90       100       110       120

Conf: 200242314703799714651435541487355188999999999999999889999999
Pred: CCCCCCEECCCEEEEEECCCEEEEEECCCCCEEECHHHHHHHHHHHHHHHHHHHHHHHHH
  AA: FRRSGVISLSNYVVEIASLERIELPVAEKGLMLVDDAYLSYVVRWANEKLLKGKEKLGRL
             130       140       150       160       170       180

Sample Psi-Pred Output

***Compare the prediction for residues 9 and 17***
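To make the comparison concrete, the sketch below reads the Conf, Pred, and AA lines and looks up individual residues. The parser is a minimal illustration written for this layout (the function names are hypothetical), and `SAMPLE` holds only the first 20 residues of the output shown on this slide.

```python
SAMPLE = """\
Conf: 96689999999754200235
Pred: CCHHHHHHHHHHHHHHHCCC
  AA: MMWEQFKKEKLRGYLEAKNQ
"""


def parse_horiz(text: str) -> tuple[str, str, str]:
    """Concatenate the Conf, Pred, and AA lines of a horizontal-format output."""
    parts = {"Conf:": [], "Pred:": [], "AA:": []}
    for line in text.splitlines():
        stripped = line.strip()
        for tag, chunks in parts.items():
            if stripped.startswith(tag):
                chunks.append(stripped[len(tag):].strip())
    return "".join(parts["Conf:"]), "".join(parts["Pred:"]), "".join(parts["AA:"])


def residue_call(text: str, number: int) -> tuple[str, str, int]:
    """Return (amino acid, predicted state, confidence) for a 1-based residue number."""
    conf, pred, aa = parse_horiz(text)
    index = number - 1
    return aa[index], pred[index], int(conf[index])


if __name__ == "__main__":
    for number in (9, 17):
        print(number, residue_call(SAMPLE, number))
```

Residue 9 carries the highest confidence digit and residue 17 the lowest, which is the contrast the slide asks the reader to notice.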

Page 15: Prediction of Protein Structure in 1D

Sample Psi-Pred Output-II

Page 16: Prediction of Protein Structure in 1D

Again, voting rules methods tend to be best

ATKAVCVLKGDGPVQGTIHFEAKGDTVVVTGSITGLTEGDHGFHVHQFGDNTQGCTSAGP 2SOD
CCCCCCCCCCCCCCCCEEHCCHHECEEEEEEEEEEEECCCCCCCCCCCCCCCCCCCCCCC BPS
CCHEEEEECCCCCCCCEEEHHHCCCEEEEEEEEECECCCCCCEEEECCCCCCCCCCCCCC D_R
CCCEEEEEECCCCCEEEEEEEECCCEEEEEEEEEEEECCCCCEEEEECCCCCCCCCCCCC DSC
CCCEEEEECCCCCCCEEEEEECCCCEEEEEEEEECCCCCCCCEEEEEECCCCCCCCCCCC GGR
HHHCEEEECCCCCCCEEEEEECCCCEEEEEECEEEEEECCCCEEEEECCCCCCEEECCCC GOR
CCCCEEEECCCCCCCCCEEECCCCCCEEEEECEEECCCCCCCEEEECCCCCCCCEEECCC H_K
CCCCEEEEECCCCCCCCCEEECCCCCEEEECCCCCCCCCCCEEEEEEEECCCCCCCCCCC K_S
CCCCEEEECCCCCCCCEEEEECCCCEEEEEEEEEEECCCCCCEEEEECCCCCCCCCCCCC JOI
---EEEEE------EEEEEEEEE--EEEEEEEEE-----EEEEEEEE------------- 2SOD

HFNPLSKKHGGPKDEERHVGDLGNVTADKNGVAIVDIVDPLISLSGEYSIIGRTMVVHEK 2SOD
CCCCCCCCCCCCCCCCCCCCCCECCCCCCHEECCCCCCCCCECCEECEEEEEEEEEEECC BPS
CCCCCCCCCCCCCCCHHCECCCCCECCCCCCEEEEEEECCEEEECCCEEEEEEEEEEECC D_R
CCCCCCCCCCCCCCEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCCEEEEEEEEEEECC DSC
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEECCCCCCCCCCEEEECEEEEEECC GGR
CCCCCCCCCCCCCCHHEEECCCCCCCCCCCCEEEEEEECCEEECCCCEEEEEEEEEECCC GOR
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCEECCCCCCCCCCCCCCHHHHHHEECCC H_K
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCCEEEEEEEEEEEEECCCEEECCEEEEEEE K_S
CCCCCCCCCCCCCCCCEEECCCCCCCCCCCCEEEEEECCCCECCCCCEEEEEEEEEEECC JOI
--------------------EEEEEE------EEEEEEE--------------EEEEE-- 2SOD
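A voting rule can be sketched as a per-position majority vote over several predictions of the same sequence. This is an illustration of the general idea only; it is not claimed to be the rule behind the combined row in the alignment above, and the tie-break to coil is an assumption.

```python
from collections import Counter


def consensus(predictions: list[str], tie_break: str = "C") -> str:
    """Combine equal-length prediction strings by majority vote per position.

    Assumption: when two or more states tie for the most votes,
    the position is assigned `tie_break` (coil by default).
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column)
        top = max(counts.values())
        winners = [state for state, n in counts.items() if n == top]
        result.append(winners[0] if len(winners) == 1 else tie_break)
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["HHEC", "HCEC", "CCEE"]))
```

Voting tends to cancel errors that individual methods make at different positions, which is one reason combined predictors do well.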

Page 17: Prediction of Protein Structure in 1D

[Figure: distribution of per-protein accuracy for PSIPRED, SSpro, PROF, PHDpsi, JPred2, and PHD. x-axis: percentage of correctly predicted residues per protein (30-100); y-axis: percentage of all 150 proteins (0-25).]

Prediction Accuracy (EVA)

EVA: Automatic evaluation of prediction servers

Page 18: Prediction of Protein Structure in 1D

Currently ~76%

Proteins with more than 100 homologues: 80%

Assignment is ambiguous (5-15%). Recall DSSP vs. STRIDE.
-- Non-unique protein structures (dynamic), H-bond cutoff, etc.

Different secondary structures between homologues (~12%).

Non-locality. Secondary structure is influenced by long-range interactions.
-- Some segments can have multiple structure types (chameleon sequences).

How Far Can We Go?

Page 19: Prediction of Protein Structure in 1D

Conceptually similar problem to SS prediction: Buried vs. Exposed.

Weighted Ensemble Solvent Accessibility predictor: http://pipe.scs.fsu.edu/wesa.html

Solvent accessibility

[Figure: residues labeled E (exposed) or B (buried).]

Page 20: Prediction of Protein Structure in 1D

To provide structural context for putative mutations that one wants to characterize biochemically or biophysically.

Why bother?

Page 21: Prediction of Protein Structure in 1D

Again, conceptually similar problem to SS prediction: TM vs. Not.

Transmembrane Segment Prediction