Statistical Methods for Protein Structure Prediction
Outline
• Review statistical methods
– KNN
– Logistic regression
• Introduce neural networks
• Protein secondary structure prediction
• Protein disorder prediction
Logistic Regression
Given: D = {(x_i, y_i), i = 1…n} – dataset of labeled examples
x ∈ R^k, where k is the number of features
y ∈ {0, 1}
Task: find a line (a hyperplane when k > 2) in the space of features such that positives (y = 1) and negatives (y = 0) are best separated
[Figure: example with k = 2]
Logistic Regression
← Form of solution
Solution →Prediction
↓
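The equations on this slide did not survive extraction. As a stand-in, here is a minimal sketch of logistic regression for the task stated above, assuming the usual sigmoid model trained by batch gradient descent on the log-loss. The function names, learning rate, epoch count and toy data are illustrative assumptions, not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w (w[0] is the bias) by batch gradient descent on the log-loss."""
    k = len(X[0])
    n = len(X)
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi                      # derivative of the log-loss w.r.t. z
            grad[0] += err
            for j in range(k):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    """Predict 1 when the estimated P(y = 1 | x) is at least 0.5."""
    p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
    return 1 if p >= 0.5 else 0

# Toy, linearly separable data with k = 2 features (hypothetical values).
X = [(0.0, 0.2), (0.3, 0.1), (0.2, 0.4), (0.9, 0.8), (0.7, 1.0), (1.0, 0.6)]
y = [0, 0, 0, 1, 1, 1]
w = train_logistic(X, y)
predictions = [predict(w, x) for x in X]
```

The decision boundary is the set of points where w[0] + w[1]·x1 + w[2]·x2 = 0, which is a line for k = 2.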
Problem with Linear Methods
← Linearly separable?
Extend Logistic Regression…
… to a neural network
Non-linear Decision Boundaries
speech recognition example
Problems in Bioinformatics?
Secondary Structure Prediction
Why Predict Protein Structure?
Many protein sequences, few 3-D structures
[Figure: number of entries (0 to 300,000) by year (1984 to 2000) for Sequences (PIR) and Structures (PDB)]
http://bioinf.cs.ucl.ac.uk
Why Predict Protein Structure?
• Experimental methods are expensive, time consuming and difficult to automate
• Predictive methods are easily automated, fast and cheap
• Can be used to improve alignment accuracy
• Can be used to detect domain boundaries within proteins with remote sequence homology
• Predicted structure gives clues about function
• Useful for mutagenesis studies
• Often the first step towards fold recognition
An Example
Protein Structure
Primary (Sequence): IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGEHNIDVLEGNEQFINAAKIITHPNFNGNTL...
Secondary (Helix/Strand/Coil) and lack of structure (disorder)
Domain and Tertiary (Fold)
Quaternary (Complexes)
Overall Approach
[Flowchart. Boxes: Protein Sequence; Database Searching; Multiple Sequence Alignment; Domain Assignment; Homologue in PDB (Yes/No); Comparative Modelling; Secondary Structure and Disorder Prediction; Fold Recognition; Predicted Fold (Yes/No); Sequence-Structure Alignment; Ab-initio Structure Prediction; 3-D Protein Model]
modified from http://bioinf.cs.ucl.ac.uk
Secondary Structure Prediction
Protein Secondary Structure
STRAND
HELIX
COIL
1st Generation Methods
Based on single amino acid properties
Examples:
• Chou & Fasman (1974)
• Lim (1974)
• Garnier, Osguthorpe & Robson (1978)
The Q3 accuracy gives the percentage of residues correctly predicted as Coil/Helix/Strand.
These methods had Q3 accuracies around 50-55%
Prediction Accuracy
Q index (Qhelix, Qstrand, Qcoil, Q3): percentage of residues correctly predicted as α-helix, β-strand, coil, or for all 3 conformations.
$$Q_3 = \frac{N_{\text{correctly predicted}}}{N_{\text{total residues}}} \cdot 100$$
Drawback: even a random assignment of structure can achieve a high score (Holley & Karplus 1991)
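A short sketch of the Q index computation may help here. It assumes observed and predicted structures are given as equal-length strings over H (helix), E (strand) and C (coil); the function name and the example strings are illustrative.

```python
def q_scores(observed, predicted):
    """Return Q3 and the per-state scores QH, QE, QC as percentages."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    total = len(observed)
    correct = sum(o == p for o, p in zip(observed, predicted))
    scores = {"Q3": 100.0 * correct / total}
    for state in "HEC":
        n_state = observed.count(state)
        if n_state:
            # Residues observed in this state that were also predicted in it.
            hits = sum(o == p == state for o, p in zip(observed, predicted))
            scores["Q" + state] = 100.0 * hits / n_state
    return scores

observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHCCCCEEECCC"
scores = q_scores(observed, predicted)
```

On this toy pair, 13 of 16 residues agree, so Q3 is 81.25, while the helix score is only 4 of 6. This illustrates why Q3 alone can hide per-state weaknesses.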
Correlation coefficient
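pα: true positives; nα: true negatives; oα: false positives; uα: false negatives
$$C_\alpha = \frac{p_\alpha n_\alpha - u_\alpha o_\alpha}{\sqrt{[n_\alpha + u_\alpha][n_\alpha + o_\alpha][p_\alpha + u_\alpha][p_\alpha + o_\alpha]}}$$
Cα = 1 (= 100%) for a perfect prediction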
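The correlation coefficient can be sketched for one state as follows, using the slide's symbols p (true positive), n (true negative), o (false positive) and u (false negative). The function name and the toy strings are assumptions; returning 0.0 when the denominator vanishes is a convention chosen here, not something the slide specifies.

```python
import math

def correlation_coefficient(observed, predicted, state):
    """Per-state correlation coefficient C for one secondary structure state."""
    p = n = o = u = 0
    for obs, pred in zip(observed, predicted):
        if obs == state and pred == state:
            p += 1          # true positive
        elif obs != state and pred != state:
            n += 1          # true negative
        elif pred == state:
            o += 1          # false positive (over-prediction)
        else:
            u += 1          # false negative (under-prediction)
    denom = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    return (p * n - u * o) / denom if denom else 0.0

observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHCCCCEEECCC"
c_helix = correlation_coefficient(observed, predicted, "H")
c_perfect = correlation_coefficient(observed, observed, "H")
```

For the helix state in this toy pair, p = 4, n = 10, o = 0 and u = 2, giving C of roughly 0.745.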
Statistical Methods
From PDB database, calculate the propensity for a given amino acid to adopt a certain structural type (H, S, C)
$$P_\alpha^i = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}$$
Example: #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500
p(α, aa) = 500/20,000, p(α) = 4,000/20,000, p(aa) = 2,000/20,000
P = 500 / (4,000/10) = 1.25
Used in Chou-Fasman algorithm (1974)
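The propensity calculation can be checked with a few lines of code. This is a sketch that takes raw counts as input; the function and parameter names are illustrative.

```python
def propensity(n_aa_in_state, n_aa, n_state, n_total):
    """Propensity of an amino acid for a structural state.

    Computed as p(state, aa) / (p(state) * p(aa)) from raw counts.
    """
    p_joint = n_aa_in_state / n_total
    p_state = n_state / n_total
    p_aa = n_aa / n_total
    return p_joint / (p_state * p_aa)

# Counts from the alanine example: 500 Ala in helix, 2,000 Ala,
# 4,000 helix residues, 20,000 residues in total.
p_ala_helix = propensity(500, 2000, 4000, 20000)
```

A value above 1 means the residue occurs in that state more often than expected if residue type and state were independent.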
Chou-Fasman: Initiation
Residue  T     S     P     T     A     E     L     M     R     S     T     G
P(H)     0.69  0.77  0.57  0.69  1.42  1.51  1.21  1.45  0.98  0.77  0.69  0.57
Identify regions where 4 of 6 residues have P(H) > 1.00: the “alpha-helix nucleus”
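The initiation rule can be sketched as a sliding-window scan. The P(H) values are the ones printed on the slide for the peptide TSPTAELMRSTG; the function name and its defaults are illustrative.

```python
# P(H) values from the slide for the peptide T S P T A E L M R S T G.
P_H = [0.69, 0.77, 0.57, 0.69, 1.42, 1.51, 1.21, 1.45, 0.98, 0.77, 0.69, 0.57]

def helix_nuclei(p_h, window=6, min_hits=4, threshold=1.00):
    """Return start indices of windows with at least min_hits values above threshold."""
    starts = []
    for i in range(len(p_h) - window + 1):
        hits = sum(1 for value in p_h[i:i + window] if value > threshold)
        if hits >= min_hits:
            starts.append(i)
    return starts

nuclei = helix_nuclei(P_H)
```

With these values the qualifying six-residue windows start at 0-based positions 2, 3 and 4, all of which contain the A-E-L-M stretch.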
Chou-Fasman: Propagation
Residue  T     S     P     T     A     E     L     M     R     S     T     G
P(H)     0.69  0.77  0.57  0.69  1.42  1.51  1.21  1.45  0.98  0.77  0.69  0.57
Extend helix in both directions until a set of four residues has an average P(H) < 1.00
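One possible reading of the propagation rule is sketched below: the segment grows by one residue at a time as long as the four-residue window that ends (or starts) at the new residue still averages at least 1.00. This windowing choice is an assumption, since the slide does not say exactly which four residues are averaged; the function name is also illustrative.

```python
# P(H) values from the slide for the peptide T S P T A E L M R S T G.
P_H = [0.69, 0.77, 0.57, 0.69, 1.42, 1.51, 1.21, 1.45, 0.98, 0.77, 0.69, 0.57]

def extend_helix(p_h, start, end, window=4, threshold=1.00):
    """Extend the half-open segment [start, end) in both directions.

    Assumes the initial segment is at least `window` residues long.
    """
    def mean(values):
        return sum(values) / len(values)

    # Grow to the right while the window ending at the candidate residue passes.
    while end < len(p_h) and mean(p_h[end - window + 1:end + 1]) >= threshold:
        end += 1
    # Grow to the left while the window starting at the candidate residue passes.
    while start > 0 and mean(p_h[start - 1:start - 1 + window]) >= threshold:
        start -= 1
    return start, end

# Start from the six-residue nucleus covering 0-based positions 2..7.
helix = extend_helix(P_H, 2, 8)
```

Under this reading the nucleus grows two residues to the right and none to the left, because the window T-A... on the left side (S, P, T, A) averages below 1.00.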
Scan peptide for β-sheet regions
Identify regions where 3 of 5 residues have P(E) > 1.00: the “β-sheet nucleus”
Extend β-sheet until 4 contiguous residues have an average P(E) < 1.00
If region average > 1.05 and the average P(E) > average P(H), then “β-sheet”
Residue  T     S     P     T     A     E     L     M     R     S     T     G
P(H)     0.69  0.77  0.57  0.69  1.42  1.51  1.21  1.45  0.98  0.77  0.69  0.57
P(E)     1.47  0.75  0.55  1.47  0.83  0.37  1.3   1.05  0.93  0.75  1.47  0.75
Chou-Fasman Prediction
Chou-Fasman Prediction
• Predict as α-helix segment with
– E[Pα] > 1.03
– E[Pα] > E[Pβ]
– Not including proline
• Predict as β-strand segment with
– E[Pβ] > 1.05
– E[Pβ] > E[Pα]
• Others are labeled as turns.
(Various extensions appeared in the literature)
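The segment rules above can be sketched as a small classifier. The propensity tables hold only the values printed on the earlier Chou-Fasman slides for the residues of TSPTAELMRSTG, so they are incomplete; the function name and the test segments are illustrative.

```python
# P(H) and P(E) values as printed on the slides (only residues of TSPTAELMRSTG).
P_ALPHA = {"T": 0.69, "S": 0.77, "P": 0.57, "A": 1.42, "E": 1.51,
           "L": 1.21, "M": 1.45, "R": 0.98, "G": 0.57}
P_BETA = {"T": 1.47, "S": 0.75, "P": 0.55, "A": 0.83, "E": 0.37,
          "L": 1.30, "M": 1.05, "R": 0.93, "G": 0.75}

def classify_segment(segment):
    """Label a segment as 'helix', 'strand' or 'turn' from its mean propensities."""
    mean_alpha = sum(P_ALPHA[aa] for aa in segment) / len(segment)
    mean_beta = sum(P_BETA[aa] for aa in segment) / len(segment)
    if mean_alpha > 1.03 and mean_alpha > mean_beta and "P" not in segment:
        return "helix"
    if mean_beta > 1.05 and mean_beta > mean_alpha:
        return "strand"
    return "turn"

labels = [classify_segment(s) for s in ("AELM", "TSPT", "RSTG")]
```

The segment AELM has mean P(H) near 1.40 and is labeled helix; TSPT has mean P(E) of about 1.06 and is labeled strand; RSTG passes neither test and falls through to turn.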
Chou-Fasman Prediction
• To identify a bend at residue number j, calculate the following value: p(t) = f(j) · f(j+1) · f(j+2) · f(j+3)
• Each factor uses the value for its own residue: f(j) for residue j, f(j+1) for residue j+1, f(j+2) for residue j+2, and f(j+3) for residue j+3.
• If: (1) p(t) > 0.000075; (2) the average value for P(turn) > 1.00 in the tetrapeptide; and (3) the averages for the tetrapeptide obey the inequality P(α-helix) < P(turn) > P(β-sheet), then a beta-turn is predicted at that location.
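The three turn conditions can be sketched as a single predicate over one tetrapeptide. All numeric inputs in the example are hypothetical placeholders, since the slides do not list bend frequencies or P(turn) values; the function name is also illustrative.

```python
def is_beta_turn(f, p_turn, p_helix, p_strand):
    """Apply the three conditions to residues j..j+3.

    Each argument is a list of four values, one per residue of the tetrapeptide.
    """
    def mean(values):
        return sum(values) / len(values)

    p_t = f[0] * f[1] * f[2] * f[3]
    return (p_t > 0.000075
            and mean(p_turn) > 1.00
            and mean(p_helix) < mean(p_turn) > mean(p_strand))

# Hypothetical values for one tetrapeptide (not taken from the slides).
likely_turn = is_beta_turn(
    f=[0.10, 0.10, 0.10, 0.10],
    p_turn=[1.5, 1.4, 1.2, 1.1],
    p_helix=[0.6, 0.7, 0.8, 0.9],
    p_strand=[0.5, 0.6, 0.7, 0.8],
)
unlikely_turn = is_beta_turn(
    f=[0.05, 0.05, 0.05, 0.05],
    p_turn=[1.5, 1.4, 1.2, 1.1],
    p_helix=[0.6, 0.7, 0.8, 0.9],
    p_strand=[0.5, 0.6, 0.7, 0.8],
)
```

The second call fails only condition (1): the product 0.05⁴ is about 0.000006, below the 0.000075 cutoff.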
Chou-Fasman Prediction
• Achieved accuracy: around 52%
• Shortcoming of this method: it ignores the context of the sequence when predicting from the amino-acid sequence
• We would like to use the sequence context as an input to a classifier
• There are many ways to address this.
• The most successful to date are based on neural networks
The “Chameleon” sequence
TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK
sequence 1 sequence 2
Replace both chameleon sequences with engineered peptide (“chameleon”)
Source: Minor and Kim 1996, Nature, 380, 730-734
α-helix, β-strand
University of Wyoming
2nd Generation Methods
Based on peptide segments / residue pairs
Examples:
• GOR III (1987)
• Neural Networks: Qian & Sejnowski (1988), among others
These methods had Q3 accuracies around 60-65%
Qian-Sejnowski Architecture
[Figure: feed-forward network with Input, Hidden and Output layers. Input: window of residues S(i−w) … S(i) … S(i+w). Output units: oα, oβ, oo]
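The windowed input to such a network can be sketched as follows. The encoding shown (one-hot with 20 amino acid units plus one spacer unit for window positions that fall beyond the chain termini) is an assumption for illustration; the figure only shows that residues S(i−w) through S(i+w) feed the input layer. The function name and amino acid ordering are also illustrative.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
UNITS_PER_POSITION = len(AMINO_ACIDS) + 1  # 20 residues + 1 spacer unit

def encode_window(sequence, i, w):
    """One-hot encode residues i-w .. i+w into a flat input vector."""
    units = []
    for pos in range(i - w, i + w + 1):
        one_hot = [0] * UNITS_PER_POSITION
        if 0 <= pos < len(sequence):
            one_hot[AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            one_hot[-1] = 1  # spacer: window extends past the chain end
        units.extend(one_hot)
    return units

# Window of 13 residues (w = 6) centred on the first residue of a short peptide.
inputs = encode_window("MDKVQYLT", 0, 6)
```

Each window position activates exactly one unit, so a window of 2w + 1 residues yields (2w + 1) × 21 inputs, and the network's output units are read as the prediction for the central residue.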
3rd Generation Methods
Exploit evolutionary information. Based on conservation analysis of multiple sequence alignments.
• PHD (Q3 ~ 70%)
Rost, B., Sander, C. (1993) J. Mol. Biol. 232, 584-599.
• PSIPRED (Q3 ~ 77%)
Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.
Arguably remains the top secondary structure prediction method (won all CASP competitions since 1998).
What Patterns of Conservation Are Used?
Given a multiple sequence alignment:
• Regions of low conservation → COIL
• Regions of conservation
– 1,4,5,8 pattern → ALPHA HELIX
– All hydrophobic → BURIED BETA STRAND
– Alternating residues → SURFACE BETA STRAND
– Glycine/Proline → TURN
PSIPRED - Detecting Patterns
Cascaded neural networks; structure resembles PHD
[Figure: input sequence ..VQIVGGPYTCAANSI... feeding a network (Σ) with outputs COIL, HELIX, STRAND]
Profile
Position-Specific Scoring Matrix (PSSM)
[Figure: PSSM for the AA sequence M D K V Q Y L T N T P S R A I P A T R R V V L G … L N I, with one integer score per sequence position for each of the 20 amino acids (A R N D C Q E G H I L K M F P S T W Y V). An input window of size Win is centred on the current position.]
PSIPRED - Using PSI-BLAST
Improvement over PHD is mainly due to using PSI-BLAST profiles
[Figure: PSI-BLAST profile rows for the sequence ..VQIVGGPYTCAANSI..., with columns labeled V Y W T S P F M K L I H G E Q C D N R A. Example rows:
0.1 0.2 0.1 0.1 0.2 0.2 0.1 0.0 0.1 0.1 0.2 0.1 0.1 0.2 0.7 0.2 0.3 0.1 0.2 0.1
0.7 0.3 0.1 0.1 0.1 0.2 0.1 0.0 0.1 0.2 0.2 0.1 0.2 0.3 0.3 0.1 0.2 0.3 0.3 0.2]
1st Network: 315 inputs, 75 hidden units, 3 outputs
2nd Network: 60 inputs, 60 hidden units, 3 outputs
Output: H/E/C 3-state prediction
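To make the input count concrete, the sketch below builds a first-network input vector from a PSSM. It assumes a decomposition consistent with the slide's numbers but not stated on it: 315 = 15 window positions × 21 values (20 profile scores plus one flag for positions beyond the chain termini), and likewise 60 = 15 × 4 for the second network. Squashing raw scores with a logistic function is also an assumption, as are the function name and the toy matrix.

```python
import math

def pssm_window(pssm, i, win=15):
    """Flatten a window of PSSM rows centred on position i into network inputs.

    pssm is a list of rows, one per sequence position, each with 20 scores.
    """
    half = win // 2
    inputs = []
    for pos in range(i - half, i + half + 1):
        if 0 <= pos < len(pssm):
            # Squash raw log-odds scores into the 0..1 range.
            inputs.extend(1.0 / (1.0 + math.exp(-score)) for score in pssm[pos])
            inputs.append(0.0)           # flag: position is inside the chain
        else:
            inputs.extend([0.0] * 20)
            inputs.append(1.0)           # flag: position is past a terminus
    return inputs

# Toy PSSM: 30 positions with all-zero scores (hypothetical data).
toy_pssm = [[0] * 20 for _ in range(30)]
first_net_inputs = pssm_window(toy_pssm, 0)
```

Under the same assumption, the second network would read a 15-position window of the first network's three outputs plus a terminus flag.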
PSIPRED – David Jones JMB 1999
PSIPRED Example Output
Measures of Secondary Structure Prediction Accuracy
• Q3 scores give the percentage of correctly predicted residues across 3 states (H,E,C)
• SOV scores (Segment OVerlap) give the percentage of correctly predicted SEGMENTS across 3 states
• SSEA scores (Secondary Structure Element Alignment) give a better idea of usefulness of secondary structure prediction for use in fold recognition
SOV
• SOV scores (Segment OVerlap) give the percentage of correctly predicted SEGMENTS across 3 states
Zemla et al. Proteins 34, 1999
SOV
Zemla et al. Proteins 34, 1999
s1: observed segment; s2: predicted segment; Sα: number of all segment pairs (s1, s2) with at least 1 α-residue in common; minOV: the overlap between s1 and s2; maxOV: the length in α-state of the union of positions of s1 and s2; Nα: the total number of residues in α-state.
[Figure: accuracy score (y-axis) by method (x-axis)]
Secondary Structure Prediction Summary
1st Generation - 1970s
• Q3 = 50-55%
• Chou & Fasman, GOR
2nd Generation - 1980s
• Q3 = 60-65%
• Qian & Sejnowski, GOR III
3rd Generation - 1990s
• Q3 = 70-80%
• PHD variants, PSIPRED, GOR V
4th Generation - 2000s?
• Upper limit 88%?
• Higher accuracy linked to database size?
• Is this problem now solved?
Bob MacCallum
State of the Art
• Both PHD and nearest neighbor get about 72%-74% accuracy
– Both predicted well in CASP2 (1996)
• PSI-PRED slightly better (around 76%)
• Recent trend: combining classification methods
– Best predictions in CASP3 (1998)
• Failures:
– Long-range effects: S-S bonds, parallel strands
– Chemical patterns
– Wrong prediction at the ends of helices/strands
Decision Tree
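[Figure: three two-level decision trees built from binary classifiers, each with Yes/No branches.
Tree 1: H / ~H, then E / C; leaves H, E, C.
Tree 2: E / ~E, then C / H; leaves E, C, H.
Tree 3: C / ~C, then H / E; leaves C, H, E.]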
Prediction of Protein Disorder
Prediction of Protein Disorder
• Relatively new field
• Implications for fold recognition
• Functionally important
• First prediction methods by Romero et al. (1997)
• Many other predictors (about 15 servers in 2006)
Dataset construction
[Figure labels: disordered region, ordered region]
Dataset consists of:
• 152 disordered proteins (~22,000 residues) – from literature and database search
• 290 ordered proteins (~67,000 residues) – from the Protein Data Bank
Disorder predictor
[Flowchart. Boxes: protein sequence; PSI-BLAST against NRDB, producing profiles (20 profile attributes); Attribute Construction; Base Predictor; Post-Processing; Predicted Disordered Regions]
• 79% on long disordered regions and 91% on ordered regions ⇒ overall prediction accuracy is 85%
Data representation
Sequence: W C Y L A M A H Q F AA A G K L K T S A L S C T
Input window (size = Win)
Class: (0/1) (disordered/ordered)
Calculate over window:
• 20 compositions
• K2 entropy
• 14Å contact number
• Hydropathy
• Flexibility
• Coordination number
• Bulkiness
• CFYW
• Volume
• Net charge
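A few of these window attributes can be sketched in code. The example computes the 20 compositions, a Shannon entropy of the composition, and a net charge. Shannon entropy is used here only as a stand-in for the slide's K2 entropy, and the charge convention (K and R as +1, D and E as −1) is an assumption; the function name is illustrative, and the sequence is the slide's letters with spaces removed.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_attributes(sequence, center, win):
    """Compute composition, entropy and net charge over a window around center."""
    half = win // 2
    window = sequence[max(0, center - half):center + half + 1]
    counts = Counter(window)
    composition = [counts[aa] / len(window) for aa in AMINO_ACIDS]
    # Shannon entropy in bits (stand-in for K2 entropy).
    entropy = -sum(f * math.log2(f) for f in composition if f > 0)
    # Net charge under the assumed convention: K, R positive; D, E negative.
    net_charge = (counts["K"] + counts["R"]) - (counts["D"] + counts["E"])
    return composition, entropy, net_charge

sequence = "WCYLAMAHQFAAAGKLKTSALSCT"
composition, entropy, net_charge = window_attributes(sequence, 15, 7)
```

Sliding the window along the sequence gives one attribute vector per residue, and each vector is paired with that residue's 0/1 class label for training.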
Prediction of Disorder
sn = 76%
sp = 91%
accuracy > 80%
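The relationship between these three numbers can be sketched as follows, assuming that disorder is the positive class and that "accuracy" means the mean of sensitivity and specificity. That reading is an assumption; it matches the 79% / 91% / 85% figures quoted for the predictor earlier. The function name and the counts are hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity, balanced accuracy) from confusion counts."""
    sn = tp / (tp + fn)   # fraction of disordered residues predicted disordered
    sp = tn / (tn + fp)   # fraction of ordered residues predicted ordered
    return sn, sp, (sn + sp) / 2

# Hypothetical counts chosen to reproduce sn = 76% and sp = 91%.
sn, sp, balanced = sensitivity_specificity(tp=76, fn=24, tn=91, fp=9)
```

With these counts the balanced accuracy is 83.5%, consistent with the "accuracy > 80%" figure.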
Comparisons of methods
OLS: Ordinary Least Squares Regression
LR: Logistic Regression
NN: Neural Networks
VSL model - Background
VSL model