Approximation of Protein Structure for Fast Similarity Measures
![Page 1: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/1.jpg)
Approximation of Protein Structure for Fast Similarity Measures
Itay Lotan, Fabian Schwarzer
![Page 2: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/2.jpg)
Comparing Protein Structures
Same protein: one conformation vs. another
Analysis of MDS and MCS trajectories
http://folding.stanford.edu
Structure prediction applications
• Evaluating decoy sets
• Clustering predictions (Shortle et al, Biophysics ’98)
Graph-based Methods
• Stochastic Roadmap Simulation (Apaydin et al, RECOMB ’02)
![Page 3: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/3.jpg)
k Nearest-Neighbors Problem
Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c.
Can be done in $O(N \log k \cdot L)$ time, where
N – size of S
L – time to compare two conformations
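A minimal sketch of the brute-force search described above, assuming a bounded heap of size k is used to keep the best candidates. The function name `k_nearest` and the toy Euclidean distance are illustrative only; in the setting of the slides the `distance` argument would be a structural measure whose cost is L.

```python
import heapq
import math


def k_nearest(query, conformations, k, distance):
    """Return the k (distance, index) pairs closest to `query`, nearest first."""
    heap = []  # max-heap on distance, stored as (-distance, index)
    for idx, conf in enumerate(conformations):
        d = distance(query, conf)          # one comparison, cost L
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:              # closer than the current k-th best
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-neg_d, idx) for neg_d, idx in heap)


# Toy data: 2-D points stand in for conformations (an assumption for illustration).
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.5, 0.5), (9.0, 9.0)]
nearest = k_nearest((0.0, 0.0), points, k=2, distance=math.dist)
print([idx for _, idx in nearest])
```

Each of the N conformations is compared once and touches a heap of at most k entries, which is where the N, log k, and L factors come from.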
![Page 4: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/4.jpg)
k Nearest-Neighbors Problem
What if needed for all c in S?
$O(N^2 \log k \cdot L)$ is too much time
Can be improved by:
1. Reducing L
2. A more efficient algorithm
![Page 5: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/5.jpg)
Our Solution
Reduce structure description → approximate but fast similarity measures
Reduce description further → efficient nearest-neighbor algorithms can be used
![Page 6: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/6.jpg)
Description of a Protein’s Structure
3n coordinates of Cα atoms (n – Number of residues)
![Page 7: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/7.jpg)
m-Averaged Approximation
• Cut chain into pieces of length m
• Replace each sequence of m Cα atoms by its centroid
3n coordinates → 3n/m coordinates
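The m-averaging step can be sketched in a few lines of NumPy. The function name `m_average` is hypothetical, and the handling of leftover atoms (a shorter final piece) is an assumption, since the slides do not say what happens when n is not a multiple of m.

```python
import numpy as np


def m_average(coords, m):
    """Replace each run of m consecutive C-alpha atoms by its centroid.

    coords: (n, 3) array of C-alpha positions.
    Assumption: if n is not a multiple of m, the last piece is simply shorter.
    """
    coords = np.asarray(coords, dtype=float)
    return np.array([coords[s:s + m].mean(axis=0)
                     for s in range(0, len(coords), m)])


chain = np.arange(24, dtype=float).reshape(8, 3)  # toy chain with n = 8
reduced = m_average(chain, m=4)                    # 8 atoms -> 2 centroids
print(reduced.shape)
```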
![Page 8: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/8.jpg)
Similarity Measures - cRMS
The RMS of the distances between corresponding atoms after the two conformations are optimally aligned
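$$cRMS(P, Q) = \min_{T} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\| p_i - T q_i \right\|_2^2}$$
where T ranges over rigid-body transformations and p_i, q_i are corresponding atoms of P and Q.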
Computed in O(n) time
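One common way to compute the optimal alignment in O(n) time is the Kabsch algorithm (centroid superposition followed by a 3x3 SVD). The sketch below assumes that approach; the slides do not say which alignment method the authors used, and the function name `crms` is illustrative.

```python
import numpy as np


def crms(P, Q):
    """RMS deviation between (n, 3) arrays P and Q after optimal rigid alignment."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)                 # remove translation
    Qc = Q - Q.mean(axis=0)
    H = Qc.T @ Pc                           # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T # rotation taking Qc onto Pc
    diff = Pc - Qc @ R.T
    return np.sqrt((diff ** 2).sum() / len(P))


# Q is a rotated and translated copy of P, so the cRMS should be ~0.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
Q = P @ Rz.T + np.array([1.0, -2.0, 3.0])
print(crms(P, Q))
```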
![Page 9: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/9.jpg)
Similarity Measures - dRMS
The Euclidean distance between the intra-molecular distance matrices of the two conformations
$$dRMS(P, Q) = \sqrt{\frac{2}{n(n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \left( d^{P}_{ij} - d^{Q}_{ij} \right)^2}$$
Computed in $O(n^2)$ time
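A minimal sketch of dRMS, assuming each conformation is given as an (n, 3) array. The helper `distance_vector` (a hypothetical name) flattens the n(n-1)/2 distinct intra-molecular distances into a vector, so dRMS becomes a scaled Euclidean distance between two such vectors.

```python
import numpy as np


def distance_vector(X):
    """Flat vector of the n(n-1)/2 pairwise distances within an (n, 3) array."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(X), k=1)   # each unordered pair once
    return D[i, j]


def drms(P, Q):
    """Root mean square difference between corresponding intra-molecular distances."""
    dP, dQ = distance_vector(P), distance_vector(Q)
    return np.sqrt(((dP - dQ) ** 2).mean())


P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
Q = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(drms(P, Q))
```

Because only internal distances are used, no alignment step is needed, at the price of quadratic cost in n.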
![Page 10: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/10.jpg)
Evaluation: Test Sets
8 structurally diverse proteins of size 54 - 76 residues:
1. Decoy sets: conformations from the Park-Levitt set (Park et al, JMB ’97), N = 10,000
2. Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, Proteins ’00), N = 5,000
![Page 11: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/11.jpg)
Decoy Sets Correlation

| m | cRMS | dRMS |
|---|---|---|
| 3 | 0.99 | 0.96 – 0.98 |
| 4 | 0.98 – 0.99 | 0.94 – 0.97 |
| 6 | 0.92 – 0.99 | 0.78 – 0.93 |
| 9 | 0.81 – 0.98 | 0.65 – 0.96 |
| 12 | 0.54 – 0.92 | 0.52 – 0.69 |

Higher correlation for random sets!
![Page 12: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/12.jpg)
Speed-up for Decoy Sets
• 9x for cRMS (m = 9)
• 36x for dRMS (m = 6)
with very small error
For random sets the speed-up for dRMS goes up to 81x (m = 9)
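The reported speed-ups follow from the stated complexities: m-averaging shrinks n to n/m, so an O(n) measure gains a factor of about m and an O(n^2) measure about m^2. A small sketch of that arithmetic, ignoring constant overheads (the function name `ideal_speedup` is illustrative):

```python
def ideal_speedup(m, measure):
    """Ideal speed-up from m-averaging: m for cRMS (linear), m**2 for dRMS (quadratic)."""
    exponent = {"cRMS": 1, "dRMS": 2}[measure]
    return m ** exponent


for m, measure in [(9, "cRMS"), (6, "dRMS"), (9, "dRMS")]:
    print(f"{measure}, m = {m}: {ideal_speedup(m, measure)}x")
```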
![Page 13: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/13.jpg)
Efficient Nearest-Neighbor Algorithms
There are efficient nearest-neighbor algorithms, but they are not compatible with these similarity measures:
cRMS is not a Euclidean metric
dRMS uses a space of dimensionality n(n-1)/2
![Page 14: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/14.jpg)
Further Dimensionality Reduction of dRMS
kd-trees require dimension ≤ 20
m-averaging with dRMS is not enough
Reduce further using SVD
SVD: A tool for principal component analysis. Computes directions of greatest variance.
![Page 15: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/15.jpg)
Reduction Using SVD
1. Stack m-averaged distance matrices as vectors
2. Compute the SVD of entire set
3. Project onto most important singular vectors
dRMS is thus reduced to ≤ 20 dimensions
Without m-averaging SVD can be too costly
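The three steps above can be sketched with NumPy's SVD. Centering the vectors before the decomposition is an assumption made here (it is the usual PCA convention and is not stated on the slide), and the function name `svd_reduce` is hypothetical.

```python
import numpy as np


def svd_reduce(vectors, num_components):
    """Project row vectors onto their leading principal directions.

    vectors: (N, d) array, one stacked distance vector per conformation.
    Returns the (N, num_components) projection, the mean, and the directions.
    """
    X = np.asarray(vectors, dtype=float)
    mean = X.mean(axis=0)                      # assumption: center first (PCA style)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:num_components]           # directions of greatest variance
    return (X - mean) @ components.T, mean, components


# Toy data that truly lives in a 3-dimensional subspace of a 30-dimensional space.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 30))
low_dim, mean, components = svd_reduce(data, num_components=3)
print(low_dim.shape)
```

Euclidean distances between the projected vectors then approximate the original distances, which is what makes standard nearest-neighbor structures applicable.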
![Page 16: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/16.jpg)
Testing the Method
• Use decoy sets (N = 10,000) and random sets (N = 5,000)
• m-averaging with m = 4
• Project onto 16 PCs for decoys, 12 PCs for random sets
• Find k = 10, 25, 100 NNs for 250 conformations in each set
![Page 17: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/17.jpg)
Results
Decoy sets:
• ~77% correct
• Furthest approximate NN off by 10% - 15%
• ~4k approximate NNs contain all true k NNs
Random sets:
• 71%, 76%, 84% correct for k = 10, 25, 100 respectively
• Furthest approximate NN off by 5% - 10%
• ~3k approximate NNs contain all true k NNs
![Page 18: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/18.jpg)
More Results: N = 100,000
1CTF decoys:
• ~70% correct
• Furthest approximate NN off by ~20%
• ~6k approximate NNs contain all true k NNs
1CTF random:
• 46%, 48%, 60% correct respectively
• Furthest approximate NN off by ~16%
• ~7k approximate NNs contain all true k NNs
![Page 19: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/19.jpg)
Running Time
N = 100,000, m = 4, PC = 16
Find k = 100 for each conformation
• Brute-force: ~84 hours
• Brute-force + m-averaging: ~4.8 hours
• Brute-force + m-averaging + SVD: 41 minutes
• kd-tree + m-averaging + SVD: 19 minutes
kd-trees will have more impact for larger sets
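For the last configuration, a kd-tree query over the reduced vectors might look like the sketch below. It uses SciPy's `cKDTree`, which is a choice made here for illustration; the slides do not name the kd-tree implementation that was timed, and the random array merely stands in for N conformations projected onto 16 PCs.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
reduced = rng.normal(size=(1000, 16))    # stand-in for N conformations x 16 PCs

k = 10
tree = cKDTree(reduced)                  # build once
# Ask for k + 1 neighbors because each point is its own nearest neighbor.
dist, idx = tree.query(reduced, k=k + 1)
neighbors = idx[:, 1:]                   # drop the self-match in column 0
print(neighbors.shape)
```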
![Page 20: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/20.jpg)
Structural Classification
Computing the similarity between structures of two different proteins is more involved:
The correspondence problem: Which parts of the two structures should be compared?
1IRD vs. 2MM1
![Page 21: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/21.jpg)
STRUCTAL (Subbiah et al, ’93)
1. Compute optimal correspondence using dynamic programming
2. Optimally align the corresponding parts in space to minimize cRMS
3. Repeat until convergence
$O(n_1 n_2)$ time
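Step 1 can be illustrated with a generic global-alignment dynamic program over a pairwise score matrix. This is a sketch only: the linear gap penalty and the toy scores are assumptions, not STRUCTAL's actual scoring scheme, and `align_dp` is a hypothetical name. Filling the table takes time proportional to n1 times n2, matching the stated bound.

```python
import numpy as np


def align_dp(score, gap=1.0):
    """Best-scoring correspondence between two chains.

    score[i, j] rewards matching residue i of chain 1 with residue j of chain 2;
    every skipped residue costs `gap`. Returns the matched (i, j) pairs in order.
    """
    n1, n2 = score.shape
    best = np.zeros((n1 + 1, n2 + 1))
    best[1:, 0] = -gap * np.arange(1, n1 + 1)
    best[0, 1:] = -gap * np.arange(1, n2 + 1)
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            best[i, j] = max(best[i - 1, j - 1] + score[i - 1, j - 1],
                             best[i - 1, j] - gap,
                             best[i, j - 1] - gap)
    # Trace back from the corner to recover the matched pairs.
    pairs, i, j = [], n1, n2
    while i > 0 and j > 0:
        if best[i, j] == best[i - 1, j - 1] + score[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i, j] == best[i - 1, j] - gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy example: chain B is chain A with its third element missing.
A, B = [0, 1, 2, 3], [0, 1, 3]
score = np.array([[2.0 if a == b else -2.0 for b in B] for a in A])
print(align_dp(score))
```

In the full method the matched pairs would then be superimposed to minimize cRMS, the scores recomputed from the new superposition, and the two steps repeated until the correspondence stops changing.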
![Page 22: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/22.jpg)
STRUCTAL + m-averaging
• 256 protein domains (180 – 420 res)
• 3691 good matches (Sandelin’s PROTOFARM)
• 6375 random pairs
• Compute SAS scores (cRMS/length*100)

| m | correlation | speed-up |
|---|---|---|
| 3 | 0.81 | ~9x |
| 4 | 0.77 | ~16x |
| 5 | 0.70 | ~25x |
![Page 23: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/23.jpg)
[Figure: histogram of number of pairs vs. SAS score, for OK (P < 0.005) and BAD (P > 0.005) pairs]
![Page 24: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/24.jpg)
![Page 25: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/25.jpg)
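Misclassification

| m | Total (P ≤ 0.005) | FP | FN | Total (P ≤ 0.001) | FP | FN |
|---|---|---|---|---|---|---|
| m=1 | 410 (4%) | 176 | 225 | 457 (4.5%) | 215 | 242 |
| m=3 | 802 (8%) | 274 | 528 | 660 (6.5%) | 262 | 398 |
| m=4 | 993 (10%) | 248 | 745 | 870 (8.5%) | 282 | 588 |
| m=5 | 1602 (16%) | 186 | 1416 | 1129 (11%) | 148 | 981 |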
![Page 26: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/26.jpg)
Random Chains
[Figure: a random chain with nodes c0, c1, c2, ..., cn-1]
• The dimensions are uncorrelated
• Average behavior can be approximated by normal variables:
$$c_{i+1} = c_i + l \cdot N(0, 1)$$
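A random chain of this kind can be simulated directly. The sketch assumes that each coordinate of each link is an independent standard normal variable scaled by a link length l, and that the chain starts at the origin; the function name `random_chain` and the parameter `link_length` are illustrative.

```python
import numpy as np


def random_chain(n, link_length=1.0, seed=None):
    """Return an (n, 3) array of node positions c_0 ... c_{n-1}.

    Assumption: c_0 is the origin and every link has independent
    N(0, 1) coordinates scaled by link_length.
    """
    rng = np.random.default_rng(seed)
    steps = link_length * rng.standard_normal(size=(n - 1, 3))
    return np.vstack([np.zeros(3), np.cumsum(steps, axis=0)])


chain = random_chain(64, link_length=3.8, seed=0)
print(chain.shape)
```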
![Page 27: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/27.jpg)
1-D Haar Wavelet Transform
Recursive averaging and differencing of the values
| Level | Averages | Detail Coefficients |
|---|---|---|
| 3 | [ 9 7 2 6 5 1 4 6 ] | |
| 2 | [ 8 4 3 5 ] | [ 1 -2 2 -1 ] |
| 1 | [ 6 4 ] | [ 2 -1 ] |
| 0 | [ 5 ] | [ 1 ] |

[ 9 7 2 6 5 1 4 6 ] → [ 5 1 2 -1 1 -2 2 -1 ]
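The recursive averaging and differencing can be written as a short loop. This sketch uses the convention average = (a + b) / 2 and detail = (a - b) / 2 for each adjacent pair, and assumes the input length is a power of two; `haar_transform` is an illustrative name.

```python
def haar_transform(values):
    """Return [overall average] followed by detail coefficients, coarsest level first."""
    averages = list(values)
    details = []
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details   # coarser levels go in front
    return averages + details


coeffs = haar_transform([9, 7, 2, 6, 5, 1, 4, 6])
print(coeffs)  # [5.0, 1.0, 2.0, -1.0, 1.0, -2.0, 2.0, -1.0]
```

Stopping the loop after log2(m) rounds and keeping only the averages gives exactly the block means over m consecutive values.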
![Page 28: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/28.jpg)
Transform of Random Chains
pdf of the detail coefficients is:
$$d_j \sim N\left(0,\, O(4^{j})\right)$$
• Coefficients expected to be ordered!
• Discard coefficients starting at lowest level
m-averaging corresponds to discarding the lowest log m levels of detail coefficients
![Page 29: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/29.jpg)
Random Chains and Proteins
Protein backbones behave on average like random chains
• Chain topology
• Limited compactness
![Page 30: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/30.jpg)
![Page 31: Approximation of Protein Structure for Fast Similarity Measures](https://reader036.vdocuments.net/reader036/viewer/2022070402/568137a6550346895d9f504b/html5/thumbnails/31.jpg)
Conclusion
• Fast computation of similarity measures
• Trade-off between speed and precision
• Exploits chain topology of proteins and limited compactness
• Allows use of efficient nearest-neighbor algorithms
• Can be used as filter when precision is important
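The filter idea in the last point can be sketched as a shortlist-then-refine search: pick c·k candidates using the cheap reduced vectors, then re-rank only those with the exact measure. The function name `filtered_knn`, the parameter `c`, and the toy data are assumptions for illustration; the default c = 4 echoes the decoy-set result that roughly 4k approximate neighbors contained all true k neighbors.

```python
import numpy as np


def filtered_knn(query_idx, approx, exact_distance, k, c=4):
    """k nearest neighbors of one conformation, using the approximation as a filter.

    approx: (N, d) reduced vectors; exact_distance(i, j) is the expensive measure.
    """
    candidates = np.delete(np.arange(len(approx)), query_idx)   # exclude the query
    d_approx = np.linalg.norm(approx[candidates] - approx[query_idx], axis=1)
    shortlist = candidates[np.argsort(d_approx)[:c * k]]        # cheap filter
    exact = np.array([exact_distance(query_idx, j) for j in shortlist])
    return shortlist[np.argsort(exact)[:k]]                     # exact re-ranking


# Toy setup: full vectors play the exact measure, their first 3 columns the approximation.
rng = np.random.default_rng(1)
pts = rng.normal(size=(50, 5))
exact = lambda i, j: float(np.linalg.norm(pts[i] - pts[j]))
print(filtered_knn(0, pts[:, :3], exact, k=5))
```

Only c·k exact comparisons are made per query instead of N, so the cost of the precise measure no longer dominates.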