Department of Computer Science, University of California, Santa Barbara
August 11-14, 2003
CTSS: A Robust and Efficient Method for Protein Structure Alignment
Based on Local Geometrical and Biological Features
Tolga Can and Yuan-Fang Wang
CSB2003, August 11-14, 2003
Introduction
• Importance of discovering structural relationships between proteins
• Structural alignment: NP-hard
• Protein structure representation: no standard as in sequence alignment
• Many algorithms
  • Inter-atomic distances (CE, DALI)
  • SSE vectors (VAST, 3D-Lookup)
• Different similarity measures: RMSD, p-value, etc.
Problem Definition
Given a protein structure, find similar protein structures from a database of protein structures.
[Figure: a set of protein chains (1fse:A, 1jek:B, 1alu:_, 2spc:A, 1l3l:C, 1k61:D, 1kzu:B, 1et1:A, 1jig:A, 1wdc:A, 1nkd:_, 1fmh:A, 1gl2:A), with the question "?" answered by the subset 1l3l:C, 1kzu:B, 1jig:A, 1nkd:_]
Protein Structure?
PDB file:
HEADER PHEROMONE 20-DEC-95 2ERL
..................................
SEQRES 1 40 ASP ALA CYS GLU GLN ALA
..................................
ATOM 1 N ASP 1 -1.115 8.537 7.075
ATOM 2 CA ASP 1 -1.925 7.470 6.547
ATOM 3 C ASP 1 -2.009 6.333 7.522
ATOM 4 O ASP 1 -1.467 6.394 8.624
ATOM 5 CB ASP 1 -1.526 6.993 5.163
ATOM 6 N ALA 2 -2.745 5.280 7.165
ATOM 7 CA ALA 2 -2.945 4.152 7.987
ATOM 8 C ALA 2 -1.606 3.448 8.305
ATOM 9 O ALA 2 -1.440 3.010 9.454
ATOM 10 CB ALA 2 -3.966 3.256 7.436
ATOM 11 N CYS 3 -0.777 3.267 7.329
ATOM 12 CA CYS 3 0.570 2.624 7.511
ATOM 13 C CYS 3 1.328 3.308 8.626
ATOM 14 O CYS 3 1.802 2.679 9.562
ATOM 15 CB CYS 3 1.351 2.667 6.209
ATOM 16 SG CYS 3 2.981 1.901 6.318
..................................
We use Cα coordinates to represent the protein structure.
Protein Structure
The Cα coordinates of a protein define a curve in 3D space.
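As a minimal sketch of this step, the Python function below pulls the Cα trace out of ATOM records. It assumes whitespace-separated fields in the compact form shown on the slide (record name, serial number, atom name, residue name, residue number, x, y, z). Real PDB files are fixed-column and carry extra fields such as a chain identifier, so the function name `ca_trace` and the field layout are illustrative assumptions, not part of the CTSS implementation.

```python
def ca_trace(pdb_lines):
    """Return the list of (x, y, z) C-alpha coordinates, in file order."""
    trace = []
    for line in pdb_lines:
        fields = line.split()
        # Assumed layout: ATOM serial atom_name res_name res_num x y z
        if len(fields) >= 8 and fields[0] == "ATOM" and fields[2] == "CA":
            x, y, z = (float(v) for v in fields[5:8])
            trace.append((x, y, z))
    return trace


# A few records copied from the 2ERL excerpt on the slide.
records = [
    "ATOM 1 N ASP 1 -1.115 8.537 7.075",
    "ATOM 2 CA ASP 1 -1.925 7.470 6.547",
    "ATOM 6 N ALA 2 -2.745 5.280 7.165",
    "ATOM 7 CA ALA 2 -2.945 4.152 7.987",
    "ATOM 12 CA CYS 3 0.570 2.624 7.511",
]
curve = ca_trace(records)  # one 3D point per residue
```

The resulting list of points is the polyline that the later slides treat as a space curve.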
Spline Approximation
We smooth the Cα curve based on secondary structure information.
[Figure: the smoothed Cα curve, with Helix and Turn segments labeled]
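The slides do not say which spline is used or how the secondary structure information steers the smoothing. As one possible illustration only, the sketch below treats the Cα atoms as control points of a uniform cubic B-spline and evaluates the spline at its knots, which replaces each interior point by (P[i-1] + 4 P[i] + P[i+1]) / 6. The function name `bspline_smooth` is hypothetical, and the secondary-structure-dependent part of the method is not modeled.

```python
def bspline_smooth(points):
    """Approximate a 3D polyline by a uniform cubic B-spline evaluated at its knots.

    Each interior point is replaced by (P[i-1] + 4*P[i] + P[i+1]) / 6;
    the two end points are left unchanged.
    """
    if len(points) < 3:
        return list(points)
    smoothed = [points[0]]
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        smoothed.append(tuple((a + 4.0 * b + c) / 6.0
                              for a, b, c in zip(prev, cur, nxt)))
    smoothed.append(points[-1])
    return smoothed


# A zigzag around the x axis: smoothing shrinks the deviations in y.
zigzag = [(0.0, 0.0, 0.0), (1.0, 0.6, 0.0), (2.0, -0.6, 0.0),
          (3.0, 0.6, 0.0), (4.0, 0.0, 0.0)]
smooth = bspline_smooth(zigzag)
```

An approximating spline of this kind does not pass through the original atoms, which is what makes it tolerant of small coordinate errors.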
Matching Two Curves
Are they similar?
Curvature and Torsion
• Curvature: measure of how far the curve deviates from being linear
• Torsion: measure of how far the curve deviates from being planar
• Fundamental Theorem of Space Curves: If two single-valued continuous functions κ(s) and τ(s) are given for s > 0, then there exists exactly one space curve, determined except for orientation and position in space (i.e., up to a Euclidean motion), where s is the intrinsic arc length, κ is the curvature, and τ is the torsion.
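The formulas on this slide were lost in extraction. The sketch below uses the standard textbook expressions for a curve r(t) under an arbitrary parameterization, κ = |r' × r''| / |r'|³ and τ = (r' × r'') · r''' / |r' × r''|², on the assumption that they match what the slide showed. It is checked against a circular helix r(t) = (a cos t, a sin t, b t), for which κ = a / (a² + b²) and τ = b / (a² + b²). The helper names are illustrative.

```python
import math


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def curvature_torsion(d1, d2, d3):
    """Curvature and torsion from the first three derivatives of r(t)."""
    c = cross(d1, d2)
    c_norm2 = dot(c, c)
    speed = math.sqrt(dot(d1, d1))
    if speed == 0.0:
        raise ValueError("curve is not regular at this point")
    kappa = math.sqrt(c_norm2) / speed ** 3
    # Torsion is undefined where the curvature vanishes; report 0 there.
    tau = dot(c, d3) / c_norm2 if c_norm2 > 0.0 else 0.0
    return kappa, tau


# Circular helix with a = 2, b = 1, evaluated at t = 0.7.
a, b, t = 2.0, 1.0, 0.7
d1 = (-a * math.sin(t), a * math.cos(t), b)
d2 = (-a * math.cos(t), -a * math.sin(t), 0.0)
d3 = (a * math.sin(t), -a * math.cos(t), 0.0)
kappa, tau = curvature_torsion(d1, d2, d3)  # close to 0.4 and 0.2
```

A straight line gives κ = 0 and a planar curve gives τ = 0, which is the sense in which the two quantities measure deviation from being linear and planar.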
Curvature and Torsion
• They are invariant to rotation and translation.
• They are localized.
[Figure: two plots, Curvature and Torsion, along the chain (x axis: 1 to 39)]
Feature Extraction
• For each amino acid, a (curvature, torsion) tuple is computed, and the secondary structure assignment is gathered from the PDB web site.
• This yields a sequence of n three-dimensional feature vectors (curvature, torsion, secondary structure type), where n is the number of amino acids in the protein.
[Figure: feature vectors plotted as curvature vs. torsion, both axes 0 to 250; the third dimension, secondary structure information, is not shown]
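To make the feature extraction concrete, here is a hedged sketch that computes one (curvature, torsion, SSE type) tuple per residue. It assumes the derivatives are taken from a uniform cubic B-spline whose control points are the Cα atoms, which is one possible reading of the spline approximation and not necessarily the one CTSS uses. The function name `residue_features` is hypothetical, the secondary structure labels are passed in as a plain list, and the first residue and the last two are skipped because the derivative stencil needs neighbors.

```python
import math


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residue_features(ca, sse):
    """Return (curvature, torsion, sse_type) for residues 1 .. len(ca) - 3.

    ca  : list of (x, y, z) C-alpha coordinates
    sse : list of secondary structure labels, one per residue
    """
    feats = []
    for i in range(1, len(ca) - 2):
        p0, p1, p2, p3 = ca[i - 1], ca[i], ca[i + 1], ca[i + 2]
        # Derivatives of the uniform cubic B-spline at the knot of residue i.
        d1 = tuple((c - a) / 2.0 for a, c in zip(p0, p2))
        d2 = tuple(a - 2.0 * b + c for a, b, c in zip(p0, p1, p2))
        d3 = tuple(-a + 3.0 * b - 3.0 * c + d
                   for a, b, c, d in zip(p0, p1, p2, p3))
        cr = _cross(d1, d2)
        cr2 = _dot(cr, cr)
        speed = math.sqrt(_dot(d1, d1))
        kappa = math.sqrt(cr2) / speed ** 3 if speed > 0.0 else 0.0
        # Torsion is set to 0 where the curve is locally straight.
        tau = _dot(cr, d3) / cr2 if cr2 > 1e-12 else 0.0
        feats.append((kappa, tau, sse[i]))
    return feats


# Synthetic right-handed helix with every residue labeled 'H'.
helix = [(2.0 * math.cos(0.5 * i), 2.0 * math.sin(0.5 * i), 0.5 * i)
         for i in range(12)]
feats = residue_features(helix, ["H"] * 12)
```

Because the derivative stencils only involve differences of neighboring points and the formulas only involve cross and dot products, rotating or translating the whole chain leaves every tuple unchanged, which is the invariance property the indexing relies on.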
Indexing the Features
• Why is indexing necessary?
• Hash table (shown in 2D below; the 3rd dimension is the SSE type)
[Figure: curvature vs. torsion plane, both axes 0 to 250, with one hash bin highlighted]
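A hash table over the features can be sketched as follows. The bin widths, the key layout, and the names `bin_key` and `build_index` are assumptions made for illustration; the slides only state that curvature, torsion, and SSE type form the three dimensions. The protein identifiers and feature values in the example are made-up toy data.

```python
import math
from collections import defaultdict


def bin_key(feature, kappa_step=0.01, tau_step=0.01):
    """Quantize a (curvature, torsion, sse_type) feature into a hash bin key."""
    kappa, tau, sse = feature
    return (math.floor(kappa / kappa_step), math.floor(tau / tau_step), sse)


def build_index(database):
    """Map each bin key to the (protein_id, residue_index) pairs that fall in it.

    database: dict of protein_id -> list of (curvature, torsion, sse_type)
    """
    index = defaultdict(list)
    for protein_id, features in database.items():
        for residue_index, feature in enumerate(features):
            index[bin_key(feature)].append((protein_id, residue_index))
    return index


# Toy database with hypothetical identifiers and values.
database = {
    "P1": [(0.052, -0.031, "H"), (0.058, -0.039, "H")],
    "P2": [(0.055, -0.035, "H"), (0.055, -0.035, "E")],
}
index = build_index(database)
```

Residues with similar local geometry and the same SSE type land in the same bin, so a query feature can retrieve its candidate matches with a single lookup instead of a scan over every residue in the database.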
Query Execution
• Hierarchical approach: pruning before detailed pairwise alignment
• Hash table lookup
• Accumulate votes: vote[protein]++
• Normalize votes: vote[protein] / length[protein]
• Threshold
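The screening steps above can be sketched as a small voting routine. It assumes one vote per database residue found in a bin hit by the query and a fixed cutoff on the length-normalized vote; the slides do not give these details, and the function name `screen`, the index layout, and the toy data are illustrative.

```python
from collections import Counter


def screen(query_keys, index, lengths, threshold):
    """Return (protein_id, normalized_vote) pairs that pass the threshold.

    query_keys : hash bin keys of the query's features
    index      : dict of bin key -> list of (protein_id, residue_index)
    lengths    : dict of protein_id -> number of residues
    """
    votes = Counter()
    for key in query_keys:
        for protein_id, _residue in index.get(key, ()):
            votes[protein_id] += 1  # accumulate: vote[protein]++
    # Normalize so that long proteins are not favored.
    scored = {p: v / lengths[p] for p, v in votes.items()}
    hits = [(p, s) for p, s in scored.items() if s >= threshold]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Toy index with hypothetical identifiers.
index = {
    (5, -4, "H"): [("P1", 0), ("P2", 3)],
    (6, -4, "H"): [("P1", 1)],
    (2, 1, "E"): [("P3", 0)],
}
lengths = {"P1": 4, "P2": 10, "P3": 5}
query_keys = [(5, -4, "H"), (6, -4, "H"), (9, 9, "T")]
candidates = screen(query_keys, index, lengths, threshold=0.25)
```

Only the proteins returned here would go on to the more expensive pairwise alignment.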
Query Execution
• Pairwise alignment by the Smith-Waterman dynamic programming technique is performed after the screening process:
[Figure: distance matrix between 1fse:A and 1l3l:C, aligned by Smith-Waterman (SW), with a gap marked; length: 63, RMSD: 1.61 Å]
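For reference, a minimal Smith-Waterman local alignment over two feature sequences might look like the sketch below. It assumes a linear gap penalty and a score obtained by subtracting a feature distance from a cutoff, so that close features score positively. The slides do not specify the scoring scheme, so `toy_score`, the cutoff, and the gap value are illustrative assumptions, and the scalar features are made-up toy values.

```python
def smith_waterman(a, b, score, gap=0.1):
    """Local alignment of sequences a and b.

    score(x, y) is positive for similar items and negative for dissimilar ones;
    gap is the penalty per gap position.
    Returns (best_score, pairs), where pairs lists the aligned index pairs (i, j).
    """
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0.0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


def toy_score(x, y, cutoff=0.05):
    """Distance-based score: positive only when the features are close."""
    return cutoff - abs(x - y)


seq_a = [0.9, 0.1, 0.2, 0.3, 0.9]
seq_b = [0.7, 0.7, 0.1, 0.2, 0.3]
best, pairs = smith_waterman(seq_a, seq_b, toy_score)
```

In this toy case the shared run 0.1, 0.2, 0.3 is recovered as the local alignment, with the non-matching ends of both sequences left out.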
SW Alignment Result
[Figure: aligned structures 1fse:A and 1l3l:C]
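The alignment results are reported as a length and an RMSD in Å. As a reminder of what that number measures, the sketch below computes the root-mean-square deviation between two lists of aligned Cα positions. It assumes the two structures have already been superimposed; fitting the optimal rotation and translation is not shown, and the slides do not say how that step is done in CTSS.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of 3D points.

    Assumes the two sets are already superimposed; no rotation is fitted here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((xa - xb) ** 2
                for pa, pb in zip(coords_a, coords_b)
                for xa, xb in zip(pa, pb))
    return math.sqrt(total / len(coords_a))


# Two toy point sets offset by one unit along z.
set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(set_a, set_b)
```

A longer aligned length with a lower RMSD indicates a closer structural match.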
Sample Query Results
• Query: 1faz:A, database: 1938 protein chains
• Screening time: 18 seconds
• Pairwise alignment time: 29 seconds
• 1faz:A & 1ytf:D: length 42, RMSD 2.8 Å
• 1faz:A & 1dj7:A: length 38, RMSD 3.68 Å
Sample Query Results
• Query: 1b16:A, database: 1938 protein chains
• Screening time: 25 seconds
• Pairwise alignment time: 68 seconds
• 1b16:A & 1h05:A: length 35, RMSD 3.26 Å
• 1b16:A & 1qp8:A: length 35, RMSD 1.58 Å
Current and Future Work
• Evaluation of accuracy
  • Comparison with SCOP classification
• Efficiency comparison with other techniques such as CE or DALI
• Better index structures
  • Faster and more accurate screening of candidates
• Incorporating biological and chemical properties of amino acids into the structure signatures of proteins
Conclusions
• A new method for protein structure alignment is presented
• Extracted structural features are:
  • Compact: O(n)
  • Localized: computed for each amino acid
  • Robust: error handling by spline approximation
  • Invariant: suitable for indexing
  • Meaningful: biological and chemical properties can be incorporated easily
• An indexing technique is deployed to avoid an exhaustive scan of the structure database
• Experimental results show that this method is suitable for finding structural motifs
Thank you for your attention!
For More Information:
Tolga Can
Department of Computer Science
University of California at Santa Barbara
Santa Barbara, CA 93106, U.S.
Email: [email protected]
http://www.cs.ucsb.edu/~tcan/CTSS/