Department of Computer Science, University of California, Santa Barbara August 11-14, 2003 CTSS: A Robust and Efficient Method for Protein Structure Alignment Based on Local Geometrical and Biological Features Tolga Can and Yuan-Fang Wang


Department of Computer Science, University of California, Santa Barbara

August 11-14, 2003

CTSS: A Robust and Efficient Method for Protein Structure Alignment

Based on Local Geometrical and Biological Features

Tolga Can and Yuan-Fang Wang


CSB2003, August 11-14, 2003

Introduction

• Importance of discovering structural relationships between proteins
• Structural alignment: NP-hard
• Protein structure representation: no standard as in sequence alignment
• Many algorithms
  • Inter-atomic distances (CE, DALI)
  • SSE vectors (VAST, 3D-Lookup)
• Different similarity measures: RMSD, p-value, etc.


Problem Definition

Given a protein structure, find similar protein structures from a database of protein structures.

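[Figure: a database of protein chains (1fse:A, 1jek:B, 1alu:_, 2spc:A, 1l3l:C, 1k61:D, 1kzu:B, 1et1:A, 1jig:A, 1wdc:A, 1nkd:_, 1fmh:A, 1gl2:A), a query marked "?", and the chains returned as similar (1l3l:C, 1kzu:B, 1jig:A, 1nkd:_).]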


Protein Structure?

HEADER PHEROMONE 20-DEC-95 2ERL
..................................
SEQRES 1 40 ASP ALA CYS GLU GLN ALA
..................................
ATOM 1 N ASP 1 -1.115 8.537 7.075
ATOM 2 CA ASP 1 -1.925 7.470 6.547
ATOM 3 C ASP 1 -2.009 6.333 7.522
ATOM 4 O ASP 1 -1.467 6.394 8.624
ATOM 5 CB ASP 1 -1.526 6.993 5.163
ATOM 6 N ALA 2 -2.745 5.280 7.165
ATOM 7 CA ALA 2 -2.945 4.152 7.987
ATOM 8 C ALA 2 -1.606 3.448 8.305
ATOM 9 O ALA 2 -1.440 3.010 9.454
ATOM 10 CB ALA 2 -3.966 3.256 7.436
ATOM 11 N CYS 3 -0.777 3.267 7.329
ATOM 12 CA CYS 3 0.570 2.624 7.511
ATOM 13 C CYS 3 1.328 3.308 8.626
ATOM 14 O CYS 3 1.802 2.679 9.562
ATOM 15 CB CYS 3 1.351 2.667 6.209
ATOM 16 SG CYS 3 2.981 1.901 6.318
..................................

We use Cα coordinates to represent the protein structure.

PDB File
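As an illustration of the Cα representation, the sketch below pulls the CA coordinates out of ATOM records laid out like the simplified excerpt on this slide (whitespace-separated, no chain column). The function name `ca_coordinates` is hypothetical. Real PDB files use fixed-width columns, so whitespace splitting is only an assumption that fits this excerpt.

```python
def ca_coordinates(lines):
    """Collect (x, y, z) for every CA atom in simplified ATOM records."""
    coords = []
    for line in lines:
        fields = line.split()
        # Assumed layout: ATOM serial atom_name residue_name residue_number x y z
        if len(fields) >= 8 and fields[0] == "ATOM" and fields[2] == "CA":
            coords.append(tuple(float(v) for v in fields[5:8]))
    return coords


# Records copied from the slide excerpt (one non-CA atom, three CA atoms).
records = [
    "ATOM 1 N ASP 1 -1.115 8.537 7.075",
    "ATOM 2 CA ASP 1 -1.925 7.470 6.547",
    "ATOM 7 CA ALA 2 -2.945 4.152 7.987",
    "ATOM 12 CA CYS 3 0.570 2.624 7.511",
]
print(ca_coordinates(records))
```

Each residue contributes exactly one point, so a chain of n residues yields a polyline of n points.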


Protein Structure


The Cα coordinates of a protein define a curve in 3D space.



Spline Approximation


We smooth the Cα curve based on secondary structure information.
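The slides do not say which spline is used or how the secondary structure information enters the fit. As one possible sketch, a uniform cubic B-spline that treats the Cα points as control points approximates the curve without passing through every point, which damps coordinate noise. The names `bspline_point` and `smooth_curve` are hypothetical.

```python
def bspline_point(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3
    b1 = 3 * t**3 - 6 * t**2 + 4
    b2 = -3 * t**3 + 3 * t**2 + 3 * t + 1
    b3 = t**3
    # The four basis weights always sum to 6.
    return tuple(
        (b0 * a + b1 * b + b2 * c + b3 * d) / 6.0
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def smooth_curve(points, samples_per_segment=4):
    """Sample the B-spline defined by the given control points."""
    curve = []
    for i in range(len(points) - 3):
        for k in range(samples_per_segment):
            curve.append(bspline_point(*points[i:i + 4], k / samples_per_segment))
    if len(points) >= 4:
        curve.append(bspline_point(*points[-4:], 1.0))  # close the last segment
    return curve


# Evenly spaced collinear points are reproduced exactly ...
line = [(float(i), 0.0, 0.0) for i in range(5)]
curve = smooth_curve(line, samples_per_segment=2)

# ... while a zigzag is pulled toward its centre line.
zigzag = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0), (3.0, 1.0, 0.0)]
damped = bspline_point(*zigzag, 0.0)
```

A secondary-structure-aware variant might use different smoothing strengths for helices and turns; that choice is not specified here.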


[Figure: the smoothed Cα curve, with helix and turn segments labeled.]


Matching Two Curves

Are they similar?


Curvature and Torsion

• Curvature: a measure of how far the curve deviates from being linear
• Torsion: a measure of how far the curve deviates from being planar
• Fundamental Theorem of Space Curves: If two single-valued continuous functions κ(s) and τ(s) are given for s > 0, then there exists exactly one space curve, determined except for orientation and position in space (i.e., up to a Euclidean motion), where s is the intrinsic arc length, κ is the curvature, and τ is the torsion.
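For a curve r(t) with derivatives r', r'', r''', the standard formulas are κ = |r' × r''| / |r'|³ and τ = (r' × r'') · r''' / |r' × r''|². The sketch below evaluates them directly. The function name `curvature_torsion` is hypothetical, and the derivatives are supplied analytically for a helix; how they are obtained in practice (for example from a fitted spline) is an assumption left open here.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def curvature_torsion(d1, d2, d3):
    """Return (kappa, tau) from the first three derivatives of a space curve."""
    c = cross(d1, d2)
    c_norm = math.sqrt(dot(c, c))
    kappa = c_norm / math.sqrt(dot(d1, d1)) ** 3
    tau = dot(c, d3) / c_norm**2  # undefined where the curvature is zero
    return kappa, tau


# Helix r(t) = (a cos t, a sin t, b t): kappa = a / (a^2 + b^2), tau = b / (a^2 + b^2).
a, b, t = 2.0, 1.0, 0.7
d1 = (-a * math.sin(t), a * math.cos(t), b)
d2 = (-a * math.cos(t), -a * math.sin(t), 0.0)
d3 = (a * math.sin(t), -a * math.cos(t), 0.0)
kappa, tau = curvature_torsion(d1, d2, d3)
```

For this helix both values are constant along the curve, which is why an ideal α-helix would show up as a flat stretch in a curvature or torsion profile.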


Curvature and Torsion

• They are invariant to rotation and translation.
• They are localized.

[Figure: two plots over positions 1 to 39 along the chain. Curvature ranges from 0 to 0.12; torsion ranges from -0.08 to 0.04.]
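The invariance claim can be checked numerically. The sketch below uses a swapped-in discrete estimate, the Menger curvature of three consecutive points (the reciprocal of the radius of the circle through them), rather than the spline-based computation of the talk, and covers curvature only. The names `menger_curvature` and `rigid_motion` are hypothetical.

```python
import math


def menger_curvature(p, q, r):
    """Curvature of the circle through three points: 4 * area / (|pq| |qr| |rp|)."""
    a, b, c = math.dist(p, q), math.dist(q, r), math.dist(r, p)
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    cx = (u[1] * v[2] - u[2] * v[1],
          u[2] * v[0] - u[0] * v[2],
          u[0] * v[1] - u[1] * v[0])
    area = 0.5 * math.sqrt(sum(x * x for x in cx))  # triangle area
    return 4.0 * area / (a * b * c)


def rigid_motion(p, angle, shift):
    """Rotate p about the z-axis by angle, then translate by shift."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])


# Three points on a circle of radius 2, so the curvature should be 1/2.
points = [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (-2.0, 0.0, 0.0)]
moved = [rigid_motion(p, 0.9, (5.0, -3.0, 1.5)) for p in points]

k_before = menger_curvature(*points)
k_after = menger_curvature(*moved)
```

Because the estimate depends only on distances between neighboring points, it is also localized: moving a distant part of the chain leaves it unchanged.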


Feature Extraction

• For each amino acid, a (curvature, torsion) tuple is computed, and the secondary structure assignment is gathered from the PDB web site.
• Together these form a sequence of n three-dimensional feature vectors, where n is the number of amino acids in the protein.

[Figure: scatter plot of the features, Curvature (x-axis) against Torsion (y-axis), both axes ticked from 0 to 250. The secondary structure information (third dimension) is not shown.]


Indexing the Features

• Why is indexing necessary?
• Hash table (shown in 2D below; the third dimension is the SSE type)

[Figure: the Curvature (x-axis) and Torsion (y-axis) plane, both axes ticked from 0 to 250, with one cell labeled "A Hash Bin".]
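One way to realize such a hash table is to quantize curvature and torsion onto a grid and use the cell, together with the SSE type, as the key. The sketch below assumes a uniform bin width (`step`), which the slides do not specify; `bin_key` and `build_index` are hypothetical names.

```python
import math
from collections import defaultdict


def bin_key(curvature, torsion, sse, step=0.01):
    """Map one feature vector to its hash bin (grid cell plus SSE type)."""
    return (math.floor(curvature / step), math.floor(torsion / step), sse)


def build_index(proteins, step=0.01):
    """proteins: {protein_id: [(curvature, torsion, sse), ...]} -> hash table."""
    index = defaultdict(list)
    for protein_id, features in proteins.items():
        for position, (curvature, torsion, sse) in enumerate(features):
            index[bin_key(curvature, torsion, sse, step)].append((protein_id, position))
    return index


# Made-up feature values for two tiny chains ("H" = helix, "T" = turn).
proteins = {
    "p1": [(0.052, -0.031, "H"), (0.055, -0.035, "H")],
    "p2": [(0.058, -0.039, "H"), (0.012, 0.021, "T")],
}
index = build_index(proteins)
print(index[(5, -4, "H")])
```

Residues with similar local geometry and the same SSE type land in the same bin, so a query feature only has to be compared against the entries of one bin instead of the whole database.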


Query Execution

Hierarchical approach: Pruning before detailed pairwise alignment

[Figure: screening pipeline: hash table → accumulate vote (vote_protein++) → normalize vote (vote_protein / length_protein) → threshold.]
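The screening steps named on this slide can be sketched as follows. The layout of the index (bin key mapped to a list of protein ids, one entry per residue in that bin) and the function name `screen` are assumptions made for illustration; only the vote, normalize, and threshold steps come from the slide.

```python
from collections import Counter


def screen(query_keys, index, lengths, threshold):
    """Return the ids of database proteins that pass the voting threshold."""
    votes = Counter()
    for key in query_keys:
        for protein_id in index.get(key, ()):
            votes[protein_id] += 1  # accumulate vote: vote_protein++
    # Normalize vote: vote_protein / length_protein
    scores = {p: v / lengths[p] for p, v in votes.items()}
    return sorted(p for p, s in scores.items() if s >= threshold)


# Toy index: bin "a" holds two residues of p1 and one of p2, and so on.
index = {"a": ["p1", "p1", "p2"], "b": ["p1"], "c": ["p3"]}
lengths = {"p1": 4, "p2": 10, "p3": 2}
candidates = screen(["a", "b"], index, lengths, threshold=0.5)
```

Here p1 collects 3 votes over 4 residues (0.75) and p2 collects 1 vote over 10 residues (0.1), so only p1 would go on to the detailed pairwise alignment.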


Query Execution

• Pairwise alignment by the Smith-Waterman (SW) dynamic programming technique is performed after the screening process.

[Figure: distance matrix between 1fse:A and 1l3l:C, aligned by SW with gaps. Alignment length: 63, RMSD: 1.61 Å.]
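A minimal Smith-Waterman sketch over two feature sequences is given below. The scoring function and the linear gap penalty are assumptions; the slides state only that a distance matrix feeds the dynamic program. The function name `smith_waterman` and the toy score are hypothetical.

```python
def smith_waterman(a, b, score, gap=1.0):
    """Local alignment of sequences a and b; returns (best score, aligned index pairs)."""
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,                                          # start a new local alignment
                H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                H[i - 1][j] - gap,                            # gap in b
                H[i][j - 1] - gap,                            # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy 1D features; similar values score +1, a difference of 1 scores -1, and so on.
def toy_score(x, y):
    return 1 - 2 * abs(x - y)


best, pairs = smith_waterman([9, 1, 2, 3, 8], [1, 2, 3], toy_score)
```

With real features, `score` could be derived from the distance between two (curvature, torsion, SSE) vectors, for example a constant minus the distance, so that close features reward the alignment and distant ones penalize it.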


SW Alignment Result

1fse:A

1l3l:C


Sample Query Results

• Query: 1faz:A, database: 1938 protein chains
• Screening time: 18 seconds
• Pairwise alignment time: 29 seconds

length: 42, RMSD: 2.8 Å

1faz:A & 1ytf:D

length: 38, RMSD: 3.68 Å

1faz:A & 1dj7:A


Sample Query Results

• Query: 1b16:A, database: 1938 protein chains
• Screening time: 25 seconds
• Pairwise alignment time: 68 seconds

length: 35, RMSD: 3.26 Å

1b16:A & 1h05:A

length: 35, RMSD: 1.58 Å

1b16:A & 1qp8:A


Current and Future Work

• Evaluation of accuracy: comparison with SCOP classification
• Efficiency: comparison with other techniques such as CE or DALI
• Better index structures: faster and more accurate screening of candidates
• Incorporating biological and chemical properties of amino acids into the structure signatures of proteins


Conclusions

• A new method for protein structure alignment is presented.
• The extracted structural features are:
  • Compact: O(n)
  • Localized: computed for each amino acid
  • Robust: error handling by spline approximation
  • Invariant: suitable for indexing
  • Meaningful: biological and chemical properties can be incorporated easily
• An indexing technique is deployed to avoid an exhaustive scan of the structure database.
• Experimental results show that this method is suitable for finding structural motifs.


Thank you for your attention!

For More Information:

Tolga Can
Department of Computer Science
University of California at Santa Barbara
Santa Barbara, CA 93106, U.S.

Email: [email protected]
http://www.cs.ucsb.edu/~tcan/CTSS/