Finding similarities in protein structures: a string approach

IFBM 2004

• Atelier de Bio-Informatique (ABI) - Université Paris VI

Searching for strictly repeated patterns: KMR (1972)

- Occurrences of strictly repeated 2k-length patterns are built from intersections of sets of strictly repeated k-length patterns that lie side by side

- This leads to an O(n.log(kmax)) algorithm for finding kmax-length repeated patterns

[Figure: two side-by-side k-length patterns combine into one 2k-length pattern]
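The doubling idea can be sketched in a few lines of Python. This is only an illustration, not the original KMR implementation: it labels patterns with a dictionary instead of sorting or stacks, positions are 0-based, and the function name `kmr_repeats` is made up for the example. When 2k would overshoot kmax, the two k-length halves are simply overlapped.

```python
def kmr_repeats(s, kmax):
    """Return {pattern: [0-based positions]} for every kmax-length pattern
    that occurs at least twice in s (doubling sketch, hash-based labels)."""
    n = len(s)
    ids = {}
    # Length-1 labels: one integer per distinct symbol.
    labels = [ids.setdefault(ch, len(ids)) for ch in s]
    k = 1
    while k < kmax:
        # Overlap the two halves when 2k would be longer than kmax.
        step = min(k, kmax - k)
        new_k = k + step
        ids = {}
        # A pattern of length new_k at i is identified by the pair of
        # k-length labels found at i and at i + step.
        labels = [
            ids.setdefault((labels[i], labels[i + step]), len(ids))
            for i in range(n - new_k + 1)
        ]
        k = new_k
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    return {s[pos[0]:pos[0] + kmax]: pos
            for pos in groups.values() if len(pos) > 1}


print(kmr_repeats("abcabcab", 3))
```

Each pass at least doubles the pattern length until kmax is reached, which is where the log(kmax) factor comes from.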

Flexible patterns: KMRC

• A pattern is no longer defined as a succession of symbols; instead it is a succession of cliques of symbols

S = caaaabaaacb

Cliques of symbols: c1 = {a, b}, c2 = {b, c}

sC = 2 1 1 1 1 1 1 1 1 2 1, with c2 also present at positions 6 and 11 (symbol b belongs to both cliques)

Here m = c1-c1-c2 is a repeated flexible pattern of length 3, occurring at positions 4 and 8

Note: several patterns may exist at the same position: here c1-c1-c1 also has an occurrence at position 4


• As in KMR, the 2k-length patterns are built from k-length patterns.

• Here, a pattern is a clique of (similar) patterns, and at one position in the string there may exist several cliques of patterns.

• The algorithm is now O(n.log(k).g^k), g being the mean degeneracy, i.e. the mean number of cliques a symbol belongs to
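A minimal sketch of the flexible version, run on the example string S = caaaabaaacb. It assumes the cliques c1 = {a, b} and c2 = {b, c}, which is consistent with the occurrences listed for this example; the function name `kmrc_repeats` and the tuple representation of patterns are illustrative choices, not the authors' code. Positions are 1-based, as on the slides.

```python
def kmrc_repeats(s, cliques, k):
    """Return {clique pattern: [1-based positions]} for every k-length
    pattern of cliques that occurs at least twice in s."""
    # Cliques each symbol belongs to (degeneracy = number of cliques).
    member = {ch: sorted(name for name, syms in cliques.items() if ch in syms)
              for ch in set(s)}
    # pats[i] = set of clique patterns of the current length starting at i.
    pats = [{(c,) for c in member[ch]} for ch in s]
    length = 1
    while length < k:
        step = min(length, k - length)   # overlap halves if 2*length > k
        keep = length - step             # size of the overlap
        new_len = length + step
        pats = [
            {p + q[keep:]
             for p in pats[i] for q in pats[i + step]
             if p[step:] == q[:keep]}    # halves must agree on the overlap
            for i in range(len(s) - new_len + 1)
        ]
        length = new_len
    occ = {}
    for i, patterns in enumerate(pats):
        for p in patterns:
            occ.setdefault(p, []).append(i + 1)
    return {p: pos for p, pos in occ.items() if len(pos) > 1}


cliques = {"c1": {"a", "b"}, "c2": {"b", "c"}}
found = kmrc_repeats("caaaabaaacb", cliques, 3)
print(found[("c1", "c1", "c2")])   # positions of c1-c1-c2
```

Because one position can carry several patterns, the sets in `pats` can grow with the degeneracy g, which is the source of the extra g^k factor.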

In biology, identity is trivial, similarity is interesting…

Flexible patterns in sequences (KMRC)

• This algorithm may be used to find flexible patterns in several protein sequences (multiple alignment by blocks)

• Cliques of “symbols” define similarity, e.g.:
- different overlapping sets of amino acids clustered by their properties (e.g. hydrophobic, hydrophilic, small, large, polar, charged, etc.)
- different overlapping sets of amino acids clustered by setting a threshold value on their score in a similarity matrix (e.g. BLOSUM or PAM)

[Figure: sequences p1 to p5; PAM250 matrix, threshold -> cliques]

Similarity is not transitive!
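The "threshold -> cliques" step can be illustrated as follows. The scores below are invented toy values on three abstract symbols, not PAM250 entries, and the helper names are hypothetical. Maximal cliques are enumerated with a plain Bron-Kerbosch search, which is one standard way to do it; the slides do not say which method was used.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques of an undirected graph given as
    {node: set of neighbours} (Bron-Kerbosch, no pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def cliques_from_scores(scores, threshold):
    """Link two symbols when their score reaches the threshold,
    then return the maximal cliques of the resulting graph."""
    symbols = sorted({sym for pair in scores for sym in pair})
    adj = {sym: set() for sym in symbols}
    for (a, b), value in scores.items():
        if a != b and value >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return sorted(sorted(c) for c in maximal_cliques(adj))


# Toy scores (made up): a~b and b~c are similar, a~c is not.
toy_scores = {("a", "b"): 2, ("b", "c"): 2, ("a", "c"): -1}
print(cliques_from_scores(toy_scores, threshold=1))
```

With these toy values the result is two overlapping cliques sharing b: a is similar to b and b to c, yet a and c never share a clique, which is the non-transitivity the slide points out.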

Flexible patterns in structures (KMRC)

• Finding 3D structural patterns in several protein structures

• Structures must be described as strings of symbols, and similar structures must be composed of similar symbols

• -> use of discretized internal coordinates (angles) as an alphabet

[Figure: internal coordinates -> discretization -> symbols]

Absolute need of similarity (KMRC), not identity!

Finding flexible patterns of angle symbols:

Here, cliques of “symbols” are overlapping sets of angles

Similarity is a critical point (identity would miss structural features).

[Figure: angle axis from -180° through 0° to 180°]
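A small sketch of why identity is not enough once angles are discretized. The bin width (10°) and the clique span (2 consecutive bins) are assumptions chosen for the example; the slides do not give the actual values. Two angles are treated as similar when their bins share at least one clique, with wrap-around at ±180°.

```python
def angle_to_symbol(angle_deg, bin_width=10):
    """Map an angle in degrees to the index of its bin on [-180, 180)."""
    return int(((angle_deg + 180.0) % 360.0) // bin_width)


def cliques_of(symbol, n_bins, span=2):
    """Cliques are windows of `span` consecutive bins on the circle,
    named by their first bin; return those containing `symbol`."""
    return {(symbol - offset) % n_bins for offset in range(span)}


def similar_angles(a, b, bin_width=10, span=2):
    """True when the two angles share at least one clique of bins."""
    n_bins = 360 // bin_width
    sa = angle_to_symbol(a, bin_width)
    sb = angle_to_symbol(b, bin_width)
    return bool(cliques_of(sa, n_bins, span) & cliques_of(sb, n_bins, span))


# 9 and 11 degrees fall in different bins but share a clique.
print(angle_to_symbol(9), angle_to_symbol(11), similar_angles(9, 11))
```

Under strict identity, 9° and 11° would be different symbols even though they are 2° apart; the overlapping cliques repair this boundary effect.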


3 cytochromes P450: TERP, BM3, CAM

PMWIATKHADVMQIGVTRYLSSQRLIKEACGHWIATRGQLIREAY

PTHTAYRGLTLNWFQPASIRKLEENIRRIAQASVQRKNWKKAHNILLPSFSQQAMKGYHAMMVDIAVQLVQKPEQRQFRALANQVVGMPVVDKLENRIQELACSLIES

CDFMTDCALYYPLHVVMTALGVPIEVPEDMTRLTLDTIGLCGFNYRCNFTEDYAEPFPIRIFMLLAGLP

EDDEPLMLKLTQDFITSMVRALDEAMNKEDIPHLKYLTDQMT

FHETIATFYDYFNGFTVDRRSFQEDIKVMNDLVDKIIADRKAFAEAKEALYDYLIPIIEQRRQ

CPKDDVMSLLANEQSDDLLTHMLNKPGTDAISIVAN

Bach BWV846

Similarity: nature of elements OR relations between them

Similarity: series of notes with the same pitch? -> rather, transposed series…

-> Similarity of relations between elements

Relational patterns (KMRR)

• Now, a similar pattern is not defined by a succession of similar symbols, but instead by a succession of elements that share the same relationships between them.

[Figure: pattern m, three elements linked by relations r12, r23 and r13, found in several occurrences]

Examples of relations: “to be higher than”, “to be lower than”, “to be equal to”, …
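To make the relational idea concrete, here is a naive sketch: each window of k elements is summarized by the relations between all its pairs, and windows with the same summary are occurrences of the same relational pattern. It enumerates windows directly rather than doubling as KMRR would, and the pitch values are invented MIDI-like numbers, not taken from the BWV 846 score.

```python
from itertools import combinations


def relation(x, y):
    """Relation of the later element y to the earlier element x."""
    if y > x:
        return "higher"
    if y < x:
        return "lower"
    return "equal"


def relational_signature(window):
    """Relations r12, r13, ..., r23, ... between all pairs of the window."""
    return tuple(relation(window[i], window[j])
                 for i, j in combinations(range(len(window)), 2))


def relational_repeats(seq, k):
    """Group the k-length windows of seq by relational signature and
    keep the signatures seen at least twice (0-based positions)."""
    occ = {}
    for i in range(len(seq) - k + 1):
        occ.setdefault(relational_signature(seq[i:i + k]), []).append(i)
    return {sig: pos for sig, pos in occ.items() if len(pos) > 1}


# Two rising three-note figures at different pitches (toy values).
notes = [60, 64, 67, 62, 65, 69]
print(relational_repeats(notes, 3))
```

The two rising figures share no pitch, so a symbol-based search would miss them, while their pairwise relations are identical.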

One step further: Flexible relational patterns (KMRR)

• The relations between elements do not need to be the same, they just need to be similar

[Figure: pattern m with relations ra, rb, rc; occurrences matched through relational cliques CR1 and CR2]

Relational cliques = cliques of relations, e.g. {“to be higher”, “to be equal”}, {“to be lower”, “to be equal”}, …

An example: application to 3D structures

Relations are defined on discrete internal distances between points:
r(i,j) = rk if and only if rk ≤ dist(i,j) < rk + ∆

The relations r(i,j) and r(i’,j’) are considered similar if they belong to the same subset {rk, rk+1, rk+2}, i.e. if their indices differ by at most 2:
| r(i,j) - r(i’,j’) | ≤ 2

This implies for Euclidean distances:
| dist(i,j) - dist(i’,j’) | < 3∆

[Figure: distance axis divided into bins r1, r2, r3, …, rk, rk+1, rk+2, with d(i,j) and d(i’,j’) falling into nearby bins]
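The discretization and the similarity test translate directly into code. The sketch assumes bins that start at distance 0, so that the relation is the integer index k with k∆ ≤ dist < (k+1)∆; the function names are illustrative.

```python
import math


def relation_index(p, q, delta):
    """Index k of the distance bin [k*delta, (k+1)*delta) containing
    the Euclidean distance between points p and q."""
    return int(math.dist(p, q) // delta)


def similar_relations(r1, r2, max_gap=2):
    """Two relations are similar when their bin indices differ by at most
    max_gap, i.e. they fit in one subset {rk, rk+1, rk+2} for max_gap=2."""
    return abs(r1 - r2) <= max_gap


delta = 1.0
origin = (0.0, 0.0, 0.0)
r_a = relation_index(origin, (1.0, 0.0, 0.0), delta)   # bin 1
r_b = relation_index(origin, (3.9, 0.0, 0.0), delta)   # bin 3
print(r_a, r_b, similar_relations(r_a, r_b))
```

In this example the two distances differ by 2.9, which stays under the 3∆ bound even though the bins are two apart.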

Application to 3D structures:

Relational cliques (defined on distances): overlapping subsets of the relation indices 1 2 3 4 5 6 7 8 9 10 …

[Figure: points p1 to p9 with discretized distances on the links between them]

r(p1,p2)=3  r(p1,p3)=4  r(p1,p4)=5  r(p2,p3)=4  r(p2,p4)=3  r(p3,p4)=3

r(p6,p7)=4  r(p6,p8)=3  r(p6,p9)=6  r(p7,p8)=4  r(p7,p9)=4  r(p8,p9)=2
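Using the relation values listed for this example, one can check that (p1, p2, p3, p4) and (p6, p7, p8, p9) are two occurrences of the same flexible relational pattern. The sketch compares corresponding pairs and accepts an index gap of at most 2, the clique width used for distances earlier; the function name and data layout are assumptions.

```python
from itertools import combinations

# Discretized distances copied from the example.
REL_A = {("p1", "p2"): 3, ("p1", "p3"): 4, ("p1", "p4"): 5,
         ("p2", "p3"): 4, ("p2", "p4"): 3, ("p3", "p4"): 3}
REL_B = {("p6", "p7"): 4, ("p6", "p8"): 3, ("p6", "p9"): 6,
         ("p7", "p8"): 4, ("p7", "p9"): 4, ("p8", "p9"): 2}


def occurrences_match(rel_a, points_a, rel_b, points_b, max_gap=2):
    """True when every pair of points in occurrence A has a relation
    similar to the relation of the corresponding pair in occurrence B."""
    return all(
        abs(rel_a[(points_a[i], points_a[j])]
            - rel_b[(points_b[i], points_b[j])]) <= max_gap
        for i, j in combinations(range(len(points_a)), 2)
    )


first = ["p1", "p2", "p3", "p4"]
second = ["p6", "p7", "p8", "p9"]
print(occurrences_match(REL_A, first, REL_B, second))
```

All six corresponding relations differ by exactly one bin, so the match holds under similarity but would fail under strict identity (`max_gap=0`).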

Application to 3D structures:

1PCL: AEWDAAVIDNSTNVWVDHVT
1IDJ: WGGDAITLDDCDLVWIDHVT
2BSP: SQYDNITINGGTHIWIDHCT
1PLU: KDGDMIRVDDSPNVWVDHNE

1PCL: LRVTFHNNVFDRVTERAPRV
1IDJ: DLVTMKGNYIYHTSGRSPKV
2BSP: LKITLHHNRYKNIVQKAPRV
1PLU: RNITYHHNYYNDVNARLPLQ

1PCL: TERAPRVRFGSIHAYNNVYL
1IDJ: GRSPKVQDNTLLHCVNNYFY
2BSP: VQKAPRVRFGQVHVYNNYYE
1PLU: NARLPLQRGGLVHAYNNLYT

1PCL: AQTMTSSLATSINNNAGYGK
1IDJ: SASAYTSVASRVVANAGQGN
2BSP: SIDASANVKSNVINQAGAGK
1PLU: SPVSAQCVKDKLPGYAGVGK

Ex: multiple structural alignment of 4 pectate and pectin lyases: 1PCL, 1IDJ, 2BSP, 1PLU

Structural similarity search in databases: YAKUSA

[Figure: a query structure is scanned quickly against the database (PDB)]

Structures encoded with angles

For each database protein:
1 - find seeds that are structurally similar to the query
2 - extend seeds to longer structural matches

Then rank the structural hits

Seeds of length k

All overlapping k-patterns of the query -> automaton

[Figure: dictionary/automaton built from the query seeds]

Leaf -> one seed

Advantage: no moving backward in the database string; less moving backward in the patterns

Query seeds automaton

[Figure: dictionary/automaton extended with degeneracy, so that it also accepts patterns similar to the query seeds: two symbols match when their distance dc is below a threshold]

Matching seeds to SHSP

[Figure: a seed shared by the query and the database structure, extended into an SHSP]

- many seeds, giving only a few SHSPs

- SHSP = maximal scoring region around the seed, the scores being based upon the angular differences

- probability associated with SHSP (MTD: pair approximation of Markov model)
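A sketch of the extension step: starting from a seed, the match is extended to the left and to the right along the diagonal, and the extension with the best cumulative score is kept on each side. The scoring function (1 minus the angular difference divided by a 30° tolerance) is a hypothetical choice made for the example; the slides only state that scores are based on angular differences.

```python
def angular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def pair_score(a, b, tolerance=30.0):
    """Hypothetical score: positive below the tolerance, negative above."""
    return 1.0 - angular_diff(a, b) / tolerance


def extend_seed(query, target, qpos, tpos, k, tolerance=30.0):
    """Return (score, query_start, target_start, length) of the maximal
    scoring region around the seed of length k at (qpos, tpos)."""
    def best_extension(step, q, t):
        best = total = 0.0
        best_len = n = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            total += pair_score(query[q], target[t], tolerance)
            n += 1
            if total > best:                 # keep the best prefix only
                best, best_len = total, n
            q += step
            t += step
        return best, best_len

    seed = sum(pair_score(query[qpos + j], target[tpos + j], tolerance)
               for j in range(k))
    right, right_len = best_extension(+1, qpos + k, tpos + k)
    left, left_len = best_extension(-1, qpos - 1, tpos - 1)
    return (seed + left + right, qpos - left_len, tpos - left_len,
            k + left_len + right_len)


query_angles = [10, 20, 30, 40, 50, 60]
target_angles = [100, 20, 30, 40, 50, -120]
print(extend_seed(query_angles, target_angles, 2, 2, 2))
```

Overlapping seeds on the same diagonal extend to the same region, which is one reason why many seeds give only a few SHSPs.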

SHSPs to results…

Database structures found, ranked by score

http://bioserv.rpbs.jussieu.fr/Yakusa

40 seconds for scanning 11000 PDB structures

Example of output:

[Figure: SHSPs between the query (Deoxyribonuclease I) and a database structure (Heat-Shock protein 70)]

Multiple structural alignment: « m-diagonals »

First step: finding pair “diagonals” (at the angle level), then selection of best 2-diagonals

[Figure: dot plot; horizontal axis: index in first protein, vertical axis: index in second protein]

Second step: combination of 2-diagonals into m-diagonals (in m dimensions).

[Figure: three axes, Protein 1, Protein 2 and Protein 3]

m-diagonal in m dimensions: here, the 3-diagonal is the combination of three 2-diagonals (each of dimension 2).


“Column” graphs:
- a residue is a node
- if two residues are in a 2-diagonal, they are connected by a link

[Figure: column graph linking residues of Protein 1, Protein 2 and Protein 3]

Selection of best m-diagonals (most connected ones)
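The column graph can be sketched as follows. A 2-diagonal is represented here as a tuple (protein A, start in A, protein B, start in B, length), which is an assumed encoding. Residues are nodes, each aligned pair inside a 2-diagonal adds a link, and a column is kept when it has one residue per protein and all m(m-1)/2 possible links, a simple reading of "most connected".

```python
from collections import defaultdict


def column_graph(diagonals):
    """Nodes are (protein, index) residues; link residues aligned
    by a 2-diagonal."""
    links = defaultdict(set)
    for prot_a, start_a, prot_b, start_b, length in diagonals:
        for off in range(length):
            u = (prot_a, start_a + off)
            v = (prot_b, start_b + off)
            links[u].add(v)
            links[v].add(u)
    return links


def columns(links):
    """Connected components of the graph, each with its number of links."""
    seen, result = set(), []
    for start in sorted(links):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in links[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        n_links = sum(len(links[n]) for n in comp) // 2
        result.append((sorted(comp), n_links))
    return result


def full_columns(diagonals, m):
    """Columns covering all m proteins with every pair linked."""
    need = m * (m - 1) // 2
    return [comp for comp, n in columns(column_graph(diagonals))
            if len(comp) == m and n == need]


# Three mutually consistent 2-diagonals plus one isolated pair diagonal.
diags = [(0, 2, 1, 5, 3), (1, 5, 2, 1, 3), (0, 2, 2, 1, 3), (0, 10, 1, 20, 2)]
print(full_columns(diags, 3))
```

Consecutive full columns along the three mutually consistent 2-diagonals form the 3-diagonal; the isolated pair diagonal yields only 2-residue columns and is left out.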



Example: 13 cytochromes P450

- In blue: non-aligned parts

- Other colors: m-diagonals

Other method: « Gibbs sampling », « Taboo search »

Alain Viari

Henry Soldano

Mathilde Carpentier

Marie-France Sagot

Pascale Jean

Sophie Brouillet

Nadia Pisanti

Kmrc+gok

Cytochromes P450

Yakusa

m-diagonals

Gibbs-taboo

relational patterns

Near future: classification of structural cores in PDB…

[Figure: KMR stack construction on the string a a b a a b a (positions 1 2 3 4 5 6 7)]

k=1, k-length stacks (V1, P1):
_a: 7 5 4 2 1
_b: 6 3

Look at +k, put position in set stack (Q2):
_a: 1 4 ; 6 3
_b: 2 5 ;

k=2, 2k-length stacks (V2, P2), patterns at positions 1 to 6: aa ab ba aa ab ba
aa: 4 1
ab: 5 2
ba: 6 3
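The stack figure can be retraced with a short sketch of one doubling step. Positions are 1-based as in the figure; Python lists stand in for the stacks, so positions come out in increasing order rather than in pop order. The names `kmr_double`, `p_stacks` and `q_stacks` are illustrative.

```python
def kmr_double(labels, k):
    """One KMR step: from the k-length class label of each position,
    build the classes of 2k-length patterns as {(first, second): positions}."""
    n = len(labels)
    # P stacks: positions grouped by their k-length class.
    p_stacks = {}
    for pos in range(1, n + 1):
        p_stacks.setdefault(labels[pos - 1], []).append(pos)
    # Q stacks: look at position + k and file the position under the class
    # found there, remembering which P stack it came from.
    q_stacks = {}
    for first, positions in p_stacks.items():
        for pos in positions:
            if pos + k <= n:
                second = labels[pos + k - 1]
                q_stacks.setdefault(second, []).append((first, pos))
    # Positions sharing both the P class and the Q class form a 2k class.
    classes = {}
    for second, entries in q_stacks.items():
        for first, pos in entries:
            classes.setdefault((first, second), []).append(pos)
    return classes


print(kmr_double(list("aabaaba"), 1))
```

For the string a a b a a b a this gives the classes aa at positions 1 and 4, ab at 2 and 5, and ba at 3 and 6.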
