Finding similarities in protein structures: a string approach IFBM 2004 Atelier de Bio-Informatique (ABI) - Université Paris VI

Upload: todd-higgins

Post on 14-Dec-2015


TRANSCRIPT

Page 1

Finding similarities in protein structures:

a string approach

IFBM 2004

• Atelier de Bio-Informatique (ABI) - Université Paris VI

Page 2

Searching for strictly repeated patterns: KMR (Karp, Miller and Rosenberg, 1972)

- Occurrences of strictly repeated patterns of length 2k are built by intersecting the sets of occurrences of strictly repeated patterns of length k that lie side by side.

This leads to an O(n log kmax) algorithm for finding repeated patterns of length kmax.

[Figure: two adjacent patterns of length k combine into one pattern of length 2k.]
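The doubling step can be sketched in Python as follows. This is a minimal illustration, not the original KMR implementation: the names `kmr_labels` and `repeated_patterns` are invented for this example, positions are 0-based, and the relabelling relies on sorting and dictionaries rather than the stack bookkeeping needed to reach the stated bound. When kmax is not a power of two, the last step overlaps the two halves.

```python
from collections import defaultdict


def kmr_labels(s, kmax):
    """Give equal labels to positions that start equal substrings of length kmax."""
    n = len(s)
    # k = 1: two positions are equivalent when they carry the same symbol.
    ranks = {c: r for r, c in enumerate(sorted(set(s)))}
    labels = [ranks[c] for c in s]
    k = 1
    while k < kmax:
        step = min(k, kmax - k)      # overlap the halves on the last step if needed
        new_len = k + step
        # A substring of length k + step is identified by two length-k labels.
        pairs = [(labels[i], labels[i + step]) for i in range(n - new_len + 1)]
        ranks = {p: r for r, p in enumerate(sorted(set(pairs)))}
        labels = [ranks[p] for p in pairs]
        k = new_len
    return labels


def repeated_patterns(s, k):
    """Map each substring of length k that occurs at least twice to its positions."""
    groups = defaultdict(list)
    for pos, label in enumerate(kmr_labels(s, k)):
        groups[label].append(pos)
    return {s[p[0]:p[0] + k]: p for p in groups.values() if len(p) > 1}
```

For example, `repeated_patterns("aabaaba", 2)` returns `{'aa': [0, 3], 'ab': [1, 4], 'ba': [2, 5]}`.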

Page 3

Flexible patterns: KMRC

• A pattern is no longer defined as a succession of symbols; instead it is a succession of cliques of symbols.

S = caaaabaaacb

Cliques over the alphabet {a, b, c}: c1 = {a, b}, c2 = {b, c}

sC = 2 1 1 1 1 {1,2} 1 1 1 2 {1,2} (clique indices at each position of S; b belongs to both cliques)

Here m = c1-c1-c2 is a repeated flexible pattern of length 3, occurring at positions 4 and 8.

Note: several patterns may exist at the same position; here c1-c1-c1 also has an occurrence at position 4.
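The example on this slide can be checked with a short Python sketch. It assumes the cliques c1 = {a, b} and c2 = {b, c}, which is what the slide's figure appears to show, and it enumerates clique patterns naively window by window instead of using the KMRC doubling scheme. The names `CLIQUES`, `cliques_of` and `flexible_patterns` are invented for the illustration; positions are 1-based as on the slide.

```python
from collections import defaultdict
from itertools import product

# Assumed cliques, read off the slide's example.
CLIQUES = {"c1": {"a", "b"}, "c2": {"b", "c"}}


def cliques_of(symbol):
    """Names of all cliques that contain the symbol."""
    return sorted(name for name, members in CLIQUES.items() if symbol in members)


def flexible_patterns(s, k):
    """Map each clique pattern of length k occurring at least twice to its positions."""
    occurrences = defaultdict(list)
    for i in range(len(s) - k + 1):
        # Every choice of one clique per symbol gives a pattern present at i.
        for pattern in product(*(cliques_of(c) for c in s[i:i + k])):
            occurrences[pattern].append(i + 1)   # 1-based position
    return {p: pos for p, pos in occurrences.items() if len(pos) > 1}
```

With `S = "caaaabaaacb"`, `flexible_patterns(S, 3)[("c1", "c1", "c2")]` is `[4, 8]`, and position 4 also appears in the list for `("c1", "c1", "c1")`.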

Page 4

Flexible patterns : KMRC

• As in KMR, the patterns of length 2k are built from patterns of length k.

• Here, a pattern is a clique of (similar) patterns, and at one position in the string there may exist several cliques of patterns.

The algorithm is now O(n·log(k)·g^k), g being the mean degeneracy, i.e. the mean number of cliques a symbol belongs to.

In biology, identity is trivial, similarity is interesting…
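To get a feel for the degeneracy term, here is a small Python sketch. It assumes two toy cliques c1 = {a, b} and c2 = {b, c} and reads the growth factor as g to the power k; the function names are invented for this example.

```python
from math import prod

# Assumed toy cliques; any overlapping cover of the alphabet would do.
CLIQUES = {"c1": {"a", "b"}, "c2": {"b", "c"}}


def degeneracy(symbol):
    """Number of cliques the symbol belongs to."""
    return sum(symbol in members for members in CLIQUES.values())


def mean_degeneracy(s):
    """Mean number of cliques per symbol over the string s."""
    return sum(degeneracy(c) for c in s) / len(s)


def clique_patterns_at(s, i, k):
    """Number of distinct clique patterns of length k present at 0-based position i."""
    return prod(degeneracy(c) for c in s[i:i + k])
```

On the string caaaabaaacb only b is degenerate, so the mean is 13/11, and the window "aab" carries two clique patterns. With heavily overlapping cliques the count per window grows quickly with k.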

Page 5

Flexible patterns in sequences (KMRC)

• This algorithm may be used to find flexible patterns in several protein sequences (multiple alignment by blocks):

• Cliques of “symbols” define similarity, e.g.:

- different overlapping sets of amino acids clustered by their properties (e.g. hydrophobic, hydrophilic, small, large, polar, charged, etc.)

- different overlapping sets of amino acids clustered by setting a threshold value on their score in a similarity matrix (e.g. BLOSUM or PAM)

[Figure: p1 p2 p3 p4 p5]

Page 6

PAM250

Threshold -> cliques

Similarity is not transitive!
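Going from a threshold to cliques can be sketched as a search for maximal cliques in the graph whose edges join symbols scoring at or above the threshold. The sketch below uses a plain Bron-Kerbosch recursion, chosen here for brevity and not stated on the slide. The score table is made up for four placeholder symbols and does not contain PAM250 values.

```python
# Made-up scores for four placeholder symbols; these are NOT PAM250 values.
TOY_SCORES = {
    ("u", "v"): 3, ("v", "w"): 3, ("u", "w"): -1,
    ("u", "z"): -2, ("v", "z"): 0, ("w", "z"): -1,
}


def score(x, y):
    """Symmetric lookup in the toy table."""
    return TOY_SCORES.get((x, y), TOY_SCORES.get((y, x)))


def maximal_cliques(symbols, threshold):
    """Maximal sets of symbols that are pairwise similar (score >= threshold)."""
    adj = {x: {y for y in symbols if y != x and score(x, y) >= threshold}
           for x in symbols}
    found = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(symbols), set())
    return found
```

With a threshold of 2, u is similar to v and v to w, but u is not similar to w. The result is therefore the two overlapping cliques {u, v} and {v, w}, plus the singleton {z}, rather than one merged class.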

Page 7

Flexible patterns in structures (KMRC)

• Finding 3D structural patterns in several protein structures

• Structures must be described as strings of symbols, and similar structures must be composed of similar symbols

• -> use of discretized internal coordinates (angles) as an alphabet: … or … angles

Page 8

Internal coordinates

Page 9

Internal coordinates to symbols

[Figure: a series of internal coordinate values is turned into a series of symbols by discretization.]

Absolute need for similarity (KMRC), not identity!
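A minimal sketch of the discretization, assuming angles in degrees and bins of 10° (the bin width is an assumption, not a value given on the slide). The name `angle_to_symbol` is invented for the example.

```python
def angle_to_symbol(angle_deg, bin_width=10.0):
    """Map an angle in degrees to the index of its bin on the circle."""
    wrapped = (angle_deg + 180.0) % 360.0   # 0 <= wrapped < 360, so -180 and 180 coincide
    return int(wrapped // bin_width)
```

Two nearly identical angles such as 59.9° and 60.1° fall on either side of a bin boundary and get different symbols, which is why matching on identical symbols would miss them.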

Page 10

Flexible patterns : KMRC

Finding flexible patterns of …-symbols:

Here, cliques of “symbols” are overlapping sets of angles.

Similarity is a critical point (identity would miss structural features).

[Figure: angle axis from -180° through 0° to 180°, covered by overlapping sets.]
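Overlapping angle sets can be sketched as sliding windows of consecutive bins on the circle. The sketch assumes 36 bins and windows of 3 bins; both numbers are illustrative, and the function names are invented.

```python
def angle_cliques(n_bins=36, width=3):
    """One clique per starting bin: `width` consecutive bins, wrapping around the circle."""
    return [frozenset((start + j) % n_bins for j in range(width))
            for start in range(n_bins)]


def share_a_clique(sym_a, sym_b, cliques):
    """True when some clique contains both symbols."""
    return any(sym_a in c and sym_b in c for c in cliques)
```

Each symbol then belongs to exactly `width` cliques, so the degeneracy is 3 here. Bins 35 and 0 share a clique through the wrap-around at ±180°, while bins 0 and 3 do not.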

Page 11

Flexible patterns : KMRC

3 cytochromes P450: TERP, BM3, CAM

PMWIATKHADVMQIGVTRYLSSQRLIKEACGHWIATRGQLIREAY

PTHTAYRGLTLNWFQPASIRKLEENIRRIAQASVQRKNWKKAHNILLPSFSQQAMKGYHAMMVDIAVQLVQKPEQRQFRALANQVVGMPVVDKLENRIQELACSLIES

CDFMTDCALYYPLHVVMTALGVPIEVPEDMTRLTLDTIGLCGFNYRCNFTEDYAEPFPIRIFMLLAGLP

EDDEPLMLKLTQDFITSMVRALDEAMNKEDIPHLKYLTDQMT

FHETIATFYDYFNGFTVDRRSFQEDIKVMNDLVDKIIADRKAFAEAKEALYDYLIPIIEQRRQ

CPKDDVMSLLANEQSDDLLTHMLNKPGTDAISIVAN

Page 12

Bach BWV846

Similarity : nature of elements OR relations between them

Page 13

Bach BWV846

Similarity: series of notes with the same pitch? -> Rather, transposed series…

Page 14

Bach BWV846

-> Similarity of relations between elements

Page 15

Relational patterns (KMRR)

• Now, a similar pattern is not defined by a succession of similar symbols, but instead by a succession of elements that share the same relationships between them.

[Figure: pattern m drawn as three elements linked by the relations r12, r23 and r13; its occurrences show the same relations.]

Examples of relations: “to be higher than”, “to be lower than”, “to be equal to”, …
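A relational pattern can be sketched by replacing each window with the relations between all pairs of its elements. The sketch below is a naive enumeration, not the KMRR algorithm itself. The names are invented, and the pitch numbers used in the usage note are made up rather than taken from BWV 846.

```python
from collections import defaultdict


def relation(x, y):
    """How y stands relative to x."""
    if y > x:
        return "higher"
    if y < x:
        return "lower"
    return "equal"


def relational_signature(window):
    """Relations r12, r13, ..., r23, ... between every ordered pair of elements."""
    n = len(window)
    return tuple(relation(window[i], window[j])
                 for i in range(n) for j in range(i + 1, n))


def relational_repeats(seq, k):
    """Map each relational signature seen at least twice to its 0-based positions."""
    groups = defaultdict(list)
    for i in range(len(seq) - k + 1):
        groups[relational_signature(seq[i:i + k])].append(i)
    return {sig: pos for sig, pos in groups.items() if len(pos) > 1}
```

For the invented series `[60, 64, 67, 62, 65, 69]` and k = 3, the windows at positions 0 and 3 share the signature (higher, higher, higher) although they have no pitch in common. That is the transposition case.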

Page 16

One step further: flexible relational patterns (KMRR)

• The relations between elements do not need to be the same, they just need to be similar

[Figure: pattern m with relations ra, rb and rc, grouped into the relational cliques CR1 and CR2.]

Relational cliques (cliques of relations): {“to be higher”, “to be equal”}, {“to be lower”, “to be equal”}, …
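With the two relational cliques listed above, similarity between relations can be sketched as "some clique contains both". The helper names are invented, and comparing whole windows pair by pair is a naive stand-in for the KMRR construction.

```python
RELATION_CLIQUES = [{"higher", "equal"}, {"lower", "equal"}]


def relation(x, y):
    """How y stands relative to x."""
    return "higher" if y > x else "lower" if y < x else "equal"


def similar_relations(r1, r2):
    """True when at least one relational clique contains both relations."""
    return any(r1 in clique and r2 in clique for clique in RELATION_CLIQUES)


def similar_windows(w1, w2):
    """True when every pair of elements is related similarly in both windows."""
    if len(w1) != len(w2):
        return False
    n = len(w1)
    return all(similar_relations(relation(w1[i], w1[j]), relation(w2[i], w2[j]))
               for i in range(n) for j in range(i + 1, n))
```

Here "higher" and "equal" are similar, "lower" and "equal" are similar, but "higher" and "lower" are not. So `[1, 2, 3]` matches `[1, 1, 2]` but not `[3, 2, 1]`.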

Page 17

An example: application to 3D structures

Relations are defined on discrete internal distances between points:
r(i,j) = r_k if and only if r_k ≤ dist(i,j) < r_k + ∆

The relations r(i,j) and r(i’,j’) are considered similar if they belong to the same subset {r_k, r_{k+1}, r_{k+2}}, i.e. if
| r(i,j) - r(i’,j’) | ≤ 2

This implies for Euclidean distances:
| dist(i,j) - dist(i’,j’) | < 3∆

[Figure: distance axis divided into classes r1, r2, r3, …, rk, rk+1, rk+2, with d(i,j) and d(i’,j’) marked.]
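The distance classes and the resulting bound can be sketched in Python. The sketch assumes that class k covers the interval [k∆, (k+1)∆), i.e. that the classes start at zero, which the slide does not state. The function names are invented.

```python
import math


def distance_class(p, q, delta):
    """Index k of the class such that k*delta <= dist(p, q) < (k + 1)*delta."""
    return int(math.dist(p, q) // delta)


def similar_classes(r1, r2, max_gap=2):
    """Two relations are similar when their class indices differ by at most max_gap."""
    return abs(r1 - r2) <= max_gap
```

With ∆ = 1, distances 3.5 and 5.9 fall in classes 3 and 5, which are similar, and they differ by 2.4, below 3∆. In general two distances in classes at most 2 apart can differ by at most one class width more than the gap, hence the strict bound 3∆.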

Page 18

Application to 3D structures:

Relational cliques (defined on distances): 1 2 3 4 5 6 7 8 9 10 …

[Figure: points p1 to p9; the distance class of each listed pair is written on the edge joining the two points.]

r(p1,p2)=3  r(p1,p3)=4  r(p1,p4)=5  r(p2,p3)=4  r(p2,p4)=3  r(p3,p4)=3

r(p6,p7)=4  r(p6,p8)=3  r(p6,p9)=6  r(p7,p8)=4  r(p7,p9)=4  r(p8,p9)=2
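The two point sets on this slide can be compared with a few lines of Python. The sketch assumes the correspondence p1→p6, p2→p7, p3→p8, p4→p9 and borrows the rule from the previous slide that two relations are similar when their classes differ by at most 2. Both are assumptions made for this illustration.

```python
# Distance classes copied from the slide.
R_FIRST = {("p1", "p2"): 3, ("p1", "p3"): 4, ("p1", "p4"): 5,
           ("p2", "p3"): 4, ("p2", "p4"): 3, ("p3", "p4"): 3}
R_SECOND = {("p6", "p7"): 4, ("p6", "p8"): 3, ("p6", "p9"): 6,
            ("p7", "p8"): 4, ("p7", "p9"): 4, ("p8", "p9"): 2}

# Assumed correspondence between the two sets of points.
MAPPING = {"p1": "p6", "p2": "p7", "p3": "p8", "p4": "p9"}


def relational_match(first, second, mapping, max_gap=2):
    """True when every pair of points has similar distance classes in both sets."""
    return all(abs(r - second[(mapping[a], mapping[b])]) <= max_gap
               for (a, b), r in first.items())
```

Every pair differs by at most one class, so the two sets match as a flexible relational pattern. Requiring identical classes (`max_gap=0`) would reject them.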

Page 19

Application to 3D structures:

Ex: multiple structural alignment of 4 pectate and pectin lyases: 1PCL, 1IDJ, 2BSP, 1PLU

1PCL: AEWDAAVIDNSTNVWVDHVT
1IDJ: WGGDAITLDDCDLVWIDHVT
2BSP: SQYDNITINGGTHIWIDHCT
1PLU: KDGDMIRVDDSPNVWVDHNE

1PCL: LRVTFHNNVFDRVTERAPRV
1IDJ: DLVTMKGNYIYHTSGRSPKV
2BSP: LKITLHHNRYKNIVQKAPRV
1PLU: RNITYHHNYYNDVNARLPLQ

1PCL: TERAPRVRFGSIHAYNNVYL
1IDJ: GRSPKVQDNTLLHCVNNYFY
2BSP: VQKAPRVRFGQVHVYNNYYE
1PLU: NARLPLQRGGLVHAYNNLYT

1PCL: AQTMTSSLATSINNNAGYGK
1IDJ: SASAYTSVASRVVANAGQGN
2BSP: SIDASANVKSNVINQAGAGK
1PLU: SPVSAQCVKDKLPGYAGVGK

Page 20

Structural similarity search in databases: YAKUSA

[Figure: the query is compared against the database (PDB) by fast structural scanning.]

Page 21

Structural similarity search in databases: YAKUSA

Structures encoded with angles

For each database protein:
1 - find structurally similar seeds with the query
2 - extend seeds to longer structural matches

Then rank the structural hits

Page 22

Seeds of length k

All overlapping k-patterns of the query -> automaton

[Figure: dictionary/automaton built from the query k-patterns over the symbols 1 and 2.]

Leaf -> one seed

Advantage: no moving backward in the database string; less moving backward in the patterns
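Seed lookup can be sketched with a hash-based dictionary of the query k-patterns. This is a simplification swapped in for the automaton on the slide: it gives the same exact seeds but re-reads each window of the database string, so it lacks the "no moving backward" property. The names are invented and positions are 0-based.

```python
from collections import defaultdict


def build_seed_index(query, k):
    """Dictionary from each k-pattern of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[tuple(query[i:i + k])].append(i)
    return index


def find_seeds(query, target, k):
    """All (query_pos, target_pos) pairs where the two k-patterns are identical."""
    index = build_seed_index(query, k)
    seeds = []
    for j in range(len(target) - k + 1):
        for i in index.get(tuple(target[j:j + k]), ()):
            seeds.append((i, j))
    return seeds
```

For the symbol strings `query = [1, 2, 2, 1]` and `target = [2, 2, 1, 2, 2]` with k = 2, the seeds are `[(1, 0), (2, 1), (0, 2), (1, 3)]`.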

Page 23

Query seeds automaton

[Figure: dictionary/automaton of the query seeds, shown next to its version with degeneracy, which also accepts similar patterns (symbols i and i’ with dc(i, i’) below a threshold).]

Page 24

Matching seeds to SHSP

[Figure: seeds between the query and a database structure, and the SHSP built around a seed.]

- many seeds, giving only a few SHSPs

- SHSP = maximal scoring region around the seed, the scores being based upon the angular differences

- probability associated with SHSP (MTD: pair approximation of Markov model)
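Extending a seed into a maximal scoring region can be sketched as an ungapped extension to the left and to the right. The scoring function below, positive when two angles are within a tolerance and increasingly negative beyond it, is an assumption made for the illustration and is not the YAKUSA score. The probability step is not covered.

```python
def circular_diff(a, b):
    """Smallest difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def pair_score(a, b, tolerance=30.0):
    """Assumed score: 1 for identical angles, 0 at the tolerance, negative beyond."""
    return 1.0 - circular_diff(a, b) / tolerance


def extend_seed(query, target, i, j, k, tolerance=30.0):
    """Best ungapped region containing the seed: (score, q_start, t_start, length)."""
    score = sum(pair_score(query[i + o], target[j + o], tolerance) for o in range(k))
    # Extend to the right, remembering the best cumulative gain.
    best_r, run, r_len, o = 0.0, 0.0, 0, k
    while i + o < len(query) and j + o < len(target):
        run += pair_score(query[i + o], target[j + o], tolerance)
        if run > best_r:
            best_r, r_len = run, o - k + 1
        o += 1
    # Extend to the left in the same way.
    best_l, run, l_len, o = 0.0, 0.0, 0, 1
    while i - o >= 0 and j - o >= 0:
        run += pair_score(query[i - o], target[j - o], tolerance)
        if run > best_l:
            best_l, l_len = run, o
        o += 1
    return score + best_l + best_r, i - l_len, j - l_len, l_len + k + r_len
```

Several overlapping seeds on the same diagonal extend to the same region, which is one way to see why many seeds give only a few SHSPs.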

Page 25

SHSPs to results…

Database structures found, ranked by score

http://bioserv.rpbs.jussieu.fr/Yakusa

40 seconds for scanning 11000 PDB structures

Page 26

Example of output:

[Figure: SHSPs found between the query (Deoxyribonuclease I) and a database structure (Heat-Shock protein 70).]

http://bioserv.rpbs.jussieu.fr/Yakusa

Page 27

Multiple structural alignment: « m-diagonals »

First step: finding pair “diagonals” (at the angle level)

[Figure: dot plot of pairwise matches; horizontal axis: index in first protein; vertical axis: index in second protein.]

Selection of best 2-diagonals

Page 28

Multiple structural alignment: « m-diagonals »

Second step: combination of 2-diagonals into m-diagonals (in m dimensions).

[Figure: three axes labelled Protein 1, Protein 2 and Protein 3.]

m-diagonal in m dimensions: here, the 3-diagonal is the combination of three 2-diagonals of dimension 2.

Page 29

Multiple structural alignment: « m-diagonals »

Second step: combination of 2-diagonals into m-diagonals (in m dimensions).

[Figure: three axes labelled Protein 1, Protein 2 and Protein 3, with the values 2, 3 and 5 marked.]

m-diagonal in m dimensions: here, the 3-diagonal is the combination of three 2-diagonals of dimension 2.

Page 30

Multiple structural alignment: « m-diagonals »

Second step: combination of 2-diagonals into m-diagonals (in m dimensions).

“Column” graphs:
- a residue is a node
- if two residues are in a 2-diagonal, they are connected by a link.

[Figure: three axes labelled Protein 1, Protein 2 and Protein 3, with the values 2, 3 and 5 marked.]

Selection of best m-diagonals (most connected ones)
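The column graph can be sketched in Python. A 2-diagonal is represented here as a tuple (protein A, start in A, protein B, start in B, length), which is an assumed encoding. Columns are taken as connected components, and "most connected" is illustrated by testing whether a column is fully connected, which is one possible reading of the selection criterion.

```python
from collections import defaultdict


def column_graph(diagonals):
    """Link the residues paired by each 2-diagonal; nodes are (protein, index)."""
    links = defaultdict(set)
    for prot_a, start_a, prot_b, start_b, length in diagonals:
        for o in range(length):
            u, v = (prot_a, start_a + o), (prot_b, start_b + o)
            links[u].add(v)
            links[v].add(u)
    return links


def columns(links):
    """Connected components of the column graph, each a set of residues."""
    seen, result = set(), []
    for start in links:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            u = stack.pop()
            component.add(u)
            for v in links[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        result.append(component)
    return result


def is_complete(column, links):
    """True when every residue of the column is linked to all the others."""
    return all(len(links[u] & column) == len(column) - 1 for u in column)
```

With three 2-diagonals of length 2 starting at residue 2 of protein 1, residue 3 of protein 2 and residue 5 of protein 3, the graph has two columns of three residues each, both fully connected. Together they form a 3-diagonal of length 2.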

Page 31

Example: 13 cytochromes P450

- In blue: non-aligned parts

- Other colors: m-diagonals

Page 32

Other method : «  Gibbs sampling »

« Taboo search »

Page 33

Alain Viari

Henry Soldano

Mathilde Carpentier

Marie-France Sagot

Pascale Jean

Sophie Brouillet

Nadia Pisanti

Kmrc+gok

Cytochromes P450

Yakusa

m-diagonals

Gibbs-taboo

relational patterns

Near future: classification of structural cores in PDB…

Page 34

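[Figure: KMR stack construction on the string a a b a a b a (positions 1 2 3 4 5 6 7), with work arrays labelled P1, V1, P2, V2 and Q2.]

k=1 (k-length stacks): a -> 7 5 4 2 1 ; b -> 6 3

Look at +k and put the position in the corresponding set stack: _a -> 1 4 ; 6 3 and _b -> 2 5

k=2 (2k-length stacks): aa -> 4 1 ; ab -> 5 2 ; ba -> 6 3

Positions 1 to 6 labelled at k=2: aa ab ba aa ab ba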