NIGMS Protein Structure Initiative: Target Selection Workshop ADDA and remote homologue detection Liisa Holm Institute of Biotechnology University of Helsinki


Post on 14-Jan-2016


Page 1: NIGMS Protein Structure Initiative:  Target Selection Workshop ADDA and remote homologue detection

NIGMS Protein Structure Initiative: Target Selection Workshop

ADDA and remote homologue detection

Liisa Holm

Institute of Biotechnology, University of Helsinki

Page 2

Definitions

• Nrdbxx = nrdb where no two sequences are more than xx % identical; redundant sequences are mapped to a representative
– Uniprot + Genpept + PIR + PDB + …
– Nrdb100 – Nrdb90 – … – Nrdb40 – Nrdb30 = “modeling family”

• PairsDB = database of all-against-all comparisons
– Blast in nrdb90, PSI-Blast in nrdb40

• BIG = family detected by profile comparison
– Profile needs seed set (alignment); automatic iterative profile construction has poor convergence
– Profiles → partially overlapping neighbour sets → need to cluster sequences → clustering artefacts when true cluster shape is non-spherical
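The Nrdbxx definition can be illustrated with a minimal sketch of greedy redundancy reduction. This is not the pipeline used to build nrdb: the function names `percent_identity` and `reduce_redundancy` are hypothetical, and the identity measure is a toy (ungapped, left-aligned) stand-in for a real alignment-based identity.

```python
from typing import Callable, Dict, List


def percent_identity(a: str, b: str) -> float:
    """Toy identity (assumption): ungapped, left-aligned matches over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def reduce_redundancy(
    seqs: Dict[str, str],
    threshold: float,
    identity: Callable[[str, str], float] = percent_identity,
) -> Dict[str, str]:
    """Map each sequence id to a representative id.

    No two representatives are more than `threshold` % identical;
    every other sequence is mapped to the first representative it exceeds
    the threshold with.
    """
    representatives: List[str] = []
    mapping: Dict[str, str] = {}
    # Longest sequences first, so that they become the representatives.
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in representatives:
            if identity(seqs[sid], seqs[rep]) > threshold:
                mapping[sid] = rep
                break
        else:
            representatives.append(sid)
            mapping[sid] = sid
    return mapping


seqs = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQL", "c": "GGGGGGGGGG"}
nrdb90 = reduce_redundancy(seqs, 90.0)  # "a" and "b" are exactly 90 % identical: both kept
nrdb40 = reduce_redundancy(seqs, 40.0)  # "b" is mapped to "a"
```

Lowering the threshold from 90 to 40 merges "b" into the representative "a", which is the sense in which Nrdb40 is a coarser set than Nrdb90.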

Page 3

(graph) covering ≠ clustering ≠ classification

• Incomplete detection of homologous set by profile models
• Example: Urease et al. superfamily

[Figure: two panels labelled IDEAL and REAL]
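The distinction between a covering and a clustering can be made concrete with a small sketch. The data are invented for illustration: each hypothetical profile model detects only part of a five-member family, so the neighbour sets together cover the family but overlap, and therefore do not form a partition.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set


def is_cover(neighbour_sets: Dict[str, Set[str]], universe: Iterable[str]) -> bool:
    """True if every member of the universe is in at least one neighbour set."""
    return set().union(*neighbour_sets.values()) >= set(universe)


def is_partition(neighbour_sets: Dict[str, Set[str]], universe: Iterable[str]) -> bool:
    """True if the neighbour sets cover the universe and are pairwise disjoint."""
    sets = list(neighbour_sets.values())
    disjoint = all(a.isdisjoint(b) for a, b in combinations(sets, 2))
    return disjoint and is_cover(neighbour_sets, universe)


def missed_by_each(neighbour_sets: Dict[str, Set[str]], universe: Iterable[str]) -> Dict[str, List[str]]:
    """Family members that each seed's profile model fails to detect."""
    return {seed: sorted(set(universe) - members) for seed, members in neighbour_sets.items()}


family = {"s1", "s2", "s3", "s4", "s5"}
# Hypothetical neighbour sets found by profiles seeded at s1 and s4.
neighbour_sets = {"s1": {"s1", "s2", "s3"}, "s4": {"s3", "s4", "s5"}}

covering = is_cover(neighbour_sets, family)       # True
partition = is_partition(neighbour_sets, family)  # False: s3 is in both sets
missed = missed_by_each(neighbour_sets, family)
```

Turning such a covering into families requires an extra clustering step, which is where the artefacts mentioned on the previous slide can enter.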

Page 4

ADDA: clustering of domains into families

• ADDA = Automatic Domain Definition Algorithm
– Heger & Holm (2003) J Mol Biol 328, 749-767.
– Heger et al. (2005) Nucl. Acids Res. 33 Database Issue, D188-D191.

• Principles of ADDA
– Blast all-against-all comparison in nrdb90
– Domains are optimally covered by alignments
  • Complete domain coverage; every residue belongs to a domain
– Minimum/maximum spanning tree of domains
– Remove links where profile-profile score is below threshold
– Connected components are domain families

• Quality assessment
– Most ADDA families are pure, containing one PFAM family or SCOP superfamily (plus previously unclassified members)
– Occasionally members from different PFAM families are merged in one ADDA family (contamination or PFAM misclassification)
– Domain size distribution is reasonable
  • For example, much less over-fragmentation than by Prodom algorithm
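The last three principles (spanning tree, removal of weak links, connected components) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ADDA implementation: the edge weights, the `profile_score` callback, the threshold value and all function names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, float]


class DisjointSet:
    """Union-find with path halving."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[rb] = ra
        return True


def maximum_spanning_forest(edges: Iterable[Edge]) -> List[Edge]:
    """Kruskal's algorithm, taking the highest-weight edges first."""
    ds = DisjointSet()
    return [e for e in sorted(edges, key=lambda e: -e[2]) if ds.union(e[0], e[1])]


def domain_families(
    domains: Iterable[str],
    edges: Iterable[Edge],
    profile_score: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Keep tree links whose profile-profile score passes; return connected components."""
    domains = list(domains)
    ds = DisjointSet()
    for u, v, _ in maximum_spanning_forest(edges):
        if profile_score(u, v) >= threshold:
            ds.union(u, v)
    families: Dict[str, Set[str]] = {}
    for d in domains:
        families.setdefault(ds.find(d), set()).add(d)
    return sorted(sorted(f) for f in families.values())


# Invented example: alignment-based edge weights and profile-profile scores.
edges = [("d1", "d2", 50.0), ("d2", "d3", 40.0), ("d1", "d3", 10.0), ("d3", "d4", 30.0)]
scores = {("d1", "d2"): 0.9, ("d2", "d3"): 0.2, ("d3", "d4"): 0.8}
score = lambda u, v: scores.get((u, v), scores.get((v, u), 0.0))
families = domain_families(["d1", "d2", "d3", "d4"], edges, score, threshold=0.5)
```

Restricting the expensive profile-profile checks to tree links keeps their number linear in the number of domains, which is one plausible reason for building the tree first.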

Page 5

ADDA purity and domain size

[Figure: plots of relative frequency / % and cumulative frequency / % against overlap or relative size / % (series: <=10, <=100, <=1000, <=10000); plots of selectivity / % and absolute frequency against sensitivity / %, for PFAM and SCOP]

Accuracy of domain boundaries
– Red: best possible in domain tree
– Black: actually selected

Page 6

3D coverage of model proteomes

• PDB entries from May 2006
– Required greater than 80 % overlap between PDB sequence and ADDA domain to call family structurally covered

• ADDA domain families
– BIG families
  • 28429 families have more than ten members in nrdb100
    – 2383 structurally covered BIG families
  • 8820 families have more than ten members in nrdb40
    – 1869 structurally covered BIG families

• NCBI genome sets
– H sapiens, C elegans, D melanogaster, A thaliana, E coli, B anthracis, T maritima
– Mapped to ADDA families
  • 6770 BIG(nrdb40) families occur in model genome set
    – 1705 structurally covered
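The 80 % overlap criterion can be sketched as a simple interval test. The slide does not say which length the overlap is measured against; the sketch below assumes it is the fraction of the ADDA domain covered by the PDB sequence, and the data layout and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Range = Tuple[int, int]  # half-open residue interval [start, end)


def overlap_fraction(pdb_range: Range, domain_range: Range) -> float:
    """Fraction of the domain covered by the PDB match (assumed definition)."""
    start = max(pdb_range[0], domain_range[0])
    end = min(pdb_range[1], domain_range[1])
    shared = max(0, end - start)
    length = domain_range[1] - domain_range[0]
    return shared / length if length > 0 else 0.0


def structurally_covered_families(
    domains: Dict[str, Tuple[str, str, Range]],
    pdb_hits: Dict[str, List[Range]],
    min_overlap: float = 0.8,
) -> Set[str]:
    """A family counts as covered if any member domain overlaps a PDB match by more than min_overlap.

    domains:  domain id -> (sequence id, family id, residue range)
    pdb_hits: sequence id -> residue ranges matched by PDB sequences
    """
    covered: Set[str] = set()
    for seq_id, family, dom_range in domains.values():
        if any(overlap_fraction(hit, dom_range) > min_overlap for hit in pdb_hits.get(seq_id, [])):
            covered.add(family)
    return covered


# Invented example data.
domains = {
    "d1": ("P1", "famA", (0, 100)),
    "d2": ("P1", "famB", (100, 250)),
    "d3": ("P2", "famC", (0, 80)),
}
pdb_hits = {"P1": [(0, 90), (100, 200)], "P2": [(10, 70)]}
covered = structurally_covered_families(domains, pdb_hits)
```

Here only famA passes: its domain is 90 % covered, while famB (about 67 %) and famC (75 %) fall short of the strict "greater than 80 %" requirement.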

Page 7

Model genome coverage – 28429 BIG families in nrdb100

[Figure: six bar charts over the model genomes hs, ce, dm, at, ec, ba, tm. Three panels titled "ADDA domain coverage of model genomes" (number of domains; percentage of domains; percentage of domains in BIG families) and three titled "ADDA family coverage of model genomes" (number of families; percentage of families; percentage of BIG families). Series: domains_small / families_small, domains_big_white / families_big_white, domains_big_3D / families_big_3D]

T. maritima would be covered by 1000 BIG families and is two thirds done

Page 8

[Figure: four bar charts over the model genomes hs, ce, dm, at, ec, ba, tm. Two panels titled "ADDA domains" (absolute counts; percentage of BIG domains (nrdb40)) and two titled "ADDA families" (number of families; percentage of BIG families (nrdb40)). Series: domains_small / families_small, domains_big_white / families_big_white, domains_big_3D / families_big_3D]

6770 BIG families in nrdb40

Multigene families in eukaryotes

2.0-2.4 domains per euk. gene; 1.3 domains per prok. gene

Page 9

Seven model genomes

[Figure: two three-set Venn diagrams over Human; Worm, fly, plant; and Prokaryotes (E coli, B anthracis, T maritima). The assignment of counts to individual regions is not recoverable from the transcript.]

5065 white BIG target families; region counts: 638, 2412, 29, 187, 802, 161, 836

Human BIG target families are almost exclusively eukaryote-specific

1705 structurally covered BIG families; region counts: 57, 535, 18, 588, 88, 264, 155

Universal BIG families are almost covered
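As a consistency check on the two Venn diagrams, the seven region counts of each diagram can be summed and compared with the stated totals. Which diagram each group of seven numbers belongs to is inferred from the order in the transcript; the variable names are illustrative.

```python
# Region counts as they appear in the transcript (region labels unknown).
white_regions = [638, 2412, 29, 187, 802, 161, 836]
covered_regions = [57, 535, 18, 588, 88, 264, 155]

total_white = sum(white_regions)          # white BIG target families
total_covered = sum(covered_regions)      # structurally covered BIG families
total_big = total_white + total_covered   # BIG(nrdb40) families in the model genome set

fraction_covered = total_covered / total_big

print(total_white, total_covered, total_big)
print(f"{100 * fraction_covered:.1f} % of BIG families structurally covered")
```

The sums reproduce the 5065 white and 1705 covered families, which add up to the 6770 BIG(nrdb40) families quoted earlier, so about a quarter of these families had a structure.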

Page 10

Covering all modelling families will have astronomical cost

• Nrdbxx updates; Nrdb30 = “modelling family”

[Figure: chart with axis ticks at 100000, 1000000 and 10000000 and categories v1, v2, v3, v4]

Page 11

Fine-grained coverage

• MF: Structural core shrinks rapidly below 30 % sequence identity → need less naïve modelling software capable of building ab initio those parts which are not covered by template
– Misalignment is major source of error → transitive alignment covers more of the structurally equivalent core

Average coverage of structural core (152 pairs in 11 superfamilies):
Transitive                    51 %
Global alignment (HMMer)      43 %
Local alignment (PSI-Blast)   34 %

Error         Rmsd / Å
Template      32
Misaligned    16
Loops         8
Backbone      4
Rotamers      2
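The idea behind transitive alignment can be sketched by composing two pairwise alignments through an intermediate sequence and measuring how much of a reference structural core the result recovers. This is a deliberately minimal illustration with a single intermediate and invented data; the method behind the figures on the slide is not described in the transcript, and the function names are hypothetical.

```python
from typing import Dict

Alignment = Dict[int, int]  # residue index in the first sequence -> index in the second


def compose(a_to_b: Alignment, b_to_c: Alignment) -> Alignment:
    """Transitive alignment A -> C through the intermediate sequence B."""
    return {i: b_to_c[j] for i, j in a_to_b.items() if j in b_to_c}


def core_coverage(predicted: Alignment, core: Alignment) -> float:
    """Percentage of structurally equivalent residue pairs that the alignment reproduces."""
    if not core:
        return 0.0
    correct = sum(1 for i, k in core.items() if predicted.get(i) == k)
    return 100.0 * correct / len(core)


# Invented example: the direct A-C alignment is short, the route via B is longer.
core = {0: 1, 1: 2, 2: 3, 3: 5}     # reference pairs from structure comparison
direct = {0: 1, 1: 2}               # direct sequence alignment of A and C
a_to_b = {0: 0, 1: 1, 2: 2, 3: 4}
b_to_c = {0: 1, 1: 2, 2: 3, 3: 4}

transitive = compose(a_to_b, b_to_c)
direct_cov = core_coverage(direct, core)
transitive_cov = core_coverage(transitive, core)
```

In this toy case the transitive route recovers three of the four core pairs against two for the direct alignment, mirroring the direction (not the magnitude) of the difference reported above.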

Page 12

Coarse-grained coverage

• BIG/BIGGER: Homology detection
– Difficulty of aligning remote homologues → shared sequence motifs suggest conserved biochemical mechanism → functional classification
– Sequence comparison only detects half of remote homologue pairs → structure comparison reveals missing links

[Figure: reliability (axis 30 to 100) against coverage (axis 0 to 0.6) for two series: our method, profile-profile]

Transitive search for conserved motifs detects more remote homologues than profile-profile comparison
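A reliability-versus-coverage comparison of this kind can be computed from a ranked list of predicted homologue pairs. The definitions below are assumptions, since the transcript does not give them: coverage is taken as the fraction of all true pairs found so far, reliability as the percentage of predictions so far that are true, and the error rate as 100 minus reliability. Tied scores are not treated specially.

```python
from typing import List, Sequence, Tuple

ScoredPair = Tuple[float, bool]  # (score, is a true homologue pair)


def reliability_coverage_curve(
    scored_pairs: Sequence[ScoredPair], n_true_pairs: int
) -> List[Tuple[float, float]]:
    """(coverage, reliability %) after each prediction, best score first."""
    curve = []
    tp = fp = 0
    for _, is_true in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        if is_true:
            tp += 1
        else:
            fp += 1
        curve.append((tp / n_true_pairs, 100.0 * tp / (tp + fp)))
    return curve


def coverage_at_error_rate(
    scored_pairs: Sequence[ScoredPair], n_true_pairs: int, max_error_pct: float
) -> float:
    """Highest coverage reached while the error rate stays within max_error_pct."""
    best = 0.0
    for coverage, reliability in reliability_coverage_curve(scored_pairs, n_true_pairs):
        if 100.0 - reliability <= max_error_pct:
            best = max(best, coverage)
    return best


# Invented predictions; 10 true pairs exist in total.
pairs = [(0.9, True), (0.8, True), (0.7, True), (0.6, False), (0.5, True)]
cov_5 = coverage_at_error_rate(pairs, 10, 5.0)
cov_20 = coverage_at_error_rate(pairs, 10, 20.0)
```

Two methods can then be compared at a fixed error rate, or by plotting both curves on the same axes.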

Page 13

Clustering PFAM families

• Comparison of ADDA to PFAM-A resulted in extension but no discovery of completely new large families

• PFAM-A v.19: 7340 families, 2451 covered according to PFAM’s assignments, 1396 families in 205 clans

• Our method achieved 30 % coverage of clan relationships at 5 % error rate compared to 23 % coverage at 5 % error rate by profile-profile comparison
– 1083 unclassified PFAMs linked to 205 known clans
  • 1219 white PFAMs linked to known structure in 155 clusters
– 1256 PFAMs clustered in 470 predicted clans
  • 336 white PFAMs linked to known structure in 222 clusters
– 3610 PFAMs remained singletons
  • 2352 white PFAMs

2451 covered, ~1555 fold assignments, ~3334 targets
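The summary line can be reproduced from the counts above under one reading of the slide, which the transcript does not state explicitly: that the fold assignments are the white PFAMs linked to known structure in the two cluster groups, and that the targets are the remaining white PFAMs. Variable names are illustrative.

```python
pfam_total = 7340        # PFAM-A v.19 families
pfam_covered = 2451      # covered according to PFAM's assignments

white_in_known_clans = 1219      # white PFAMs linked to known structure in 155 clusters
white_in_predicted_clans = 336   # white PFAMs linked to known structure in 222 clusters

white_total = pfam_total - pfam_covered                           # families without structure
fold_assignments = white_in_known_clans + white_in_predicted_clans
targets = white_total - fold_assignments                          # still needing a structure

print(white_total, fold_assignments, targets)
```

Under this reading the numbers match the slide exactly: 1555 fold assignments and 3334 remaining targets.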

Page 14

Conclusions

• ADDA: ~3000 human target families
– ~40 % reduction in number of PFAM target families by fold assignment (based on sequence only)

• Coarse-grained coverage yields information out of reach to sequence comparison
– Need to improve measures of sequence similarity to infer homology
  • Sequence motif-based functional classification
– Need to increase the radius of convergence in template-based structure prediction

• Protein complexes: hypothesis-driven research
– Large conformational changes
– Multigene receptor-ligand pair discrimination involves rotations in docking orientation

Page 15

Acknowledgements

• Andreas Heger, Oxford University

• Swapan Mallick, Ashwin Sivakumar, Chris Wilton, Institute of Biotechnology

• Funding: Academy of Finland, Sigrid Juselius Foundation, EU