comprehensive strategy for integrated target selection in structural genomics

NIH-PSI Target Selection, Nov 13-14, 2003 © Burkhard Rost (Columbia New York)

Comprehensive strategy for integrated target selection in structural genomics

Burkhard Rost, CUBIC, Columbia University

http://cubic.bioc.columbia.edu/mis/talks/

http://cubic.bioc.columbia.edu


Comprehensive strategy for integrated target selection

Our research goal and current reality
Unit: sequence-structure families
Goals: cover entire families with good models
STAGE 1: CHOP + CLUP + filtering -> novel automatic organization of sequence-structure space
STAGE 2: Refined, manual selection -> model all family members? stop-work/hold-work?
STAGE 3: Explore experimental structure

Answers and perspectives
How many structures needed for completion?
Euka-proka-archae: overlap?
Why collaborate on targets?
Multiplexing helpful?
High-throughput protein production in eukaryotes?


Computational biology & bioinformatics


Sequence-structure family

Sequence-structure family U’

Sequence-structure family U


EVA: comparative modelling

V Eyrich, MA Marti-Renom, D Przybylski, A Fiser, F Pazos, A Valencia, A Sali & B Rost (2001) Bioinformatics 17, 1242-1243
MA Marti-Renom, MS Madhusudhan, A Fiser, B Rost, A Sali (2002) Structure 10, 435-440

Marc Marti-Renom & Andrej Sali (UCSF)
http://eva.compbio.ucsf.edu/~eva/cm/

http://cubic.bioc.columbia.edu/eva

Figure: accuracy and coverage, cumulative distribution (PSI-BLAST 10^-3)


How to decide when we exclude/include?

C Sander & R Schneider 1991 Proteins, 9, 56-68
B Rost 1999 Prot Engng, 12, 85-94


Scooping families from proteomes, in practice

Problems:
• domains
• overlaps

Figure (axis 0-100 against number of residues aligned, 0-250): one region labeled "Sequence identity implies structural similarity!", the other labeled "Don't know region".


Choose targets: single-linkage clustering

Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press
Liu & Rost 2003 Proteins, submitted

~100,000 eukaryotic proteins

(yeast, fly, worm, weed, human)

22 112 clusters

46 318 in largest cluster

NONSENSE!

Conclusions:
• NO clustering of full-length proteins
• have to chop into structural-domain-like fragments
(single-linkage DOES work on PrISM)
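Why single-linkage collapses on full-length proteins can be shown with a toy sketch. The protein names and similarity links below are hypothetical, and the union-find routine is a generic illustration, not the clustering code used for the numbers on this slide. A two-domain protein links two otherwise unrelated proteins into one cluster; after chopping, the fragments separate cleanly.

```python
from collections import defaultdict


def single_linkage(items, links):
    """Group items so that any two linked items end up in the same cluster."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for item in items:
        clusters[find(item)].add(item)
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical data: P1 has only domain A, P3 only domain B,
# P2 has both domains and is therefore similar to P1 and to P3.
full_length = single_linkage(["P1", "P2", "P3"], [("P1", "P2"), ("P2", "P3")])

# After chopping P2 into P2_A and P2_B, links join same-domain fragments only.
fragments = single_linkage(
    ["P1", "P2_A", "P2_B", "P3"], [("P1", "P2_A"), ("P2_B", "P3")]
)

print(len(full_length), len(fragments))
```

On real proteomes the same chaining effect, repeated over many multi-domain proteins, is one plausible reading of the 46 318-member cluster above.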


CHOP proteins into structural domains

Liu & Rost 2003 Proteins, submitted


CHOP: dissection of proteins into domains

Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press
Liu & Rost 2003 Proteins, submitted

Single-domain proteins:
61% in PDB
28% in 62 proteomes

Average domain length
• in proteins ≥ 2 domains: ~100 residues
• in proteins with 1 domain: 1.7-3 times longer


To take or not to take

Take if > 50 globular residues and no known 3D
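The rule above can be written as a one-line predicate. This is a minimal sketch: the `Fragment` record and its field names are assumptions made for illustration, and only the threshold of 50 globular residues and the "no known 3D" condition come from the slide.

```python
from dataclasses import dataclass

MIN_GLOBULAR_RESIDUES = 50  # threshold stated on the slide


@dataclass
class Fragment:
    name: str               # hypothetical identifier
    globular_residues: int  # residues counted as globular (assumed precomputed)
    has_known_3d: bool      # True if a known structure already covers the fragment


def take(fragment: Fragment) -> bool:
    """Take a fragment if it has > 50 globular residues and no known 3D structure."""
    return fragment.globular_residues > MIN_GLOBULAR_RESIDUES and not fragment.has_known_3d


print(take(Fragment("frag_1", 120, False)))
```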


Structural residue coverage in reality (any)

J Liu & B Rost 2002 Bioinformatics, 18, 922-933

53% of residues

to do!

~28% ~19%


If you believe 53% is pessimistic ...

53% residue coverage today based on E-value 1!!


Clustering after CHOP

• 103 796 eukaryotic proteins (Yeast, Fly, Worm, Arabidopsis, Human/30)
  247 222 domain-like fragments
  167 717 no PDB (E-value 10^-1, HSSP-distance -3)
  44 718 not good 4 us (membrane, coil, SEG, NORS, signal peptide)
• 122 999 2 go
  95 330 non-singleton
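The counts on this slide are consistent with a simple filtering funnel: fragments without a PDB match, minus those unsuitable for structure determination, leave the "2 go" set. The sketch below only re-derives that arithmetic from the slide's numbers; the variable names are illustrative.

```python
# Counts taken from the slide.
fragments = 247_222      # domain-like fragments after CHOP
no_pdb = 167_717         # fragments without a PDB match
not_suitable = 44_718    # membrane, coil, SEG, NORS, signal peptide
non_singleton = 95_330   # remaining fragments that cluster with at least one other

to_go = no_pdb - not_suitable  # fragments left as candidate targets

share_no_pdb = no_pdb / fragments
share_to_go = to_go / fragments
share_non_singleton = non_singleton / to_go

print(to_go)
print(f"{share_no_pdb:.1%} without PDB match, {share_to_go:.1%} to go, "
      f"{share_non_singleton:.1%} of those non-singleton")
```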

Liu, Montelione & Rost 2003 Proteins, in press

Jinfeng Liu

21,000 fragment clusters


Main goal of Stage 2 analysis

Refine Stage 1 automatic target selection through manual sequence analysis

Concept: USE comparative modeling and structural features directly for refined target selection

For each sequence-structure family from Stage 1: predict the minimal set of experimental structures needed to build high-quality models for the entire family.
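One way to think about "minimal set of structures" is as a set-cover problem. The sketch below uses a greedy heuristic, which is an assumption for illustration and not the Stage 2 procedure itself. The member names and the `can_model` map (which members a structure of a given member would model well) are hypothetical. Greedy selection is not guaranteed to find the true minimum.

```python
def minimal_structure_set(family, can_model):
    """Greedy set cover: repeatedly pick the member whose structure would
    model the largest number of still-uncovered family members."""
    uncovered = set(family)
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(can_model), key=lambda m: len(can_model[m] & uncovered))
        gained = can_model[best] & uncovered
        if not gained:
            break  # remaining members cannot be modeled from any candidate
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical family of five members and assumed modeling coverage.
family = ["m1", "m2", "m3", "m4", "m5"]
can_model = {
    "m1": {"m1", "m2", "m3"},
    "m2": {"m1", "m2"},
    "m4": {"m4", "m5"},
    "m5": {"m5"},
}

chosen, left_over = minimal_structure_set(family, can_model)
print(chosen, left_over)
```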

Diana Murray, Cornell


Refinement protocol 4 new 3D

Input: PDB + NESG cluster

Toolbox:
1. Fold recognition and sequence-to-structure profiles
2. Comparative modeling (PrISM, Nest)
3. Structure evaluation tools (e.g. Verify3D)
4. Calculate biophysical properties

Recommend 2 do additional structure if:
1) NESG-cluster members poorly modeled
2) Biophysical properties of models incompatible with known function
3) Models suggest novel functionality

Target re-prioritization based on weekly PDB updates



Example of stop work recommendation

Target   Status
IR21     solved, PDB: 1MOS
ET28     Purified
JR15     Expressed
TT777    Expressed
GR7      Expressed
AR12     Cloned
WR204    Selected
XR4      Expressed

Stop work

SPINE/ZebaView

Experimental structure of IR21 yielded high-quality models for all members of this NESG sequence/structure family



NESG family: HR291 (99% identical to 1P9O), AR1731, HR2295, KR12, DR11 breaks into two clusters: A = (HR291, AR1731, HR2295) and B = (KR12, DR11)

Two structures required to cover family: predicted by Stage 2 analysis and verified by Stage 3 analysis

Figure: cluster A (HR291, AR1731, HR2295) and cluster B (KR12, DR11)

Recommendation: Solve structure of KR12 (purified)



Archaeal structure
NESG ID: GR2; PDB ID: 1QXF

Archaeoglobus fulgidus S27e protein has only archaeal and eukaryotic members.

Archaea and eukaryotes share a conserved hydrophobic motif (yellow).

Only eukaryotes have an N-terminal extension, and their models have strikingly different electrostatic properties.

Human protein recommended for structure determination!

Model suggests novel function: 30S ribosomal protein S27

Model for human homologue



Summary Stage 2 refinement

Statistics:

Many families currently under investigation

Hold work recommendation:
• family member at advanced experimental stage
• predicted to yield good models for entire family
-> hold-work for members at early exp. stages

re-assess once structure done!

Diana Murray, Cornell
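The hold-work rule can be sketched as a small decision function. The stage ordering, the choice of "Purified" as the cutoff for "advanced", and the `covers_family` flag (whether a member's structure is predicted to yield good models for the whole family) are all assumptions made for illustration; the slide does not define them.

```python
# Assumed pipeline order, loosely following the statuses shown in the target table.
STAGES = ["Selected", "Cloned", "Expressed", "Purified", "Solved"]
ADVANCED = "Purified"  # assumption: the slide does not say what counts as "advanced"


def recommend(status_by_target, covers_family):
    """Return {target: 'hold-work' or 'continue'} for one family."""
    rank = {stage: i for i, stage in enumerate(STAGES)}
    # Members at an advanced stage that are predicted to cover the family.
    leaders = [
        t for t, stage in status_by_target.items()
        if rank[stage] >= rank[ADVANCED] and covers_family.get(t, False)
    ]
    result = {}
    for target, stage in status_by_target.items():
        early = rank[stage] < rank[ADVANCED]
        if leaders and target not in leaders and early:
            result[target] = "hold-work"
        else:
            result[target] = "continue"
    return result


# Hypothetical family: t1 is purified and predicted to cover everything.
decision = recommend(
    {"t1": "Purified", "t2": "Cloned", "t3": "Expressed"},
    {"t1": True},
)
print(decision)
```

Re-running the function once the leading structure is solved corresponds to the "re-assess" step.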

families   targets   result
62         200+145   stop-work
40         110       hold-work
12         12        another 3D


Exploit structure to speculate about function

43 no previous annotation about function
(defined by ‘no publication in biological journal’)

39 analyzed

31 result in some predictions about function
• 8 clear success: functional annotation achieved, e.g. predicted active site based on structure; typically: confirmation of annotation transfer
• 23 some hints (16 ‘hypothetical proteins’), e.g. some clue about active site; mostly completely new!

8 no clue

Sharon Goldsmith & Barry Honig


Answers

How many structures needed for completion?

Euka-proka-archae: overlap?

Why collaborate on targets?

Multiplexing helpful?

High-throughput protein production in eukaryotes?



How many targets for prokaryotes + archae?

16,000 min 8,000 give:

72% fragments
72% proteins
67% residues


How many targets for euka-proka-archae?

8,000 give:
67% fragments
67% proteins
59% residues

BUT: 50% of residues remaining


Overlap between euka-proka-archae?

• surprisingly small overlap overall
• even lower for largest families
• most big families are eukaryotic!

~60% of fragments from eukaryotes have no sequence-structure family member from prokaryotes or archae

much higher for ‘largest 8,000’:
2,690 (34%) proka+archae only
4,277 (53%) euka only
1,033 (13%) mix
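The three counts for the ‘largest 8,000’ add up to 8,000, and the percentages follow directly. A short check, using only the numbers on the slide (the dictionary keys are just labels):

```python
# Family counts for the 'largest 8,000', as given on the slide.
counts = {"proka+archae only": 2690, "euka only": 4277, "mix": 1033}

total = sum(counts.values())
shares = {label: round(100 * n / total) for label, n in counts.items()}

print(total)
print(shares)
```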


Why collaborate on target list?

competition between consortia has already hampered success-rate considerably!

32% overlap


Does multiplexing help?

Date: 2003-07-28

Multiplex DOUBLES success rate!

~4%


Integrated strategy

NESG unique, comprehensive, integrated strategy optimized to organize sequence space in structural terms:

Stage 1: CHOP+CLUP+filter yields high success in focusing on sequence-structure families
Stage 2: detailed refinement embeds comparative models into selection and optimizes structural coverage for family
Stage 3: use experimental structure to increase structural family coverage and to allow functional exploitation

Needed to do ‘em all: ~38,000 non-singletons

8,000 largest -> 50% of the residues that remain!

Genomics: Surprises + our structural perspective changed the ‘world’! The revolutions continue ...


Thanksgiving

$$: NIH/NSF

Data:
Jinfeng Liu (CUBIC)
Hedi Hegyi & Phil Carter (CUBIC), Marc Marti-Renom (UCSF)

NESG:
Guy Montelione (Rutgers)
Barry Honig (Columbia)
Diana Murray (Cornell, NYC)
Tom Acton (Rutgers), Liang Tong & John Hunt (Columbia), George DeTitta (Buffalo), Cheryl Arrowsmith (Toronto)
Wayne Hendrickson (Columbia)

EVA:
Andrej Sali & Marc Marti-Renom (UCSF), Alfonso Valencia (Madrid)
Volker Eyrich, Ingrid Koh & Dariusz Przybylski (CUBIC)

