Comprehensive strategy for integrated target selection in structural genomics
NIH-PSI Target Selection, Nov 13-14, 2003 © Burkhard Rost (Columbia New York)
Comprehensive strategy for integrated target selection
in structural genomics
Burkhard Rost, CUBIC, Columbia University
http://cubic.bioc.columbia.edu/mis/talks/
http://cubic.bioc.columbia.edu
Comprehensive strategy for integrated target selection
Our research goal and current reality
• Unit: sequence-structure families
• Goals: cover entire families with good models
• STAGE 1: CHOP + CLUP + filtering -> novel automatic organization of sequence-structure space
• STAGE 2: Refined, manual selection -> model all family members? stop-work/hold-work?
• STAGE 3: Explore experimental structure
Answers and perspectives
• How many structures needed for completion?
• Euka-proka-archae: overlap?
• Why collaborate on targets?
• Multiplexing helpful?
• High-throughput protein production in eukaryotes?
Computational biology & bioinformatics
Sequence-structure family
Sequence-structure family U’
Sequence-structure family U
EVA: comparative modelling
V Eyrich, MA Marti-Renom, D Przybylski, A Fiser, F Pazos, A Valencia, A Sali & B Rost (2001) Bioinformatics 17, 1242-1243
MA Marti-Renom, MS Madhusudhan, A Fiser, B Rost & A Sali (2002) Structure 10, 435-440
Marc Marti-Renom & Andrej Sali (UCSF)
http://eva.compbio.ucsf.edu/~eva/cm/
http://cubic.bioc.columbia.edu/eva
[Figure: cumulative distributions of accuracy and coverage; PSI-BLAST at 10^-3]
How to decide when we exclude/include?
C Sander & R Schneider 1991 Proteins, 9, 56-68
B Rost 1999 Prot Engng, 12, 85-94
Scooping families from proteomes, in practice
Problems:
• domains
• overlaps
[Figure: plot against number of residues aligned (x-axis, 0 to 250), y-axis 0 to 100. Annotations: "Sequence identity implies structural similarity!" and "Don't know region".]
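The include/exclude decision rests on an identity threshold that depends on alignment length. Below is a minimal Python sketch, assuming the HSSP-curve in the form associated with the Rost 1999 paper cited above. The constants (480, 0.32, 1000), the cut-offs at 11 and 450 residues, and the function names are assumptions here and should be checked against the paper before use.

```python
import math

def hssp_threshold(length: int, n: float = 0.0) -> float:
    """Percentage identity above which an alignment of `length` residues
    is taken to imply structural similarity (assumed HSSP-curve form).
    `n` shifts the curve up or down in percentage points."""
    if length <= 11:
        base = 100.0
    elif length <= 450:
        base = 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    else:
        base = 19.5
    return n + base

def hssp_distance(identity_pct: float, length: int) -> float:
    """Distance of an alignment from the unshifted curve; positive
    values lie in the 'implies structural similarity' region."""
    return identity_pct - hssp_threshold(length)

if __name__ == "__main__":
    for length in (20, 50, 100, 250):
        print(length, round(hssp_threshold(length), 1))
```

Under this reading, a cut-off quoted as an "HSSP-distance" of some value d corresponds to keeping alignments with `hssp_distance(...) >= d`.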
Choose targets: single-linkage clustering
Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press
Liu & Rost 2003 Proteins, submitted
~100,000 eukaryotic proteins
(yeast, fly, worm, weed, human)
22 112 clusters
46 318 in largest cluster
NONSENSE!
Conclusions:
• NO clustering of full-length proteins
• have to chop into structural-domain-like fragments
(single-linkage DOES work on PrISM)
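The failure of single-linkage clustering on full-length proteins can be reproduced on a toy example: multi-domain proteins act as bridges that chain unrelated families into one giant cluster. The sketch below is illustrative only. The proteome, the domain labels, and the rule "linked if they share a domain label" are hypothetical stand-ins for real sequence-similarity links.

```python
from collections import defaultdict
from itertools import combinations

def single_linkage(items, links):
    """Group items into connected components: two items share a cluster
    if any chain of pairwise links connects them (union-find)."""
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for item in items:
        clusters[find(item)].add(item)
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical toy proteome: protein -> domains it contains.
PROTEINS = {
    "P1": ["kinase"],
    "P2": ["kinase", "SH3"],
    "P3": ["SH3"],
    "P4": ["SH3", "PDZ"],
    "P5": ["PDZ"],
    "P6": ["globin"],
}

def cluster_full_length(proteins):
    # Full-length proteins are linked whenever they share any domain.
    links = [(a, b) for a, b in combinations(proteins, 2)
             if set(proteins[a]) & set(proteins[b])]
    return single_linkage(list(proteins), links)

def cluster_fragments(proteins):
    # After chopping, each (protein, domain) fragment links only to
    # fragments of the same domain type.
    fragments = [(p, d) for p, doms in proteins.items() for d in doms]
    links = [(a, b) for a, b in combinations(fragments, 2) if a[1] == b[1]]
    return single_linkage(fragments, links)

if __name__ == "__main__":
    print([len(c) for c in cluster_full_length(PROTEINS)])
    print([len(c) for c in cluster_fragments(PROTEINS)])
```

In the toy data, P2 and P4 bridge the kinase, SH3 and PDZ families when proteins are clustered whole, while the chopped fragments fall into one cluster per domain type.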
CHOP proteins into structural domains
Liu & Rost 2003 Proteins, submitted
CHOP: dissection of proteins into domains
Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press
Liu & Rost 2003 Proteins, submitted
Single-domain proteins:
• 61% in PDB
• 28% in 62 proteomes
Average domain length:
• in proteins with ≥ 2 domains: ~100 residues
• in proteins with 1 domain: 1.7-3 times longer
To take or not to take
Take if > 50 globular residues and no known 3D
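The take rule above can be written as a small predicate. This is a sketch under stated assumptions: the `Fragment` fields and the idea of counting globular residues as total length minus non-globular residues are hypothetical; only the threshold of 50 and the "no known 3D" condition come from the slide.

```python
from dataclasses import dataclass

MIN_GLOBULAR_RESIDUES = 50  # threshold stated on the slide

@dataclass
class Fragment:
    name: str
    length: int                 # total residues in the fragment
    non_globular_residues: int  # hypothetical: e.g. membrane, coil, low-complexity
    has_known_3d: bool          # True if a structure is already known

def take(fragment: Fragment) -> bool:
    """Take the fragment as a target if it has more than 50 globular
    residues and no known 3D structure."""
    globular = fragment.length - fragment.non_globular_residues
    return globular > MIN_GLOBULAR_RESIDUES and not fragment.has_known_3d

if __name__ == "__main__":
    print(take(Fragment("example", 120, 30, False)))
```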
Structural residue coverage in reality (any)
J Liu & B Rost 2002 Bioinformatics, 18, 922-933
53% of residues to do!
~28% ~19%
If you believe 53% is pessimistic ...
53% residue coverage today based on E-value 1!!
Clustering after CHOP
• 103 796 eukaryotic proteins (Yeast, Fly, Worm, Arabidopsis, Human/30)
• 247 222 domain-like fragments
• 167 717 no PDB (E-value 10^-1, HSSP-distance -3)
• 44 718 not good 4 us (membrane, coil, SEG, NORS, signal peptide)
• 122 999 2 go
• 95 330 non-singleton
Liu, Montelione & Rost 2003 Proteins, in press
Jinfeng
21,000 fragment clusters
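The filter cascade on this slide is internally consistent, which a few lines of arithmetic confirm. The counts are taken from the slide; the variable names and the derived singleton count are additions for illustration.

```python
# Counts from the slide (fragments after CHOP).
FRAGMENTS_TOTAL = 247_222
NO_PDB = 167_717          # no match to PDB at the stated thresholds
UNSUITABLE = 44_718       # membrane, coil, SEG, NORS, signal peptide
TO_GO = 122_999           # remaining candidate fragments
NON_SINGLETON = 95_330    # candidates that cluster with at least one other

def summarize():
    """Derive the quantities implied by the slide's counts."""
    remaining = NO_PDB - UNSUITABLE
    return {
        "remaining_after_filters": remaining,
        "matches_slide": remaining == TO_GO,
        "singletons": TO_GO - NON_SINGLETON,
        "fraction_without_pdb": NO_PDB / FRAGMENTS_TOTAL,
    }

if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```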
Computational biology & bioinformatics
Main goal of Stage 2 analysis
Refine Stage 1 automatic target selection through manual sequence analysis
Concept: USE comparative modeling and structural features directly for refined target selection
For each sequence-structure family from Stage 1: predict the minimal set of experimental structures needed to build high-quality models for the entire family.
Diana Murray, Cornell
Toolbox
Input: PDB + NESG cluster
1. Fold recognition and sequence-to-structure profiles
2. Comparative modeling (PrISM, Nest)
3. Structure evaluation tools (e.g. Verify3d)
4. Calculate biophysical properties
Recommend 2 do additional structure if:
1) NESG-cluster members poorly modeled
2) Biophysical properties of models incompatible with known function
3) Models suggest novel functionality
Refinement protocol 4 new 3D
Target re-prioritization based on weekly PDB updates
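The three criteria for recommending an additional structure can be expressed as a simple check. This is a hedged sketch, not the Stage 2 protocol itself: the `FamilyAssessment` fields, the 0-to-1 model score, and the 0.5 cut-off are hypothetical placeholders for whatever the toolbox actually reports.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FamilyAssessment:
    model_scores: List[float]        # hypothetical: one quality score per modeled member
    properties_match_function: bool  # biophysical properties compatible with known function
    suggests_novel_function: bool    # models hint at novel functionality

def reasons_for_additional_structure(assessment: FamilyAssessment,
                                     min_score: float = 0.5) -> List[str]:
    """Return the criteria (possibly none) that argue for solving
    another structure in this family."""
    reasons = []
    if any(score < min_score for score in assessment.model_scores):
        reasons.append("cluster members poorly modeled")
    if not assessment.properties_match_function:
        reasons.append("biophysical properties incompatible with known function")
    if assessment.suggests_novel_function:
        reasons.append("models suggest novel functionality")
    return reasons

if __name__ == "__main__":
    example = FamilyAssessment([0.9, 0.3], True, False)
    print(reasons_for_additional_structure(example))
```

An empty list means none of the three criteria fired for the family.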
Diana Murray, Cornell
Target Status
IR21 solved, PDB: 1MOS
ET28 Purified
JR15 Expressed
TT777 Expressed
GR7 Expressed
AR12 Cloned
WR204 Selected
XR4 Expressed
Stop work
SPINE/ZebaView
Experimental structure of IR21 yielded high-quality models for all members of this NESG sequence/structure family
Example of stop work recommendation
Diana Murray, Cornell
NESG family: HR291 (99% identical to 1P9O), AR1731, HR2295, KR12, DR11 breaks into two clusters: A = (HR291, AR1731, HR2295) and B = (KR12, DR11)
Two structures required to cover family: predicted by Stage 2 analysis and verified by Stage 3 analysis
[Figure: cluster A (HR291, AR1731, HR2295) and cluster B (KR12, DR11)]
Recommendation: Solve structure of KR12 (purified)
Diana Murray, Cornell
Archaeal structure
NESG ID: GR2; PDB ID: 1QXF
Archaeoglobus fulgidus S27e protein has only archaeal and eukaryotic members.
Archae and eukaryotes share conserved hydrophobic motif (yellow).
Only eukaryotes have N-terminal extension, and their models have strikingly different electrostatic properties.
Human protein recommended for structure determination!
Model suggests novel function: 30S ribosomal protein S27
Model for human homologue
Diana Murray, Cornell
Summary Stage 2 refinement
Statistics:
Many families currently under investigation
Hold work recommendation:
• family member at advanced experimental stage
• predicted to yield good models for entire family
-> hold-work for members at early exp. stages
re-assess once structure done!
Diana Murray, Cornell
families targets result
62 200+145 stop-work
40 110 hold-work
12 12 another 3D
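The stop-work and hold-work rules in this summary can be sketched as a per-target decision. Several things here are assumptions: the stage ordering (built from the status labels on the target-status slide), the choice of "Purified" as the first "advanced" stage, and the extra labels "done" and "continue". The example family reuses the statuses from the stop-work example.

```python
# Assumed ordering of experimental stages, earliest to latest.
STAGES = ["Selected", "Cloned", "Expressed", "Purified", "Solved"]

def recommend(family, good_models_for_all, advanced_stage="Purified"):
    """family: dict mapping target ID -> current stage.
    good_models_for_all: True if the most advanced member is (or is
    predicted to be) enough to model the entire family well.
    Returns a dict mapping target ID -> recommendation."""
    rank = {stage: i for i, stage in enumerate(STAGES)}
    best = max(rank[stage] for stage in family.values())

    if not good_models_for_all:
        return {target: "continue" for target in family}
    if best == rank["Solved"]:
        # A solved structure covers the family: stop all other targets.
        return {t: ("done" if s == "Solved" else "stop-work")
                for t, s in family.items()}
    if best >= rank[advanced_stage]:
        # An advanced member is expected to cover the family: hold early ones.
        return {t: ("continue" if rank[s] >= rank[advanced_stage] else "hold-work")
                for t, s in family.items()}
    return {target: "continue" for target in family}

IR21_FAMILY = {
    "IR21": "Solved", "ET28": "Purified", "JR15": "Expressed",
    "TT777": "Expressed", "GR7": "Expressed", "AR12": "Cloned",
    "WR204": "Selected", "XR4": "Expressed",
}

if __name__ == "__main__":
    print(recommend(IR21_FAMILY, good_models_for_all=True))
```

Held targets would be re-assessed once the advanced member's structure is done, as the slide notes.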
Computational biology & bioinformatics
Exploit structure to speculate about function
43 no previous annotation about function (defined by ‘no publication in biological journal’)
39 analyzed
31 result in some predictions about function
• 8 clear success: functional annotation achieved, e.g. predicted active site based on structure; typically: confirmation of annotation transfer
• 23 some hints (16 ‘hypothetical proteins’), e.g. some clue about active site; mostly completely new!
8 no clue
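The tallies on this slide add up, and the implied success share follows directly. The counts are from the slide; the helper name `share` is an addition for illustration.

```python
ANALYZED = 39
OUTCOMES = {"clear success": 8, "some hints": 23, "no clue": 8}

def share(count: int, total: int) -> int:
    """Percentage of `total`, rounded to the nearest integer."""
    return round(100 * count / total)

# Structures that yielded at least some prediction about function.
WITH_PREDICTION = OUTCOMES["clear success"] + OUTCOMES["some hints"]

if __name__ == "__main__":
    print(WITH_PREDICTION, sum(OUTCOMES.values()))
    print(share(WITH_PREDICTION, ANALYZED), "% with some prediction")
```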
Sharon Goldsmith & Barry Honig
Answers
How many structures needed for completion?
Euka-proka-archae: overlap?
Why collaborate on targets?
Multiplexing helpful?
High-throughput protein production in eukaryotes?
How many targets for prokaryotes + archae?
16,000 min
8,000 give:
• 72% fragments
• 72% proteins
• 67% residues
How many targets for euka-proka-archae?
8,000 give:
• 67% fragments
• 67% proteins
• 59% residues
BUT: 50% of residues remaining
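Statements of the form "the N largest families give X% coverage" come from a cumulative curve over cluster sizes sorted in descending order. A minimal sketch of that calculation follows; the toy cluster sizes are hypothetical and the function names are illustrative, not part of the original analysis.

```python
def coverage_by_top_k(cluster_sizes, k):
    """Fraction of all members covered by the k largest clusters."""
    ordered = sorted(cluster_sizes, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0

def targets_needed(cluster_sizes, target_coverage):
    """Smallest number of clusters (largest first) whose members
    reach the requested coverage fraction."""
    ordered = sorted(cluster_sizes, reverse=True)
    total = sum(ordered)
    covered = 0
    for k, size in enumerate(ordered, start=1):
        covered += size
        if covered >= target_coverage * total:
            return k
    return len(ordered)

# Hypothetical toy distribution: a few big families, many small ones.
TOY_SIZES = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]

if __name__ == "__main__":
    print(coverage_by_top_k(TOY_SIZES, 2))
    print(targets_needed(TOY_SIZES, 0.67))
```

The same curve can be computed over fragments, proteins, or residues by changing what each cluster size counts.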
Overlap between euka-proka-archae?
• surprisingly small overlap overall
• even lower for largest families
• most big families are eukaryotic!
~60% of fragments from eukaryotes: no sequence-structure family member from prokaryotes or archae
much higher for ‘largest 8,000’:
• 2,690 (34%) proka+archae only
• 4,277 (53%) euka only
• 1,033 (13%) mix
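The three-way split of the largest 8,000 families can be reproduced from the counts, and the classification itself is a simple set test. In this sketch the counts come from the slide, while the kingdom labels ("euka", "proka", "archae") and function names are hypothetical.

```python
def kingdom_class(kingdoms):
    """Classify a family by the kingdoms its members come from."""
    present = set(kingdoms)
    if not present:
        raise ValueError("family has no members")
    if present == {"euka"}:
        return "euka only"
    if "euka" not in present:
        return "proka+archae only"
    return "mix"

# Counts for the 'largest 8,000' families, from the slide.
SLIDE_COUNTS = {"proka+archae only": 2690, "euka only": 4277, "mix": 1033}

def percentages(counts):
    """Rounded percentage of the total for each class."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}

if __name__ == "__main__":
    print(sum(SLIDE_COUNTS.values()))
    print(percentages(SLIDE_COUNTS))
```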
Why collaborate on target list?
competition between consortia has already hampered success-rate considerably!
32% overlap
Does multiplexing help?
Date: 2003-07-28
Multiplex DOUBLES success rate!
~4%
Integrated strategy
NESG unique, comprehensive, integrated strategy optimized to organize sequence space in structural terms:
Stage 1: CHOP+CLUP+filter yields high success in focusing on sequence-structure families
Stage 2: detailed refinement embeds comparative models into selection and optimizes structural coverage for family
Stage 3: use experimental structure to increase structural family coverage and to allow functional exploitation
Needed to do ‘em all: ~38,000 non-singletons
8,000 largest -> 50% of the residues that remain!
Genomics: Surprises + our structural perspective changed the ‘world’! The revolutions continue ...
Thanksgiving
$$: NIH/NSF
Data:
Jinfeng Liu (CUBIC)
Hedi Hegyi & Phil Carter (CUBIC), Marc Marti-Renom (UCSF)
NESG:
Guy Montelione (Rutgers)
Barry Honig (Columbia)
Diana Murray (Cornell, NYC)
Tom Acton (Rutgers), Liang Tong & John Hunt (Columbia), George DeTitta (Buffalo), Cheryl Arrowsmith (Toronto)
Wayne Hendrickson (Columbia)
EVA:
Andrej Sali & Marc Marti-Renom (UCSF), Alfonso Valencia (Madrid)
Volker Eyrich, Ingrid Koh & Dariusz Przybylski (CUBIC)