comprehensive strategy for integrated target selection in structural genomics

NIH-PSI Target Selection, Nov 13-14, 2003 © Burkhard Rost (Columbia New York). Comprehensive strategy for integrated target selection in structural genomics. Burkhard Rost, CUBIC, Columbia University. http://cubic.bioc.columbia.edu/mis/talks/ http://cubic.bioc.columbia.edu

Page 1: Comprehensive strategy for integrated target selection in structural genomics

NIH-PSI Target Selection, Nov 13-14, 2003 © Burkhard Rost (Columbia New York)

Comprehensive strategy for integrated target selection in structural genomics

Burkhard Rost, CUBIC, Columbia University

http://cubic.bioc.columbia.edu/mis/talks/

http://cubic.bioc.columbia.edu

Page 2: Comprehensive strategy for integrated target selection in structural genomics

Comprehensive strategy for integrated target selection

Our research goal and current reality
• Unit: sequence-structure families
• Goals: cover all entire families with good models
• STAGE 1: CHOP + CLUP + filtering -> novel automatic organization of sequence-structure space
• STAGE 2: Refined, manual selection -> model all family members? stop-work/hold-work?
• STAGE 3: Explore experimental structure

Answers and perspectives
• How many structures needed for completion?
• Euka-proka-archae: overlap?
• Why collaborate on targets?
• Multiplexing helpful?
• High-throughput protein production in eukaryotes?

Page 3: Comprehensive strategy for integrated target selection in structural genomics

Computational biology & bioinformatics

Page 4: Comprehensive strategy for integrated target selection in structural genomics

Sequence-structure family

Sequence-structure family U’

Sequence-structure family U

Page 5: Comprehensive strategy for integrated target selection in structural genomics

EVA: comparative modelling

V Eyrich, MA Marti-Renom, D Przybylski, A Fiser, F Pazos, A Valencia, A Sali & B Rost (2001) Bioinformatics 17, 1242-1243
MA Marti-Renom, MS Madhusudhan, A Fiser, B Rost, A Sali (2002) Structure 10, 435-440

Marc Marti Renom & Andrej Sali (UCSF)
http://eva.compbio.ucsf.edu/~eva/cm/

http://cubic.bioc.columbia.edu/eva

[Figure: cumulative distribution of accuracy and coverage; PSI-BLAST 10^-3]

Page 6: Comprehensive strategy for integrated target selection in structural genomics

How to decide when we exclude/include?

C Sander & R Schneider 1991 Proteins, 9, 56-68
B Rost 1999 Prot Engng, 12, 85-94

Page 7: Comprehensive strategy for integrated target selection in structural genomics

Scooping families from proteomes, in practice

Problems:
• domains
• overlaps

[Figure: y-axis ticks 0-100, x-axis "Number of residues aligned" (0-250). Region labels: "Sequence identity implies structural similarity!" and "Don't know region".]
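The boundary between the two regions of this plot can be written down explicitly. The sketch below assumes the length-dependent threshold published in Rost 1999 (Prot Engng 12, 85-94), which is cited on the previous slide; the function names are illustrative and do not come from the talk.

```python
import math

def hssp_threshold(n_aligned: int) -> float:
    """Percent sequence identity on the curve for a given alignment length.

    Assumed form (Rost 1999): 100 for very short alignments, a power law
    up to 450 aligned residues, and a flat 19.5% beyond that.
    """
    if n_aligned <= 11:
        return 100.0
    if n_aligned > 450:
        return 19.5
    exponent = -0.32 * (1.0 + math.exp(-n_aligned / 1000.0))
    return 480.0 * n_aligned ** exponent

def hssp_distance(percent_identity: float, n_aligned: int) -> float:
    """Distance from the curve; positive values lie in the 'implies similarity' region."""
    return percent_identity - hssp_threshold(n_aligned)

# 40% identity over 100 aligned residues lies above the curve,
# 20% identity over the same length lies below it.
print(hssp_distance(40.0, 100) > 0)
print(hssp_distance(20.0, 100) > 0)
```

A cut-off such as "HSSP-distance -3" then corresponds to accepting pairs with `hssp_distance(...) >= -3`.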

Page 8: Comprehensive strategy for integrated target selection in structural genomics

Choose targets: single-linkage clustering

Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press
Liu & Rost 2003 Proteins, submitted

~100,000 eukaryotic proteins (yeast, fly, worm, weed, human)

22 112 clusters

46 318 in largest cluster

NONSENSE!

Conclusions:
• NO clustering of full-length proteins
• have to chop into structural-domain-like fragments

(single-linkage DOES work on PrISM)
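Why single-linkage clustering of full-length proteins produces one giant cluster can be seen in a toy case. The sketch below is an illustration with made-up proteins, not the pipeline used in the talk: protein B is assumed to carry two domains, one shared with A and one shared with C, so A and C end up in the same cluster although they share nothing.

```python
def single_linkage(items, similar_pairs):
    """Group items so that any chain of pairwise links joins a cluster (union-find)."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical data: A~B through one domain, B~C through another.
proteins = ["A", "B", "C", "D"]
pairs = [("A", "B"), ("B", "C")]
clusters = single_linkage(proteins, pairs)
print(clusters)
```

Chopping B into its two domain-like fragments before clustering would remove the bridge between A and C.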

Page 9: Comprehensive strategy for integrated target selection in structural genomics

CHOP proteins into structural domains

Liu & Rost 2003 Proteins, submitted

Page 10: Comprehensive strategy for integrated target selection in structural genomics

CHOP: dissection of proteins into domains

Liu, Hegyi, Acton, Montelione & Rost 2003 Proteins, in press
Liu & Rost 2003 Proteins, submitted

Single-domain proteins:
• 61% in PDB
• 28% in 62 proteomes

Average domain length
• in proteins ≥ 2 domains: ~100 residues
• in proteins with 1 domain: 1.7-3 times longer

Page 11: Comprehensive strategy for integrated target selection in structural genomics

To take or not to take

Take if > 50 globular residues and no known 3D
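The rule on this slide reduces to a two-condition predicate. A minimal sketch follows; the function and argument names are hypothetical, and how globular residues are counted (for instance after masking membrane, coiled-coil, or low-complexity regions) is left to the caller.

```python
def take_fragment(n_globular_residues: int, has_known_3d: bool) -> bool:
    """Return True if a fragment qualifies as a target under the slide's rule."""
    return n_globular_residues > 50 and not has_known_3d

print(take_fragment(120, False))  # long enough, no structure known
print(take_fragment(120, True))   # structure already known
print(take_fragment(50, False))   # not more than 50 globular residues
```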

Page 12: Comprehensive strategy for integrated target selection in structural genomics

Structural residue coverage in reality (any)

J Liu & B Rost 2002 Bioinformatics, 18, 922-933

53% of residues to do!

[Figure labels: ~28%, ~19%]

Page 13: Comprehensive strategy for integrated target selection in structural genomics

If you believe 53% is pessimistic ...

53% residue coverage today based on E-value 1!!

Page 14: Comprehensive strategy for integrated target selection in structural genomics

Clustering after CHOP

• 103 796 eukaryotic proteins (Yeast, Fly, Worm, Arabidopsis, Human/30)
• 247 222 domain-like fragments
• 167 717 no PDB (E-value 10^-1, HSSP-distance -3)
• 44 718 not good for us (membrane, coil, SEG, NORS, signal peptide)
• 122 999 to go
• 95 330 non-singleton

Liu, Montelione & Rost 2003 Proteins, in press

Jinfeng

21,000 fragment clusters
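The counts on this slide form a filtering funnel that can be checked with simple arithmetic. The sketch below assumes that the 44 718 unsuitable fragments are a subset of the 167 717 fragments without a PDB match, which is what the reported remainder implies; the variable names are illustrative.

```python
fragments = 247_222      # domain-like fragments after CHOP
no_pdb = 167_717         # fragments without a PDB match
unsuitable = 44_718      # membrane, coil, SEG, NORS, signal peptide
non_singleton = 95_330   # remaining fragments in clusters with more than one member

to_go = no_pdb - unsuitable
share_no_pdb = round(100 * no_pdb / fragments, 1)
share_non_singleton = round(100 * non_singleton / to_go, 1)

print(to_go)                 # 122999, matching the slide
print(share_no_pdb)          # percent of fragments without a PDB match
print(share_non_singleton)   # percent of remaining fragments that are non-singletons
```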

Page 15: Comprehensive strategy for integrated target selection in structural genomics

Computational biology & bioinformatics

Page 16: Comprehensive strategy for integrated target selection in structural genomics

Main goal of Stage 2 analysis

Refine Stage 1 automatic target selection through manual sequence analysis

Concept: USE comparative modeling and structural features directly for refined target selection

For each sequence-structure family from Stage 1: predict minimal set of exp. structures needed to high-quality model entire family.

Diana Murray, Cornell
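Choosing a minimal set of experimental structures that lets every family member be modeled can be phrased as a set-cover problem. The sketch below uses a greedy heuristic as a stand-in; it is not the Stage 2 protocol from the talk, which relies on manual analysis. The input map (which template models which members well) is hypothetical and would in practice come from comparative-modeling evaluation. The toy family reuses the target names HR291, AR1731, HR2295, KR12 and DR11 that appear later in the talk.

```python
def greedy_structure_set(family, models_well):
    """Pick templates greedily until all family members are covered.

    family: iterable of member IDs.
    models_well: dict mapping a candidate template to the set of members
                 it is assumed to model at high quality (including itself).
    Returns (chosen templates, members left uncovered).
    """
    uncovered = set(family)
    chosen = []
    while uncovered:
        # Sorted keys make tie-breaking deterministic.
        best = max(sorted(models_well), key=lambda t: len(models_well[t] & uncovered))
        gained = models_well[best] & uncovered
        if not gained:
            break  # no candidate helps any further
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered

family = ["HR291", "AR1731", "HR2295", "KR12", "DR11"]
models_well = {  # assumed for illustration only
    "HR291": {"HR291", "AR1731", "HR2295"},
    "KR12": {"KR12", "DR11"},
}
chosen, uncovered = greedy_structure_set(family, models_well)
print(chosen, uncovered)
```

Greedy selection is not guaranteed to find the smallest possible set, but it is simple and usually close for small families.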

Page 17: Comprehensive strategy for integrated target selection in structural genomics

Toolbox
Input: PDB + NESG cluster

1. Fold recognition and sequence-to-structure profiles
2. Comparative modeling (PrISM, Nest)
3. Structure evaluation tools (e.g. Verify3d)
4. Calculate biophysical properties

Recommend to do additional structure if:
1) NESG-cluster members poorly modeled
2) Biophysical properties of models incompatible with known function
3) Models suggest novel functionality

Refinement protocol for new 3D
Target re-prioritization based on weekly PDB updates

Diana Murray, Cornell

Page 18: Comprehensive strategy for integrated target selection in structural genomics

Example of stop work recommendation

Target   Status
IR21     solved, PDB: 1MOS
ET28     Purified
JR15     Expressed
TT777    Expressed
GR7      Expressed
AR12     Cloned
WR204    Selected
XR4      Expressed

Stop work

SPINE/ZebaView

Experimental structure of IR21 yielded high-quality models for all members of this NESG sequence/structure family

Diana Murray, Cornell

Page 19: Comprehensive strategy for integrated target selection in structural genomics

NESG family: HR291 (99% identical to 1P9O), AR1731, HR2295, KR12, DR11 breaks into two clusters:
A = (HR291, AR1731, HR2295)
B = (KR12, DR11)

Two structures required to cover family: predicted by Stage 2 analysis and verified by Stage 3 analysis

[Figure: panels labelled with cluster A members (HR291, AR1731, HR2295) and cluster B members (KR12, DR11)]

Recommendation: Solve structure of KR12 (purified)

Diana Murray, Cornell

Page 20: Comprehensive strategy for integrated target selection in structural genomics

Archaeal structure
NESG ID: GR2; PDB ID: 1QXF

Archaeoglobus fulgidus S27e protein has only archaeal and eukaryotic family members.

Archaea and eukaryotes share a conserved hydrophobic motif (yellow).

Only eukaryotes have an N-terminal extension, and their models have strikingly different electrostatic properties.

Human protein recommended for structure determination!

Model suggests novel function: 30S ribosomal protein S27

Model for human homologue

Diana Murray, Cornell

Page 21: Comprehensive strategy for integrated target selection in structural genomics

Summary Stage 2 refinement

Statistics:

families   targets   result
62         200+145   stop-work
40         110       hold-work
12         12        another 3D

Many families currently under investigation

Hold work recommendation:
• family member at advanced experimental stage
• predicted to yield good models for entire family
-> hold-work for members at early exp. stages; re-assess once structure done!

Diana Murray, Cornell
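The three Stage 2 outcomes (stop-work, hold-work, another 3D) can be summarized as a small decision rule. The sketch below is one reading of the slides, not the actual NESG procedure: the argument names are hypothetical, and the fallback label "continue" is a placeholder added here for families to which none of the three recommendations applies.

```python
def stage2_recommendation(has_structure: bool,
                          all_members_well_modeled: bool,
                          advanced_member_expected_to_cover: bool) -> str:
    """Map the state of a sequence-structure family to a recommendation."""
    if has_structure and all_members_well_modeled:
        return "stop-work"      # existing structure models the whole family
    if has_structure:
        return "another 3D"     # some members are still poorly modeled
    if advanced_member_expected_to_cover:
        return "hold-work"      # wait for the member at an advanced stage
    return "continue"           # placeholder: no recommendation applies yet

print(stage2_recommendation(True, True, False))
print(stage2_recommendation(True, False, False))
print(stage2_recommendation(False, False, True))
```

A hold-work decision is provisional and would be re-evaluated once the pending structure is solved.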

Page 22: Comprehensive strategy for integrated target selection in structural genomics

Computational biology & bioinformatics

Page 23: Comprehensive strategy for integrated target selection in structural genomics

Exploit structure to speculate about function

43 with no previous annotation about function (defined by ‘no publication in biological journal’)

39 analyzed; 31 result in some predictions about function

• 8 clear success: functional annotation achieved, e.g. predicted active site based on structure; typically: confirmation of annotation transfer
• 23 some hints (16 ‘hypothetical proteins’), e.g. some clue about active site; mostly completely new!
• 8 no clue

Sharon Goldsmith & Barry Honig

Page 24: Comprehensive strategy for integrated target selection in structural genomics

Answers

How many structures needed for completion?

Euka-proka-archae: overlap?

Why collaborate on targets?

Multiplexing helpful?

High-throughput protein production in eukaryotes?

Page 25: Comprehensive strategy for integrated target selection in structural genomics

How many targets for prokaryotes + archae?

16,000 min 8,000 give:

• 72% fragments
• 72% proteins
• 67% residues

Page 26: Comprehensive strategy for integrated target selection in structural genomics

How many targets for euka-proka-archae?

8,000 give:
• 67% fragments
• 67% proteins
• 59% residues

BUT: 50% of residues remaining
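Statements of the form "the largest N families give X% of fragments" come from a cumulative sum over cluster sizes sorted in descending order. The sketch below shows that calculation on made-up sizes; the function name and the toy numbers are illustrative and are not the data behind the slide.

```python
def coverage_of_largest(cluster_sizes, n_targets):
    """Fraction of all fragments that fall into the n_targets largest clusters."""
    ordered = sorted(cluster_sizes, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:n_targets]) / total

# Hypothetical cluster sizes: a few large families and a tail of small ones.
sizes = [50, 30, 10, 5, 3, 1, 1]
top2 = coverage_of_largest(sizes, 2)
print(top2)
```

The same function applied to residue counts per cluster instead of fragment counts would give the residue coverage.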

Page 27: Comprehensive strategy for integrated target selection in structural genomics

Overlap between euka-proka-archae?

• surprisingly small overlap overall
• even lower for largest families
• most big families are eukaryotic!

~60% of fragments from eukaryotes have no sequence-structure family member from prokaryotes or archaea

much higher for ‘largest 8,000’:
• 2,690 (34%) proka+archae only
• 4,277 (53%) euka only
• 1,033 (13%) mix
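The three counts for the largest 8,000 families can be checked against the quoted percentages. A short worked calculation, with dictionary keys copied from the slide:

```python
counts = {
    "proka+archae only": 2690,
    "euka only": 4277,
    "mix": 1033,
}
total = sum(counts.values())
shares = {label: round(100 * n / total) for label, n in counts.items()}

print(total)    # 8000
print(shares)   # rounded percentages per category
```

The counts add up to exactly 8,000, and only about one family in eight contains both eukaryotic and prokaryotic or archaeal members.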

Page 28: Comprehensive strategy for integrated target selection in structural genomics

Why collaborate on target list?

competition between consortia has already hampered success-rate considerably!

32% overlap

Page 29: Comprehensive strategy for integrated target selection in structural genomics

Does multiplexing help?

Date: 2003-07-28

Multiplex DOUBLES success rate!

~4%

Page 30: Comprehensive strategy for integrated target selection in structural genomics

Integrated strategy

NESG unique, comprehensive, integrated strategy optimized to organize sequence space in structural terms:

Stage 1: CHOP+CLUP+filter yields high success in focusing on sequence-structure families
Stage 2: detailed refinement embeds comparative models into selection and optimizes structural coverage for family
Stage 3: use experimental structure to increase structural family coverage and to allow functional exploitation

Needed to do ‘em all: ~38,000 non-singletons

8,000 largest -> 50% of the residues that remain!

Genomics: Surprises + our structural perspective changed the ‘world’! The revolutions continue ...

Page 31: Comprehensive strategy for integrated target selection in structural genomics

Thanksgiving

$$: NIH/NSF

Data:
Jinfeng Liu (CUBIC)
Hedi Hegyi & Phil Carter (CUBIC), Marc Marti-Renom (UCSF)

NESG:
Guy Montelione (Rutgers)
Barry Honig (Columbia)
Diana Murray (Cornell, NYC)
Tom Acton (Rutgers), Liang Tong & John Hunt (Columbia), George DeTitta (Buffalo), Cheryl Arrowsmith (Toronto)
Wayne Hendrickson (Columbia)

EVA:
Andrej Sali & Marc Marti-Renom (UCSF), Alfonso Valencia (Madrid)
Volker Eyrich, Ingrid Koh & Dariusz Przybylski (CUBIC)
