Design of a Novel Globular Protein Fold with Atomic-Level Accuracy
Science (2003). 302: 1364-1368.
Brian Kuhlman, Gautam Dantas, Gregory C. Ireton, Gabriele Varani, Barry L. Stoddard, David Baker
Overview Background Initialization Iteration Results Conclusion
Question:
✦ Can we design 3D protein structures that have never been observed?
✦ Are the 3D structures that have not been observed in nature unattainable, or just unsampled?
Goal:
✦ Given a novel 3D topology, develop a corresponding protein sequence.
Choose a topology that does not exist in known protein structures.
Find the amino-acid sequence that leads to this topology.
Approach:
✦ Iterate between sequence optimization and structure prediction.
Develop sequence-structure pairs with low free energy.
Low free energy corresponds to high structural stability.
Structure Redesign: Reina, et al. (2002).
✦ Redesigned the binding site of a known fold.
Two PDZ-ligand complexes, with the proposed mutations (green).
✦ Starting point: PDZ domain of PSD-95 protein.
Binds to a specific C-terminal motif of target proteins.
✦ Computational process (Perla).
Manually identified residues that interact with ligand.
New amino acids chosen for each position based on:
• geometry
• conformation
Further restricted set by pairwise comparisons.
Ranked resulting structures based on free energy of complex.
Repeated process with resulting set of structures.
✦ Specific approach; difficult to generalize.
J. Reina, et al. Nature Struct. Biol., 9(8): 621-627, 2002.
Sequence prediction: Dahiyat & Mayo (1997).
✦ Computed the optimal sequence for a known fold.
Zif268 zinc-finger domain II (top) and FSD-1 NMR structure (bottom).
Figure: page excerpt from Dahiyat & Mayo, Science 278 (3 October 1997), p. 84, showing Fig. 2 (comparison of Zif268 and computed FSD-1 structures), Table 1 (NMR structure determination: distance restraints, structural statistics, and atomic rms deviations), and Fig. 3 (circular dichroism measurements of FSD-1; melting temperature from the two-state fit is 42°C).
✦ Starting point: Zinc-finger domain of Zif268.
Used a small domain from the PDB as a template.
✦ Computational process.
Started with all possible combinations at all positions.
Restricted to allowed rotamers.
Limited sequence space by dead-end elimination.
Optimized sequence using various interactions:
• backbone-backbone
• backbone-side chain
• side chain-side chain
✦ Structure does not change during computation.
B. I. Dahiyat and S. L. Mayo. Science, 278(5335): 82–87, 1997.
Designing a protein from scratch: this study.
✦ Why might a target structure (or fold) be unknown?
Maybe it just hasn't been sampled by scientists yet.
Maybe it hasn't been tried during protein evolution.
Maybe it is not designable at all.
✦ How do we design with the fewest initial constraints?
Previous studies have started from a small set of sequences or structures.
Not necessarily a globally optimal fold or sequence.
Not widely applicable, but specific (e.g., for a certain function).
✦ Obstacles to pure de novo design:
Must vary both sequence and structure space during design process.
Potentially large computational cost.
General algorithm design.
1. Choose topology.
2. Generate starting models.
3. Design sequences.
4. Rank.
5. Select lowest energy sequence.
6. Optimize model, then return to step 3 (design sequence).
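The alternation between sequence design and model optimization can be sketched as a small driver loop. This is an illustration under stated assumptions, not the Rosetta implementation: the function names (`alternate`, `toy_design`, `toy_optimize`, `toy_energy`) and the toy energy are hypothetical stand-ins, and the default of 15 rounds is taken from the round count quoted later in the talk.

```python
def alternate(model, design_sequence, optimize_backbone, energy, rounds=15):
    """Alternate fixed-backbone design with fixed-sequence backbone optimization."""
    sequence = design_sequence(model)
    history = [energy(sequence, model)]
    for _ in range(rounds):
        model = optimize_backbone(sequence, model)  # sequence fixed, backbone moves
        sequence = design_sequence(model)           # backbone fixed, sequence redesigned
        history.append(energy(sequence, model))
    return sequence, model, history


# Toy stand-ins: a "model" is a list of floats, a "sequence" a list of ints 0..3.
ALPHABET = range(4)

def toy_energy(sequence, model):
    return sum((s - m) ** 2 for s, m in zip(sequence, model))

def toy_design(model):
    # Best letter per position for the current model.
    return [min(ALPHABET, key=lambda a: (a - m) ** 2) for m in model]

def toy_optimize(sequence, model):
    # Move each coordinate halfway toward the value its letter prefers.
    return [m + 0.5 * (s - m) for s, m in zip(sequence, model)]

sequence, model, history = alternate([0.4, 2.7, 1.2], toy_design, toy_optimize, toy_energy)
print(sequence, history[0], history[-1])
```

Passing the design and optimization steps as callables keeps the loop itself independent of how each step is implemented.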
Choose a target topology.
Novel α/β topology chosen as the design template. Initial constraints on the structure were defined by hydrogen bond interactions (purple arrows).
Generate starting models from topology.
✦ Chose fragments of existing proteins that fit initial constraints.
Used fragments of 3-9 amino acids in length.
Taken from structures in PDB.
Fit secondary structures in topology diagram.
✦ Assembled fragments into set of backbone models.
Used Rosetta software to build 172 starting models.
Models are backbone only; side chain packing is ignored at this step.
Models are fairly close structurally (RMSDs of 2-3 Å).
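For reference, the RMSD used to compare models is the root-mean-square distance between matched atoms. The sketch below assumes the two coordinate sets are already superimposed and listed in the same atom order; a real comparison would first find the optimal superposition (for example with the Kabsch algorithm), which is omitted here. The function name and coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Two made-up two-atom "backbones", offset by 2 Å along z.
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
print(rmsd(model_a, model_b))
```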
Step 3: Generate a sequence for each model.
✦ Used RosettaDesign to generate sequence.
RosettaDesign uses a Monte Carlo search method and energy function.
The Monte Carlo method is a stochastic method for solving computational problems with many variables.
Generally solved by taking lots of random samples and analyzing patterns in the result.
✦ RosettaDesign energy function based on:
Lennard-Jones potential.
Hydrogen bonding.
Implicit solvation.
✦ Further restricted certain positions.
Only polar residues allowed in surface β sheets.
Cysteines disallowed.
Figure: Lennard-Jones potential for Ar (http://en.wikipedia.org/wiki/Lennard_Jones_potential).
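The standard 12-6 Lennard-Jones form is V(r) = 4ε[(σ/r)^12 − (σ/r)^6], where ε is the well depth and σ is the separation at which the energy is zero. The sketch below evaluates it for argon. The parameter values are approximate, commonly quoted ones and are an assumption here, not taken from the slides; the slides also do not say whether RosettaDesign uses this exact form or a modified one.

```python
def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy at separation r (r and sigma in the same length unit)."""
    if r <= 0:
        raise ValueError("separation must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# Approximate argon parameters (assumed for illustration).
EPSILON_AR = 0.238  # kcal/mol
SIGMA_AR = 3.4      # Å

r_min = 2 ** (1 / 6) * SIGMA_AR  # separation at the energy minimum
well_depth = lennard_jones(r_min, EPSILON_AR, SIGMA_AR)
print(r_min, well_depth)
```

The steep r^-12 term is why small atomic overlaps dominate the repulsive energy, and why relieving a few clashes can change that term substantially.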
Rank and select.
✦ Sequences ranked according to free energy.
Initial (starting) sequences had higher free energies than natural proteins.
Explained by lack of side chain packing constraints in design.
This reversed during backbone optimization, which lowered the energies.
✦ Sequence with the lowest free energy used in optimization.
First round uses the starting model that generated this sequence.
Subsequent rounds use the optimized model (input structure).
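The design-then-rank step can be sketched with a toy Monte Carlo sequence search followed by sorting on energy. Everything here is an assumption for illustration: the two-letter alphabet (H for hydrophobic, P for polar), the mismatch-counting energy, the Metropolis acceptance rule with a fixed temperature, and the names `toy_energy` and `mc_design`. It is not the RosettaDesign energy function or search protocol.

```python
import math
import random

def toy_energy(seq, burial):
    # Hypothetical score: +1 for each polar residue at a buried site
    # or hydrophobic residue at an exposed site.
    return sum(1 for aa, buried in zip(seq, burial) if (aa == "H") != buried)

def mc_design(burial, steps, kT, rng):
    """Metropolis Monte Carlo over sequences on a fixed burial pattern."""
    seq = [rng.choice("HP") for _ in burial]
    e = toy_energy(seq, burial)
    for _ in range(steps):
        pos = rng.randrange(len(seq))
        trial = seq[:]
        trial[pos] = "P" if seq[pos] == "H" else "H"  # single-site mutation
        e_trial = toy_energy(trial, burial)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_trial <= e or rng.random() < math.exp(-(e_trial - e) / kT):
            seq, e = trial, e_trial
    return e, "".join(seq)


burial = [True, False, True, True, False, False, True, False]
designs = [mc_design(burial, steps=200, kT=0.5, rng=random.Random(seed))
           for seed in range(5)]
ranked = sorted(designs)                 # rank by energy, lowest first
best_energy, best_sequence = ranked[0]   # select the lowest-energy design
print(ranked)
```

Seeding each run separately makes the independent searches reproducible while still sampling different trajectories.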
Optimize backbone model.
✦ Goal: Identify lowest free energy model for a fixed sequence.
✦ Process: Perturb, relax, repeat.
1. Perturb the structure.
• random change to 1-5 torsion angles; or
• replace 1-3 random torsion angles with selection from the PDB.
2. Optimize any high-energy side chains.
• cycle through each position in the model that has a higher energy after perturbation.
• replace side chain with lowest energy rotamer.
3. Optimize region around the site of perturbation.
• minimize energy in a 10-residue window around the insertion.
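A minimal sketch of the perturb, relax, repeat cycle on a toy backbone of torsion angles follows. Several things are assumptions: the cosine energy with made-up target angles, the coordinate-descent relaxation, and the rule of keeping a trial only if it lowers the energy (the slides do not state the acceptance criterion). Step 2, side-chain repacking, is omitted because the toy model has no side chains.

```python
import math
import random

def energy(angles, targets):
    # Toy torsional energy: zero when every angle matches its target (degrees).
    return sum(1.0 - math.cos(math.radians(a - t)) for a, t in zip(angles, targets))

def perturb(angles, rng):
    """Randomly change 1-5 torsion angles; return the trial and the perturbed sites."""
    trial = angles[:]
    sites = rng.sample(range(len(trial)), rng.randint(1, min(5, len(trial))))
    for i in sites:
        trial[i] += rng.gauss(0.0, 20.0)
    return trial, sites

def relax(angles, targets, sites, window=10, step=2.0, sweeps=5):
    """Coordinate-descent minimization restricted to windows around the perturbed sites."""
    angles = angles[:]
    half = window // 2
    idx = sorted({j for i in sites
                  for j in range(max(0, i - half), min(len(angles), i + half))})
    current = energy(angles, targets)
    for _ in range(sweeps):
        for j in idx:
            for delta in (step, -step):
                trial = angles[:]
                trial[j] += delta
                e = energy(trial, targets)
                if e < current:
                    angles, current = trial, e
    return angles

def optimize_backbone(angles, targets, rounds, rng):
    best, best_e = angles[:], energy(angles, targets)
    for _ in range(rounds):
        trial, sites = perturb(best, rng)
        trial = relax(trial, targets, sites)
        e = energy(trial, targets)
        if e < best_e:  # greedy acceptance (assumed)
            best, best_e = trial, e
    return best, best_e


rng = random.Random(0)
targets = [-60.0] * 10 + [-120.0] * 10
start_angles = [rng.uniform(-180.0, 180.0) for _ in targets]
start_energy = energy(start_angles, targets)
final_angles, final_energy = optimize_backbone(start_angles, targets, rounds=30, rng=rng)
print(start_energy, final_energy)
```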
Crunching the numbers.
✦ 172 starting models.
✦ 5 simulations for each model: 860 simulations.
✦ 15 rounds of sequence-design/model-optimization for each simulation: 12,900 rounds.
✦ "Several thousand" minimizations per optimization: 1 × 10^7 minimizations to produce a final sequence-structure pair.
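The counts above multiply out as follows. The per-round minimization counts tried in the loop are assumed values standing in for "several thousand"; they are not figures from the paper.

```python
starting_models = 172
simulations_per_model = 5
rounds_per_simulation = 15

simulations = starting_models * simulations_per_model  # 860
rounds = simulations * rounds_per_simulation           # 12,900

# "Several thousand" is not pinned down, so try a few assumed values.
totals = {per_round: rounds * per_round for per_round in (1_000, 3_000, 5_000)}
for per_round, total in totals.items():
    print(f"{per_round} minimizations per round -> {total:.2e} in total")
```

Any value in this range gives a total on the order of 10^7 minimizations.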
Protein Top7.
✦ 93-residue protein with no significant BLAST matches.
✦ Modest changes in structure during optimization (RMSD 1.1 Å).
✦ Substantial changes in sequence:
Source: Supplementary Online Materials for Kuhlman et al., p. S1 (Brian Kuhlman and Gautam Dantas contributed equally to this work).
Energies and sequence for Top7 before and after alternating cycles of backbone and sequence optimization.
before DIEITVRINNNGEDYDYKKTATTLSEINAHFEELEKHLKEENGEKITISVKLRNEKEAYWVAAKIKEQALRAGVETIQIDKQSDTMTATLGKQ
after  DIQVQVNIDDNGKNFDYTYTVTTESELQKVLNELKDYIKKQGAKRVRISITARTKKEAEKFAAILIKVFAELGYNDINVTFDGDTVTVEGQLE
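The extent of the sequence change can be quantified by comparing the two sequences position by position. The sketch below uses the "before" and "after" sequences exactly as listed above; the helper name `fraction_identical` is hypothetical, and no gaps or alignment are considered because both sequences have the same length.

```python
before = ("DIEITVRINNNGEDYDYKKTATTLSEINAHFEELEKHLKEENGEKITISVKLRNEKEAYW"
          "VAAKIKEQALRAGVETIQIDKQSDTMTATLGKQ")
after = ("DIQVQVNIDDNGKNFDYTYTVTTESELQKVLNELKDYIKKQGAKRVRISITARTKKEAEK"
         "FAAILIKVFAELGYNDINVTFDGDTVTVEGQLE")

def fraction_identical(a, b):
    """Fraction of positions carrying the same residue in two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

identity = fraction_identical(before, after)
changed_positions = [i + 1 for i, (x, y) in enumerate(zip(before, after)) if x != y]
print(f"{identity:.0%} identical, {len(changed_positions)} of {len(before)} positions changed")
```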
Table S1. Energies for Top7 before and after iterative cycles of backbone and sequence optimization (kcal/mol). Expected Lennard-Jones energies are derived from the average Lennard-Jones energy for each of the twenty amino acids for different degrees of burial.
Energy                          Top7 before relaxation   Final Top7 model
Lennard-Jones (LJ) attractive   -370                     -385
Top7 has favorable energies.
✦ Dramatic decrease in Lennard-Jones repulsive energy during optimization.
✦ Modest changes in the other energy terms; hydrogen bonding becomes slightly less favorable.
Energy (kcal/mol)          Starting Model   Final Model
Lennard-Jones attractive   -370             -385
Lennard-Jones repulsive    28               8.6
Hydrogen bonding           -89              -80
Solvation energy           188              175
Total energy               -324             -386
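The change in each term can be read off the table with a few lines of arithmetic. The values are copied from the table above; the variable names are illustrative. Note that the four listed terms do not add up to the listed totals, so the total presumably includes terms that are not shown on the slide.

```python
# (starting model, final model) in kcal/mol, copied from the table.
energies = {
    "Lennard-Jones attractive": (-370.0, -385.0),
    "Lennard-Jones repulsive": (28.0, 8.6),
    "Hydrogen bonding": (-89.0, -80.0),
    "Solvation energy": (188.0, 175.0),
    "Total energy": (-324.0, -386.0),
}

# Negative change = term became more favorable during optimization.
changes = {term: final - start for term, (start, final) in energies.items()}

listed = [t for t in energies if t != "Total energy"]
listed_start = sum(energies[t][0] for t in listed)
listed_final = sum(energies[t][1] for t in listed)

for term, delta in changes.items():
    print(f"{term}: {delta:+.1f}")
print(f"Sum of listed terms: {listed_start:.1f} -> {listed_final:.1f}")
```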
α/β topology.
Very thermostable.
Unfolds cooperatively.
Denatures in cold.
Typical heat capacity of unfolding.
Typical chemical denaturation.
More stable than most proteins of similar size.
Suggests mixed α helix and β sheet character.
Highly soluble.
Monomeric.
Crystals diffract to 2.5 Å.
High backbone similarity overall.
Comparing Top7 model backbone with crystal structure. Model is in blue, actual structure is in red.
Very high similarity at the C-terminus.
Conclusion.
✦ Design of heretofore unknown protein folds is possible.
These may yet exist in nature, or they may not.
Next step: larger proteins with multiple folds?
✦ Alternation of sequence design and structure optimization is a powerful method.
Allows more flexibility in design of each.
May be applicable to purposeful design (i.e., design for specific function).
May be applicable to ab initio structure prediction, but high sequence variability would be a problem.
✦ New direction for designer proteins.
No longer limited to known structures.
Appears to be fairly efficient design and optimization process.