MARKOV STATE MODELS FOR PROTEIN AND RNA FOLDING
A DISSERTATION
SUBMITTED TO THE PROGRAM IN BIOPHYSICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Gregory R. Bowman
July 2010
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/ky974bm1455
© 2010 by Gregory Ross Bowman. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Vijay Pande, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Russ Altman
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Daniel Herschlag
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
ABSTRACT
Understanding the molecular bases of human health could greatly augment our ability
to prevent and treat diseases. For example, a deeper understanding of protein folding
would serve as a reference point for understanding, preventing, and reversing protein
misfolding in diseases like Alzheimer’s. Unfortunately, the small size and tremendous
flexibility of proteins and other biomolecules make it difficult to simultaneously
monitor their thermodynamics and kinetics with sufficient chemical detail. Atomistic
Molecular Dynamics (MD) simulations can provide a solution to this problem in some
cases; however, they are often too short to capture biologically relevant timescales
with sufficient statistical accuracy. We have developed a number of methods to
address these limitations. In particular, our work on Markov State Models (MSMs)
now makes it possible to map out the conformational space of biomolecules by
combining many short simulations into a single statistical model. Here we describe our
use of MSMs to better understand protein and RNA folding. We chose to focus on
these folding problems because of their relevance to misfolding diseases and the fact
that any method capable of describing such drastic conformational changes should
also be applicable to less dramatic but equally important structural rearrangements like
allostery. One of the key insights from our folding simulations is that protein native
states are kinetic hubs. That is, the unfolded ensemble is not one rapidly mixing set of
conformations. Instead, there are many non-native states that can each interconvert
more rapidly with the native state than with one another. In addition to these general
observations, we also demonstrate how MSMs can be used to make predictions about
the structural and kinetic properties of specific systems. Finally, we explain how
MSMs and other enhanced sampling algorithms can be used to drive efficient
sampling.
ACKNOWLEDGMENTS
Thanks to my family and my God for giving me the passion, intellect, and opportunity
to do this work. It is difficult to imagine life without the love, support, and training my
parents, brother, and wife have given me. Graduate school—and life in general—have
been much more enjoyable with the companionship of my beautiful wife Angela.
Thanks to my advisor, Vijay Pande, for being such a superb guide, for creating
such an intellectually invigorating environment, and for being so generous with
resources of all kinds. My lab-mates have also been great. I’m especially indebted to
Xuhui Huang for helping to jump-start my progress by working so closely with me
during my rotation and the early years of my PhD. Sergio Bacallado, Kyle
Beauchamp, John Chodera, Dan Ensign, Imran Haque, Peter Kasson, Yu-Shan Lin,
Paul Novick, and Vince Voelz were all great collaborators. Thanks to Jason Wagoner
and Del Lucent for all the conversations about science, religion, politics, and
philosophy.
Thanks to my committee members, Russ Altman and Dan Herschlag, for
making time to help me along the way. Dan was especially generous in including me
in his group and getting me into the wet-lab. Seb Doniach has also been like a
co-advisor.
Table of Contents
List of tables ........................................................................................................................x
List of figures .....................................................................................................................xi
Introduction .........................................................................................................................1
Chapter 1: Using generalized ensemble simulations and Markov state models to
identify conformational states .......................................................................................6
Abstract..........................................................................................................................6
Introduction ...................................................................................................................6
Description of Method.................................................................................................10
Conclusions .................................................................................................................17
Chapter 2: Progress and challenges in the automated construction of Markov state
models for full protein systems ...................................................................................19
Abstract........................................................................................................................19
Introduction .................................................................................................................20
Materials & Methods...................................................................................................24
Results & Discussion...................................................................................................29
Conclusions .................................................................................................................45
Chapter 3: Molecular simulation of ab initio protein folding for a millisecond
folder NTL9(1-39).......................................................................................................47
Abstract........................................................................................................................47
Introduction .................................................................................................................48
Materials & Methods...................................................................................................48
Results & Discussion...................................................................................................49
Conclusions .................................................................................................................55
Chapter 4: Protein folded states are kinetic hubs ..............................................................56
Abstract........................................................................................................................56
Introduction .................................................................................................................57
Results & Discussion...................................................................................................59
Conclusions .................................................................................................................71
Materials & Methods...................................................................................................73
Chapter 5: Atomistic folding simulations of the five helix bundle protein
Lambda_{6-85} ..............................................................................................................75
Abstract........................................................................................................................75
Introduction .................................................................................................................76
Results & Discussion...................................................................................................78
Conclusions .................................................................................................................84
Chapter 6: Enhanced modeling via network theory: Adaptive sampling of Markov
state models .................................................................................................................86
Abstract........................................................................................................................86
Introduction .................................................................................................................86
Theoretical Underpinnings ..........................................................................................89
Results & Discussion...................................................................................................96
Conclusions ...............................................................................................................107
Chapter 7: Simulated tempering yields insight into the low-resolution Rosetta
scoring functions .......................................................................................................108
Abstract......................................................................................................................108
Introduction ...............................................................................................................109
Methods .....................................................................................................................111
Results .......................................................................................................................119
Discussion..................................................................................................................128
Conclusions ...............................................................................................................132
Chapter 8: The roles of entropy and kinetics in structure prediction ..............................133
Abstract......................................................................................................................133
Introduction ...............................................................................................................134
Results & Discussion.................................................................................................136
Conclusions ...............................................................................................................144
Materials & Methods.................................................................................................145
Chapter 9: Structural insight into RNA hairpin folding intermediates............................148
Abstract......................................................................................................................148
Introduction ...............................................................................................................148
Results & Discussion.................................................................................................150
Conclusions ...............................................................................................................156
Chapter 10: Rapid equilibrium sampling initiated from non-equilibrium data ...............157
Abstract......................................................................................................................157
Introduction ...............................................................................................................158
Results & Discussion.................................................................................................162
Conclusions ...............................................................................................................170
Materials & Methods.................................................................................................171
Appendix A: Estimating transition matrices and equilibrium distributions....................172
Appendix B: The possibility of longer timescales than the implied timescales..............175
Appendix C: Supporting information for chapter 3 ........................................................177
Molecular dynamics simulation ................................................................................177
Markov State Model (MSM) construction ................................................................178
Transition Pathway Theory (TPT) analysis...............................................................178
Structural analysis of macrostate ensembles .............................................................179
Analysis of states along folding pathways: comparison between secondary
structure formation and reaction progress (p_{fold}) ............................................180
How does NTL9 fold in our simulations? .................................................................181
Appendix D: Supporting information for chapter 4 ........................................................187
Villin MSM ...............................................................................................................187
Simple models ...........................................................................................................191
Appendix E: Supporting information for chapter 5.........................................................199
Simulation Details .....................................................................................................199
MSM Construction and Analysis ..............................................................................199
Appendix F: Supporting information for chapter 6.........................................................210
Appendix G: Supporting information for chapter 9 ........................................................211
Serial Replica Exchange (SREMD) ..........................................................................211
Simulation Details .....................................................................................................211
Topological Method (Mapper) for Pathway Analysis...............................................212
PEDFs........................................................................................................................213
Melting Curves ..........................................................................................................214
Appendix H: Supporting information for chapter 10 ......................................................216
Initial Configurations.................................................................................................216
The Convergence of Weights in Simulated Tempering (ST)....................................216
Molecular Dynamics (MD) Simulation Details ........................................................220
Hierarchical K-medoids clustering algorithm ...........................................................220
Markov State Models ................................................................................................221
A simple model of non-Arrhenius, metastable dynamics .........................................226
Bibliography ....................................................................................................................236
LIST OF TABLES
Number Page
Table 1. Exponential fits, MFPTs, and lag phases (all in units of steps) for
transitioning from the unfolded state(s) to the native state in the three
simple models. ...............................................................................................198
Table 2. Convergence of the weights is shown for representative temperatures Δg
= g_j − g_i obtained from distributed computing simulations starting from
a helical structure (third column) and a coil structure (fourth column) at
different temperature pairs. Differences between free energy
differences Δf_{ji} = g_j/β_j − g_i/β_i obtained from simulations starting from a
helical structure and a coil structure are displayed in the fifth column.
kT at temperature i is shown in the sixth column. Δf_{ji}(helical) −
Δf_{ji}(coil) (kJ/mol) is smaller than kT (kJ/mol) at all temperature pairs. .....221
Table 3. Metastability (Q) and average self-transition probability <P_{ii}> between
metastable states for the MSMs built from ST simulations and seeding
simulations.....................................................................................................225
LIST OF FIGURES
Number Page
Figure 1. Schematic of the steps required for building an MSM and obtaining
representative conformations for each state. First, GE data represented
by points are grouped into microstates represented by circles, with
darker circles for more highly populated microstates. Kinetically
related microstates are then lumped together into macrostates, or
metastable states, represented by amorphous shapes. Finally,
representative conformations are obtained by extracting the most
probable conformation from each macrostate. ................................................10
Figure 2. Implied timescales as a function of the lag time. There are two probable
gaps in the implied timescales. If gap one were selected then a
macrostate MSM with four states would be constructed whereas if gap
two were selected a higher resolution MSM with six states would be
constructed.......................................................................................................14
Figure 3. Scatter plots of the free energy of each microstate (in kcal/mol) versus its
RMSD. A) The initial 10,000 state model, B) the 30,000 state model,
C) the final 10,000 state model, and D) the final 10,000 state model
except that the average RMSD across five structures in each state is
used instead of the RMSD of the state center..................................................31
Figure 4. Top ten implied timescales for the initial 10,000 state model. ..........................31
Figure 5. Three representative structures for A) the lowest RMSD state in the final
model and B) the most probable state in the final model overlaid with
the crystal structure (red). The phenylalanine core is shown explicitly
for each molecule. ...........................................................................................35
Figure 6. Top ten implied timescales for the final model. A) The implied
timescales at intervals of one ns. B) The implied timescales with error
bars obtained by doing five iterations of bootstrapping at an interval of
five ns. .............................................................................................................38
Figure 7. The average RMSD of each state in the final model versus its left
eigenvector component in the longest timescale transition showing that
this transition corresponds to folding. .............................................................39
Figure 8. Comparison between the time evolution of the native population in the
MSM (blue) and the raw data (black) for the entire dataset. The error
bars represent the standard error......................................................................40
Figure 9. Comparison between the time evolution of the RMSD in the MSM
(blue), the reduced representation (yellow), and the raw data (black) for
A) an example of good agreement and B) an example of the worst case
scenario. The error bars represent one standard deviation in the RMSD. .......42
Figure 10. Improved agreement between the MSM and raw data for the example of
poor agreement from Figure 9B obtained by building the transition
probability matrix from simulations started from this starting structure
alone. The error bars represent one standard deviation in the RMSD.............44
Figure 11. (a) Distributions of RMSD-Cα for native-state simulations of NTL9(1-
39) after 10 µs. The arrows indicate thresholds defined for the native
basin at 3.5Å and 4Å. (b) The number of parallel simulations M(t)
started from unfolded states at 370K that reach time t. (c) Posterior
predictions of the folding rate given the amount of simulation time and
observed folding events for 3.5Å (dashed) and 4Å (solid) thresholds,
using uniform (black) and Jeffreys (gray) priors, using methods from
(85). In red is a Gaussian distribution representing the experimental rate
mean and standard deviation. ..........................................................................50
Figure 12. (a) A snapshot from a folding trajectory (dark blue) achieves an RMSD-Cα
of 3.1Å compared to the native state (cyan). (b) Non-native (top)
and native-like (bottom) hydrophobic core arrangements observed in
low-RMSD conformations of folding trajectories. Highlighted are
sidechains of residues F5 (magenta), V3,V9,V21 (tan), and L30,L35
(pink). ..............................................................................................................51
Figure 13. A 2000-state Markov State Model (MSM) was built using a lag time of
12 ns. Shown is the superposition of the top 10 folding fluxes,
calculated by a greedy backtracking algorithm (see Appendix C). These
pathways account for only about 25% of the total flux, and transit only
14 of the 2000 macrostates (shown labeled a-n, for convenient
discussion). The visual size of each state is proportional to its free
energy, and arrow size is proportional to the inter-state flux. .........................52
Figure 14. The 14 macrostates involved in the top ten folding pathways, plotted
along structural and kinetic reaction coordinates. The balance between
native-like helix and sheet structure is quantified by Q_α – (Q_{β12} +
Q_{β13})/2 (vertical axis), and progress along the folding reaction is
quantified by the p_{fold} (committor) value (horizontal axis). It can be
seen that the “unfolded” state (a) contains residual native-like helical
propensity, and that pathways involving various ordering of native-like
helix and sheet formation are possible. ...........................................................54
Figure 15. Q-values, which capture the extent of native-like structures, plotted
versus p_{fold} (committor) values. The lines are to guide the eye. .......................54
Figure 16. Three representative networks each having unfolded state(s) (U and
U_i), intermediates (I_i), and a native state (N). S has a single pathway, P
has parallel pathways, and H has a heterogeneous unfolded state. .................61
Figure 17. Distributions of the first folding times for the simple networks S, P, and
H are shown in panels A, B, and C respectively. The blue lines are
exponential fits to the data after the initial lag phase. .....................................62
Figure 18. Relaxation of villin from 500 state model. Distributions of the MFPTs
from (A) unfolded states to the native state and (B) between unfolded
states. (C) Relaxation kinetics with a 10:1 signal-to-noise ratio (black
curve with Gaussian noise) and a single exponential fit (blue curve with
τ≈810 ns). ........................................................................................................64
Figure 19. Schematic diagrams of funnel and native hub models having unfolded
states (U), intermediates (I), and native states (N). (A) A network
description of a folding funnel with nodes corresponding to individual
conformations and a bottleneck near the native state. (B) A native hub
model with metastable nodes. The size of each node in (B) is correlated
with its equilibrium probability and the connectivity falls off as one
moves away from the native state. ..................................................................67
Figure 20. Distance between the final villin MSM and MSMs constructed from
subsets of the data (varying trajectory length and number of
trajectories). Distance is measured by a relative entropy metric (see
Appendix D for details). Black lines are contours of equal amounts of
data. No data was available for the upper-right portion of the graph..............70
Figure 21. (A) The crystal structure of the λ_{1-92} dimer bound to DNA (PDB code
1LMB). (B) A model of λ_{6-85} with the Trp22-Tyr33 pair monitored in
T-jump experiments space-filled. ....................................................................77
Figure 22. One of the 10 millisecond timescale pathways labeled with p_{fold} values
(the probability of reaching state H before state A). .......................................80
Figure 23. The 500 most populated macrostates with sizes proportional to their
free energies and connections between states if transitions between
them occurred in our simulations. The native state (green state with
green connections) is a hub. The crystallographic state from Figure 22H
is blue, the compact β-sheet state from Figure 22A is red, and the
remaining states are yellow. All of these states have smaller
equilibrium populations and fewer connections than the native state. ............83
Figure 24. Distributions of mean first passage times (MFPTs) between sets of
microstates (A) without weighting the distribution and (B) weighting
each MFPT by the equilibrium probability of the starting state. The
solid line is the distribution of MFPTs from non-native to native
microstates and the dashed line is the distribution of MFPTs between
non-native states. The average MFPT from non-native states to native
ones is about 10 times faster than that between non-native states in (A)
and the difference is even greater in (B). Native microstates were
defined as those in the most populated macrostate. All other microstates
were considered non-native. ............................................................................83
Figure 25. Scaling for adaptive sampling of villin as the number of parallel
simulations (N) used during each round is varied. (A) Wall-clock time
scaling as N is varied. The black line is a best fit to the linear portion of
the data (circles), which extends up to 5,000 simulations per iteration.
(B) Computer time required to achieve a given model quality (relative
entropy) for various sampling schemes. L refers to one long trajectory
and the numbers refer to the number of parallel simulations used in
each iteration of adaptive sampling. All results come from averaging
over ten independent runs. Each step equates to 15 ns....................................98
Figure 26. (A) The two models, S and P. (B) Distance from the true model
(measured via the relative entropy) as a function of wall-clock time for
adaptive sampling versus one long simulation of S (assuming 5
steps/day to mimic 5 nanoseconds/day in protein folding simulations).
The lines are one long simulation (dashed line) and adaptive sampling
with 10 simulations of 20 steps (solid line), 10 simulations of 200 steps
(dotted line), 100 simulations of 20 steps (dash-dot line), and 1000
simulations of 20 steps (black squares) per iteration.....................................100
Figure 27. Relative entropy (top) and free energy of each state in kcal/mol
(bottom) as a function of the adaptive sampling iteration on model S..........102
Figure 28. Distance from the true model (measured via the relative entropy) as a
function of the number and length of simulations averaged over 10
independent samples. (A) Reference distribution for S, (B) adaptive
sampling of S, (C) reference distribution for P, and (D) adaptive
sampling of P. All simulations for the reference distributions started
from state 1. The first 10 simulations for adaptive sampling started
from state 1 and subsequent batches of simulations started from the
state contributing most to uncertainty in the slowest process. Black
lines are contours of equal amounts of data. .................................................103
Figure 29. Scaling for adaptive sampling of our simple models as the number of
parallel simulations (N) used during each round is varied. (A) and (B)
Wall-clock time scaling as N is varied for simple models S and P
respectively. The black line is a best fit to the linear portion of the data
(circles). (C) and (D) Computer time required to achieve a given model
quality (relative entropy) for various sampling schemes applied to S
and P respectively. L refers to one long trajectory and the numbers refer
to the number of parallel simulations used in each iteration of adaptive
sampling. All results come from averaging over ten independent runs. .......105
Figure 30. Flow chart showing the order the scoring functions are used in and
giving brief descriptions of each. After score5, Rosetta returns to
score2 five times before progressing to score3. The first six scoring
functions constitute the low-resolution de novo structure prediction
phase. .............................................................................................................113
Figure 31. Score versus RMSD (Å) for an SH3 domain (PDB code 1shf). Each
diamond represents the lowest scoring structure for a single run. Data
for ST is shown in blue while data for standard Rosetta is shown in red.
The black “+” symbols represent models obtained by idealizing and
relaxing the crystal structure in low-resolution mode. ..................................120
Figure 32. Score versus RMSD (Å) for protein G (PDB code 1igd). Each diamond
represents the lowest scoring structure for a single run. Data for ST is
shown in blue while data for standard Rosetta is shown in red. Panel
(A) shows results from the low-resolution phase. The black “+”
symbols represent models obtained by idealizing and relaxing the
crystal structure in low-resolution mode. Panel (B) shows results from
the full-atom phase. The yellow circles represent models obtained by
idealizing and relaxing the crystal structure in full-atom mode. The
black “*” symbols are full-atom models obtained by relaxing the
low-resolution structures depicted by “+” symbols in (A) using the
full-atom scoring functions. .................................................................................121
Figure 33. Evolution of the score4 weights for protein G. The dashed line is the
difference between the weights of the highest two temperatures: 10 and
20 kT. The solid line is the difference between the weights of the
lowest two temperatures: 0.1 and 0.25 kT. The first points come from
constant temperature runs and subsequent points represent each
iteration of refining the weights. Δg = g_j − g_i, where j > i. ...............................123
Figure 34. Projections of the free energy landscape onto score versus RMSD (Å)
for protein G in score4 using: (A) standard Rosetta runs starting from
an extended chain, (B) standard Rosetta runs starting from the native
state, (C) ST runs at 0.1 kT starting from an extended chain, (D) ST
runs at 0.1 kT starting from the native state, (E) ST runs at 2 kT starting
from the native state, (F) ST runs at 20 kT starting from the native
state. Each white plus-sign corresponds to the lowest scoring structure
for a single run. The lowest scoring structures from each run were
sorted by RMSD and only every twentieth point is shown so as to give
the entire range without obscuring the underlying plot.................................124
Figure 35. Projections of the free energy landscape onto score versus RMSD (Å)
for protein G. Each white plus-sign corresponds to the lowest scoring
structure for a single run. The lowest scoring structures from each run
were sorted by RMSD and only every twentieth point is shown so as to
give the entire range without obscuring the underlying plot. (A), (D),
(G), and (J) show data from standard Rosetta runs with frequent
recovery of the lowest scoring structure in score1, score2, score5, and
score3 respectively. (B), (E), (H), and (K) show data from standard
Rosetta runs without frequent recovery of the lowest scoring structure
in score1, score2, score5, and score3 respectively. (C), (F), (I), and (L)
show data from ST runs at 0.1 kT without frequent recovery of the
lowest scoring structure in score1, score2, score5, and score3,
respectively....................................................................................................127
Figure 36. Time evolution of the Cα RMSD of the current umbrella center for five
representative simulations demonstrating the presence of reversible
folding............................................................................................................137
Figure 37. Average energy (<∆E>), conformational entropy (<∆S>), and free
energy (<∆F>) as a function of Cα RMSD for protein G and engrailed
homeodomain (EH). ......................................................................................138
Figure 38. Average free energies (<∆F>) as a function of Cα RMSD for
temperatures of 0.5 and 0.1 for protein G and engrailed homeodomain
(EH). The black lines are the hypothesized free energy at the given
temperature and the dash-dot lines are the free energy at temperature
0.8 shown for reference. ................................................................................140
Figure 39. (A) The native structure of protein G and (B) the 5.7 Å starting
structure used for comparing the ST and Standard Rosetta variants.............142
Figure 40. Distribution of the minimum Cα RMSD values reached by 100
Simulated Tempering (ST) and 100 standard Rosetta runs started from
a 5.7 Å structure. Results for both the low temperature and standard
Rosetta variants were identical so only a single plot is shown......................142
Figure 41. Relative magnitude of the average hydrogen bonding energy (solid line)
versus the total average energy (dash-dot line) as a function of Cα
RMSD for protein G and engrailed homeodomain (EH). .............................143
Figure 42. (A) NMR structure of the GCAA tetraloop. (B) Contact map for the
native state. Bases are numbered from 5’ to 3’ and native base-pair
contacts (dotted lines) are numbered 1-4.......................................................149
Figure 43. The probability of a given number of native contacts during (A)
unfolding and (B) refolding. (C) The probability of each contact when a
given number of contacts are present during unfolding and refolding
with the arrows representing the direction of movement between the
unfolded state (U) and the folded state (F). ...................................................153
Figure 44. Contact maps representing the cluster centers from independent
clustering of the unfolding (A) and refolding data (B). The grey lines
represent the connectivity of the states. The blue lines represent native
contacts with a probability of 0.6 or greater within the cluster.
Intermediate structures are labeled A-D........................................................153
Figure 45. Representative full-atom structures for the intermediate states with
labels (A)-(D) corresponding to the labels A-D in Figure 3. ........................155
Figure 46. A schematic free energy landscape with three representative seeding
trajectories started from each basin and a projection of this free energy
landscape onto a 2D plane showing the division into metastable states. .......161
Figure 47. Schematic of the adaptive seeding scheme. The top arrow represents
our ST trajectories, which are split into equilibration (green) and
production (light blue) phases. The light red and light yellow boxes
encompass our long and short adaptive seeding schemes respectively.
For each adaptive seeding scheme, the dotted lines demark the portion
of the ST data used to identify the dominant thermodynamic, or
metastable, states by building an MSM (S). Constant temperature (or
canonical, NVT) simulations are then started from each state and used
to build a new MSM (E) that captures the equilibrium distribution.
Both the light yellow and red boxes also encompass a portion of the
original ST data that is equivalent to the amount of sampling used in
the adaptive seeding scheme. An MSM is also built for this data and
used as a baseline for judging the efficiency of the adaptive seeding
scheme. ..........................................................................................................163
Figure 48. Population of each state (bar graphs correspond to the mean values, and
error bars stand for standard deviations) for (A) the long adaptive
seeding scheme (lag time t=4.5 ns) and (B) the short adaptive seeding
scheme (lag time t=4.5 ns).............................................................................165
Figure 49. Population of each state for the long adaptive seeding scheme as the lag
time is varied. ................................................................................................166
Figure 50. Representative structure for each of the six metastable states. The
numbering is the same as in Figures 48 and 49.............................................170
Figure 51. Graph depiction of the model system defined in Appendix B with edges
labeled by A) their probability and B) their average timescale under a
two-state assumption. ....................................................................................176
Figure 52. (a) Implied timescales for a series of 100,000-microstate Markov State
Models (MSMs) built at lag times between 1 and 32 ns. As the longest
timescale levels off beyond a lag time of 10 ns, a lag time of 12 ns was
chosen to build subsequent MSMs. The spectral gap present at all lag
times indicates apparent two-state folding kinetics. (b) The implied
timescales for a 2000-macrostate model built by lumping states from
the microstate MSM show a similar spectral gap and leveling off of
time scales. The faster implied timescales of the macrostate model at
short lag times are due to lumping effects. (c) The 10 slowest implied
timescales for the 2000 state models, with error analysis from a
bootstrapping procedure. Error bars represent the standard deviation
from the bootstrap analysis............................................................................183
Figure 53. A scatter plot of the 2000 macrostates obtained by lumping the
100,000-state MSM calculated from the simulation data at 370K. The
RMSD-to-native is calculated using the peptide backbone residues,
with respect to the native starting state. The free energy of each
microstate i is computed as –kT ln (p_i/p_0), where p_i is the equilibrium
probability of the microstate, and p_0 is an arbitrary reference (in this
case, max(p_i)). Shown in red are the 14 macrostates transited by the top
ten pathway fluxes, labeled with the same letters as in Figure 13. In this
mesoscopic view, we find that 1) the macrostates are diffuse collections
of conformational states, 2) there are multiple folding pathways along
these metastable states, and 3) we can identify highly populated
“native” (state n) and “unfolded” (state a) macrostates that dominate
the observed relaxation rates. The red arrow is meant to guide the eye in
illustrating a “mesoscopic” view of the transition state barrier: the
“unfolded” state (a) and “native” state (n) are at free energy minima,
while intermediate RMSD values have macrostates with higher free
energies..........................................................................................................184
Figure 54. Contact profile subspaces used to calculate Q_12, Q_13, and Q, which
quantify the extent of native-like structuring for beta-strand 1 and 2
pairing, beta-strand 1 and 3 pairing, and helix formation,
respectively....................................................................................................184
Figure 55. Here, contact profiles (see definition above) for the 14 macrostates
involved in the top ten folding pathways are plotted in a similar fashion
to Figure 55. For clarity, the pathway arrows have been removed. Each
contact profile is a 39 x 39 matrix of inter-residue contacts, showing the
contact fraction on a linear grayscale from 0 (white) to 1 (black). ...............185
Figure 56. Here, values of Q_12 (yellow), Q_13 (red), and Q (blue) are plotted in a
bar graph for each of the 14 macrostates involved in the top ten folding
pathways. The layout is in a similar fashion to Figure 56.............................185
Figure 57. Macrostates l, m and n (the “native” state) have very similar structural
ensembles and similar pfold values (pfold > ~0.93). To examine the
subtle differences in their macrostate contact profiles, we computed
difference contact profiles for (l-m), (n-l) and (n-m) transitions. These
difference maps reveal that these states differ mostly in their hairpin
registrations and packing of the hairpin loop. ...............................................186
Figure 58. Implied timescales for the villin macrostate MSM........................................194
Figure 59. Distribution of MFPTs between all pairs of non-native states for villin
(A) on a linear scale to demonstrate the peak does not shift significantly
relative to the distribution shown in Figure 18B and (B) on a log scale
to highlight that the tail of the distribution does extend to about 60 ns. .......194
Figure 60. Distributions of the MFPTs (A) from each non-native state to the native
state and (B) between every pair of non-native states for our 2,000 state
NTL9(1-39) model. As discussed in Ref (93), further refinement of
this model is likely necessary. However, we do not expect the
qualitative trend of long timescales (relative to folding) for
transitioning between unfolded states to change. ..........................................195
Figure 61. Two conformations from different unfolded basins demonstrating the
structural heterogeneity of non-native states (especially in their non-
native contacts) that, in combination with the vastness of
conformational space, result in slow transitions between unfolded
states. The structures are colored red to blue from the N-terminus to
the C-terminus. Atoms for residues Arg 14, Trp 23, and Lys 32 are
shown to highlight that 23 and 32 are in contact on the left while the
chain has rearranged such that 14 and 32 are in contact on the right.
These images were made with VMD (67).....................................................195
Figure 62. Relaxation of the fraction folded starting from equally populated
unfolded states (black is data and blue is single exponential fit with
τ≈810 ns). The beginning of the curve is dominated by single
exponential relaxation but deviations from this apparent two-state
behavior become apparent later.....................................................................196
Figure 63. Relaxation of the fraction unfolded for a villin model at the microstate
level (thick black line) and a biexponential fit (thin blue line) with time
constants of ~60 and ~415 ns, at least qualitatively consistent with time
constants of ~70 and ~720 ns from experiment (56). We hope to
explain this behavior in a future work on villin. As in Ref. (4), the
native state was defined as all microstates with an average Cα RMSD to
the crystal structure less than 3 Å..................................................................197
Figure 64. The distance to the gold-standard model, measured via the relative
entropy, for 40,000 trajectories up to 400 nanoseconds in length. The
black lines are contours of equal amounts of data. Again, there was
insufficient data to resolve the upper right-hand corner of the plot. .............198
Figure 65. Implied timescales for the full 370 K dataset. ...............................................202
Figure 66. Implied timescales for the 300 K dataset. ......................................................202
Figure 67. Implied timescales for ¾ of the 370 K dataset selected at random................203
Figure 68. A coarse-grained view of the slowest transition with state sizes
proportional to the free energy and arrow widths proportional to the
flux (see key in figure). .................................................................................203
Figure 69. Another coarse-grained view of the slowest transition with state sizes
proportional to the free energy and arrow widths proportional to the
flux (see key in figure). Here the states are laid out in terms of the
average number of β-sheet residues (calculated from 100 random
conformations from each state) and the pfold (probability of reaching the
crystallographic state in L before the compact β-sheet state in A)................204
Figure 70. Free energy projections of the microstate MSM onto typical order
parameters like the radius of gyration (Rg), the Cα RMSD to the crystal
structure, and the distance between the Trp22 and Tyr33 residues.
Differences between the two panels highlight the difficulty in
interpreting such projections. ........................................................................206
Figure 71. Free energy projection of the microstate MSM onto Pfold and the
distance between the Trp22 and Tyr33 residues. Obtaining projections
onto kinetic order parameters like Pfold is greatly simplified with
MSMs. In this case Pfold refers to the probability of reaching the
crystallographic state before reaching the compact β-sheet state (i.e. the
slow transition from Figure 21). Unlike the projections in Figure 70, this one
hints that D14A may not be well described by a simple two- or three-
state model or that the Trp22-Tyr33 distance is not a good reaction
coordinate, since there are a broad range of Pfold values possible for a
given Trp-Tyr distance. Indeed, analysis of the MSM reveals that
D14A is best described by a native hub. .......................................................206
Figure 72. The ten most populated macrostates with their equilibrium probabilities. ....207
Figure 73. Relaxation of the fraction unfolded with different observables and
observation times. The thick black curves come from the MSM and the
thin blue curves from biexponential fits to the MSM relaxation. The top
row shows relaxation of the fraction unfolded measured by the Trp22-
Tyr33 distance (A) starting from all states being equally populated and
(B) starting from all non-native states being equally populated. The
bottom row shows relaxation of the fraction unfolded measured by the
Cα RMSD to the crystal structure (C) starting from all states being
equally populated and (D) starting from all non-native states being
equally populated. Fitting parameters are given in the figure (in units of
microseconds). In this case, the fitting parameters are relatively
independent of the observable and starting distribution................................207
Figure 74. Relaxation of the fraction unfolded with different observables and
observation times from an MSM built without the trajectories started
from β-sheet structures. The thick black curves come from the MSM
and the thin blue curves from biexponential fits to the MSM relaxation.
The top row shows relaxation of the fraction unfolded measured by the
Trp22-Tyr33 distance (A) starting from all states being equally
populated and (B) starting from all non-native states being equally
populated. The bottom row shows relaxation of the fraction unfolded
measured by the Cα RMSD to the crystal structure (C) starting from all
states being equally populated and (D) starting from all non-native
states being equally populated. Fitting parameters are given in the
figure (in units of microseconds). In this case the fitting parameters are
more dependent on the observable, consistent with the experimental
observation of probe dependent kinetics. ......................................................208
Figure 75. Projection of the free energy onto pfold (A) from the compact β-sheet
state in Figure 22A to the native state in Figure 22H, (B) from the
extended state in Figure 22E to the native state in Figure 22H, and (C)
from the extended state in Figure 22E to the native state in Figure 22G.
None are purely downhill, though some may be consistent with
incipient downhill folding (i.e. have sufficiently low barriers that there
is a reasonable population at the barrier top that can fold in a downhill
manner in addition to activated folding across the barrier). ..........................209
Figure 76. The helicity of each residue predicted from Agadir (143). The purple,
numbered bars show where the five helices are (the extra purple block
between helices 4 and 5 is a turn)..................................................................209
Figure 77. Uncertainty in the log base 10 of the relative entropies averaged over 10
independent samples of (A) reference simulations of M1 and (B)
adaptive sampling of M1. Black lines are contours of equal amounts of
data. ...............................................................................................................210
Figure 78. Uncertainty in the log base 10 of the relative entropies averaged over 10
independent samples of (A) reference simulations of M2 and (B)
adaptive sampling of M2. Black lines are contours of equal amounts of
data. ...............................................................................................................210
Figure 79. (a) Potential Energy Distribution Functions (PEDFs) generated from
Folding@home data at each of the 56 temperatures used. (b) The
convergence measure averaged over all temperatures as a function of
time. Triangles correspond to using P_final as the reference distribution
and circles correspond to using P_initial as the reference. ................................214
Figure 80. Native contacts melting curve. Only every third temperature is
displayed for clarity. ......................................................................................215
Figure 81. The two initial structures used in this study: A) A near-native
conformation and B) a random coil conformation. .......................................216
Figure 82. Amount of sampling at different temperatures for ST simulations
started from the native (top row) and coil (bottom row) configurations,
computed from different segments of simulation time: 0-0.3 ns, 1.2-1.5
ns, 2.7-3.0 ns, and 8.7-9.0 ns. Uniform sampling is
reached for both sets of ST simulations, indicating that the weights are
converged. .....................................................................................................220
Figure 83. Three example structures from a single microstate. ......................................224
Figure 84. The largest one hundred implied timescales as a function of the lag time
for (a) ST simulations starting from the coil initial configuration. (b)
The long adaptive seeding microstate MSM. ................................................225
Figure 85. Potential of Mean Force (PMF) for the simple potential at β (1/kT) values of
(a) 0.995, (b) 0.652, and (c) 0.456. In part (a), four metastable macrostates are
separated by the dashed black lines and labeled. ............................................228
Figure 86. Populations of four macrostates as a function of β=1/kT. ................................229
Figure 87. Folding (black) and unfolding (red) rates are plotted as a function of
β=1/kT. ..........................................................................................................230
Figure 88. Logarithms of the implied timescales as a function of β for the 2D
potential are displayed. The three slowest timescales are plotted using
up triangle, down triangle, and cross points respectively..............................231
Figure 89. Populations computed from Simulated Tempering (ST) simulations
for the four metastable states are plotted as a function of the length of
the simulation. The reference population is shown in the solid lines and
1000 trajectories are used for this calculation. The error bars are the
standard deviation obtained from bootstrapping 100 times with
replacement....................................................................................................232
Figure 90. Populations computed from the Adaptive Seeding Method (ASM) for the four
metastable states are plotted as a function of the length of the
simulation. The reference population is shown in the solid lines and
1000 trajectories are used for this calculation. The lag time is selected
as 1/3 of the length of the simulation. The error bars are the standard
deviation obtained from a Bayesian method (see section 2.5.3 for
details). ..........................................................................................................233
Figure 91. Populations computed from ASM simulations for four metastable states
as a function of lag time. ...............................................................................234
Figure 92. Number of steps taken to reach convergence as a function of the
number of trajectories....................................................................................235
INTRODUCTION
Molecular kinetics plays fundamental roles in human health and disease. For example,
conformational changes in the ribosome drive translation and many drugs work by
inducing allosteric conformational changes in G protein-coupled receptors. Many
neurological diseases, like Alzheimer’s, are also hypothesized to result from protein
misfolding. Therefore, a deeper understanding of molecular kinetics is crucial for our
ability to comprehend and control human health.
Protein folding is a classic grand-challenge in molecular biophysics because it
is such a dramatic example of molecular kinetics and has important medical
implications. With the recent discovery of structured RNAs, RNA folding has also
become of interest. Folding is the process by which a disordered chain of residues
(either amino acids or nucleotides) spontaneously self-assembles into a specific three-
dimensional shape. The fact that folding happens at all is astounding given the
enormous number of possible conformations a protein or RNA can adopt. For
example, a hypothetical protein with 100 residues, each of which could adopt two
conformations, could fold into over 1,000,000,000,000,000,000,000,000,000,000
different structures. If such a protein visited one conformation/second then reaching
all of them would take over 1,000,000,000,000 times longer than the age of the
universe. Moreover, real proteins have many more degrees of freedom and sometimes
many more residues. Despite this, proteins can often fold in a matter of milliseconds to
seconds and RNA folding is only moderately slower. Therefore, it is reasonable to
conclude that there must be one or more pathways guiding a biomolecule to its
native—or most probable—state. Because folding is such a dramatic conformational
change, any method that could map out the pathways by which protein and RNA
molecules fold would likely be a powerful means of understanding less drastic but
equally important structural rearrangements like allostery, all of which fall into the
general category of molecular kinetics. In addition, accurate models for protein folding
would serve as a reference point for understanding, preventing, and reversing
misfolding diseases.
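
The conformational-search estimate quoted above can be checked with a few lines of arithmetic. The sketch below uses the numbers from the text (100 residues, two conformations each, one conformation visited per second); the age of the universe (~13.8 billion years) is an assumed value that the text does not state.

```python
# Levinthal-style estimate using the numbers quoted in the text.
n_residues = 100            # hypothetical chain length from the text
states_per_residue = 2      # conformations available to each residue
n_conformations = states_per_residue ** n_residues

seconds_per_year = 365.25 * 24 * 3600
age_of_universe_s = 13.8e9 * seconds_per_year   # assumption: ~13.8 billion years

# Visiting one conformation per second, how many universe lifetimes are needed?
search_time_in_universe_ages = n_conformations / age_of_universe_s

print(f"{n_conformations:.3e} conformations")
print(f"{search_time_in_universe_ages:.3e} times the age of the universe")
```

Both results exceed the lower bounds given in the text (10^30 structures and 10^12 universe lifetimes).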
Many experimental techniques have been developed to probe folding.
Unfortunately, biomolecules are extremely sensitive to their underlying chemical
details and no current experimental method can simultaneously describe the atomic
details of a molecule’s thermodynamics and kinetics. For example, x-ray
crystallography can provide atomistic snapshots of a protein’s structure but gives little
information about its kinetics. FRET, on the other hand, can provide information about
a protein’s structure and dynamics by reporting on the changing distance between two
probes attached to a molecule but is blind to the rest of that molecule’s structure.
Heterogeneity also complicates the interpretation of much experimental data.
Molecular dynamics (MD) simulations are a powerful means of simultaneously
modeling a biomolecule’s thermodynamics and kinetics with atomic resolution. In an
MD simulation, one explicitly represents every atom and the bonds between them.
One can then iteratively update the position and velocity of each atom based on the
force exerted on it by the rest of the simulated system. The resulting trajectory is like a
movie taken by zooming in on a single protein (or some other biomolecule).
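
The iterative update described above can be illustrated with a minimal sketch. It assumes a velocity Verlet integrator (one common choice; the text does not name a specific scheme) and replaces the full atomistic force field with a single particle on a one-dimensional harmonic spring. The names velocity_verlet, spring_force, and total_energy are hypothetical and chosen only for this illustration.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate one particle in 1D and return (x, v) after every step."""
    trajectory = []
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
        trajectory.append((x, v))
    return trajectory


SPRING_CONSTANT = 1.0  # toy stand-in for a force field (assumption)


def spring_force(x):
    """Harmonic restoring force, F = -k x."""
    return -SPRING_CONSTANT * x


def total_energy(x, v, mass=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * SPRING_CONSTANT * x * x


traj = velocity_verlet(x=1.0, v=0.0, force=spring_force,
                       mass=1.0, dt=0.01, n_steps=1000)
print(total_energy(*traj[-1]))  # stays close to the initial energy of 0.5
```

With a small timestep the total energy stays close to its initial value, which is one reason the timestep must remain small relative to the fastest motions in the system.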
Unfortunately, MD has many of its own challenges. First and foremost among
these is the sampling problem. Atomistic MD simulations must take very small
timesteps (on the order of femtoseconds) to avoid unphysical phenomena like atoms
passing through one another. Therefore, a typical computer can only simulate ~5
nanoseconds/day even for a small protein and would take over 500,000 years to
simulate one second. In addition, molecular kinetics are stochastic, so generating a
single long simulation is inadequate for truly understanding processes like protein
folding. Instead, one must witness numerous events to characterize the entire
distribution of pathways by which they can occur. Moreover, even if one could run a
sufficient number of long simulations, the task of analyzing this data and making a
direct connection with experiments would still remain. And, of course, the validity of
the results of any simulation depends on the accuracy of the approximations and
parameters (together referred to as the force field) used to describe the interactions
between atoms. Unfortunately, testing a force field requires obtaining sufficient
sampling and comparing the results to a large body of experimental data, so selecting
(or developing) a good force field is non-trivial at best.
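
The sampling estimate given earlier in this paragraph follows from simple arithmetic, sketched below. The throughput of ~5 ns/day is taken from the text; the 2 fs timestep is an assumed value, since the text only says "on the order of femtoseconds".

```python
# Back-of-the-envelope cost of brute-force sampling.
ns_per_day = 5.0          # simulation throughput quoted in the text
target_ns = 1e9           # one second expressed in nanoseconds

days_needed = target_ns / ns_per_day
years_needed = days_needed / 365.25

timestep_fs = 2.0                      # assumption: 2 fs per integration step
steps_needed = 1e15 / timestep_fs      # 1 s = 1e15 fs

print(f"{years_needed:,.0f} years of wall-clock time")
print(f"{steps_needed:.1e} integration steps")
```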
Networks called Markov state models (MSMs) are one potential solution to
these problems (1-3). An MSM is essentially a map of a molecule’s conformational
space built from MD simulations. That is, like a road map with cities labeled with
populations connected by roads labeled with speed limits, MSMs give the probability
that a protein or other molecule will be in a certain set of conformations (called a
metastable state) connected by edges describing where it can go next and how quickly.
MSMs are typically constructed from simulation trajectories (3-8). Because of
the temporal relationship between conformations in a trajectory, it is possible to group
conformations that can interconvert rapidly into states and then determine the
connectivity between states by counting the number of times a simulation went from
one state to another. By employing these kinetic definitions, one ensures that the
system’s dynamics can be modeled reasonably well by assuming stochastic transitions
between states (1, 3-6, 9-12). Thus, it is possible to perform analyses, such as
identifying the most probable conformations at equilibrium or modeling the relaxation
of some experimental observable, and make a quantitative comparison to (or
predictions of) experiments. In addition, one can naturally vary the temporal and
spatial resolution of an MSM by changing the definition of what it means to
interconvert rapidly or slowly (4, 5, 10, 13, 14), much like zooming in and out on a
Google map. By choosing a long timescale cutoff, one can obtain humanly
comprehensible models with just a few metastable (or long-lived) states that capture
large conformational changes, like folding. Such coarse-grained models are useful for
gaining an intuition for a system. With a short timescale cutoff, on the other hand, one
can obtain a model with many states. By using such high resolution models, one
sacrifices ease of comprehension for more quantitative agreement with experiments (4,
5, 15). Regardless of the resolution, one can also draw on network theory to analyze
MSMs and gain important insights into processes like folding (16, 17). Thus, MSMs
are a powerful way of analyzing simulation data sets.
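
The counting procedure described above can be sketched in a few lines. This is a simplified illustration, not the MSMBuilder API: it assumes the conformations have already been assigned to integer state indices, counts transitions at a fixed lag time, row-normalizes the counts into transition probabilities, and finds the equilibrium populations by repeated propagation. All function names and the toy trajectories are hypothetical.

```python
def count_transitions(trajectories, n_states, lag=1):
    """Count observed i -> j transitions separated by `lag` frames."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for t in range(len(traj) - lag):
            counts[traj[t]][traj[t + lag]] += 1
    return counts


def transition_matrix(counts):
    """Row-normalize a count matrix into transition probabilities."""
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("state with no observed outgoing transitions")
        matrix.append([c / total for c in row])
    return matrix


def stationary_distribution(T, n_iter=10000, tol=1e-12):
    """Propagate a uniform distribution until it stops changing."""
    n = len(T)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Toy two-state example: each list is one trajectory of state indices.
trajs = [[0, 0, 0, 1, 1, 1, 1, 0, 0, 1],
         [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]]
counts = count_transitions(trajs, n_states=2)
T = transition_matrix(counts)
populations = stationary_distribution(T)
print(counts, T, populations)
```

Changing the lag argument changes the time resolution of the model. Production estimators typically add refinements, such as enforcing detailed balance, that are omitted here.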
MSMs also provide a statistical approach to molecular simulation—and
potentially other problems exhibiting metastability (18). Rather than attempting to
generate one realization of an entire process, one instead decomposes conformational
space into multiple metastable states and seeks to gather statistics on each step of the
process independently and in parallel (e.g. by running many short simulations from
each state and then combining them into a single MSM). Adaptive sampling
algorithms for MSM construction take this statistical approach a step further (12, 18-
20). In adaptive sampling, one first obtains an initial model of the entire process of
interest by any means possible. One then iteratively calculates the contribution of
each step of the process to uncertainties in some observable of interest via Bayesian
statistics and runs numerous parallel simulations of the steps that can lead to the
greatest increases in precision until the desired level of statistical certainty is achieved.
Such an approach was recently shown to lead to dramatic reductions in the statistical
uncertainty in the observable of interest relative to other refinement schemes (19).
More recently, we have shown that it leads to efficient improvement of the global
model quality (18). Once a converged sampling is obtained, MSMs at varying
resolutions can be used to assess the validity of the underlying force field by making
quantitative comparisons to existing data and predictions of new experiments.
Therefore, one can gain new insight into processes like protein folding, or at least
understand and correct errors in the force field.
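
The selection step in adaptive sampling can be illustrated with a deliberately simplified sketch. The cited methods propagate uncertainty to a chosen observable; the stand-in below instead scores each state by the summed posterior variance of its outgoing transition probabilities, assuming a Dirichlet posterior with a uniform prior over each row of the count matrix, and seeds new simulations from the highest-scoring states. The function names and the example counts are hypothetical.

```python
def row_uncertainty(count_row, prior=1.0):
    """Summed variance of one row's transition probabilities.

    Assumes a Dirichlet posterior with parameters alpha_j = count_j + prior,
    for which Var(p_j) = alpha_j (alpha_0 - alpha_j) / (alpha_0^2 (alpha_0 + 1)).
    """
    alphas = [c + prior for c in count_row]
    a0 = sum(alphas)
    return sum(a * (a0 - a) for a in alphas) / (a0 * a0 * (a0 + 1.0))


def pick_states_to_sample(counts, n_picks=1, prior=1.0):
    """Return the indices of the states with the most uncertain rows."""
    scores = [row_uncertainty(row, prior) for row in counts]
    ranked = sorted(range(len(counts)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_picks]


# Toy count matrix: state 2 has been visited far less than the others.
example_counts = [[90, 10, 0],
                  [8, 80, 12],
                  [0, 2, 3]]
print(pick_states_to_sample(example_counts, n_picks=2))
```

In an iterative scheme, new short simulations would be started from the selected states, the counts updated, and the selection repeated until the uncertainty falls below a chosen threshold.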
Here we describe how MSMs can be used to understand protein folding
(and related problems in molecular kinetics) and connect to experiments. We begin
with an introduction to MSMs and a software package we developed to automate the
construction of these models from simulation data sets. Next, we describe initial
applications of this software to small model systems (a 35 residue mutant of the villin
headpiece and a 39 residue fragment of NTL9) to test this methodology. We then
describe new insights into protein folding obtained from MSMs and their application
to larger, more biologically relevant systems like λ repressor (an 80-residue protein).
This discussion is followed by an explanation of how MSMs can be used to solve the
sampling problem using adaptive sampling and other enhanced sampling algorithms.
Within this discussion of sampling, we also describe some of the initial applications of
MSMs to RNA folding.
CHAPTER 1: USING GENERALIZED ENSEMBLE SIMULATIONS AND
MARKOV STATE MODELS TO IDENTIFY CONFORMATIONAL STATES
This chapter was taken from: Bowman GR, Huang X, & Pande VS (2009) Using
generalized ensemble simulations and Markov state models to identify conformational
states. Methods 49:197-201.
ABSTRACT
Part of understanding a molecule’s conformational dynamics is mapping out the
dominant metastable, or long-lived, states that it occupies. Once identified, the rates
for transitioning between these states may then be determined in order to create a
complete model of the system’s conformational dynamics. Here we describe the use of
the MSMBuilder package (now available at https://simtk.org/home/msmbuilder/) to
build Markov State Models (MSMs) to identify the metastable states from Generalized
Ensemble (GE) simulations, as well as other simulation datasets. Besides building
MSMs, the code also includes tools for model evaluation and visualization.
INTRODUCTION
Molecular Dynamics (MD) and Monte Carlo (MC) computer simulations have the
potential to complement experiments by elucidating the chemical details underlying
the conformational dynamics of biological macromolecules like proteins and RNA.
Such simulations sample a system’s free energy landscape, which is characterized by
long-lived, or metastable, states separated by large free energy barriers. Thus,
understanding a system’s conformational dynamics can be broken down into two
steps: 1) identifying the long lived, or metastable, states visited by the system and 2)
determining the rates of transitioning between these states. Unfortunately, it is
extremely difficult to adequately sample the conformational space accessible to
biomolecules. Furthermore, even if adequate sampling can be achieved, the resulting
datasets are often quite large and, therefore, difficult to analyze and interpret.
A popular approach to the first step is to use Generalized Ensemble (GE)
algorithms (21-25) to sample the accessible space and then to generate projections of
the free energy landscape onto some set of order parameters to identify the dominant
thermodynamic states (26-29). GE algorithms, such as the Replica Exchange Method
(REM) (22, 23) and Simulated Tempering (ST) (24, 25), achieve broad sampling at
the temperature of interest by performing a random walk in temperature space. Broad
sampling is possible because an energy barrier that is difficult to cross at the
temperature of interest will be flattened out and, therefore, more easily crossed at
higher temperatures. GE algorithms also maintain canonical sampling at every
temperature. Thus, they are a suitable way to sample the accessible space.
Projections of the free energy landscape onto a few order parameters are
frequently used to make sense of the resulting dataset (26-29). Such projections may
be meaningful if an appropriate set of order parameters is chosen; however, this is
quite difficult so there is always the danger of being misled by projections because
meaningful information along other order parameters may be completely lost (3, 30).
For example, structures that fall within the same basin in some projection may have
little structural or kinetic similarity. Thus, choosing a representative conformation for
that basin may be impossible.
Clustering methods, on the other hand, do not have these issues because the
dominant order parameters do not need to be specified in advance. However, most
clustering algorithms group conformations together based solely on their structural
similarity (31, 32), so they may fail to capture important kinetic properties. To
illustrate the importance of integrating kinetic information into the clustering of
simulation trajectories, one can imagine two people standing on either side of a wall.
Geometrically these two individuals may be very close but kinetically speaking it
could be extremely difficult for one to get to the other. Similarly, two conformations
from a simulation dataset may be geometrically close but kinetically distant and,
therefore, a clustering based solely on a geometric criterion would be inadequate for
describing the system’s dynamics.
Here we describe the use of Markov State Models (MSMs) to identify
metastable states in GE datasets, though we note that the MSMBuilder package we
introduce to build MSMs may be applied to any simulation dataset. An MSM may be
thought of as a form of clustering that incorporates kinetic information by grouping
conformations that can interconvert rapidly into the same state and conformations that
cannot interconvert rapidly into different states (3, 6, 9, 11, 33, 34). Thus,
conformations in the same metastable state, which may be thought of as a large free
energy basin, will be grouped together while conformations separated by large free
energy barriers will not.
A biomolecular folding free energy landscape may be thought of as a hierarchy
of basins (35, 36). Since larger basins may contain numerous smaller local minima, our
use of the phrase "free energy basin" above is somewhat ambiguous. To determine what
constitutes a distinct free energy basin, an MSM may be represented as a transition
probability matrix where the entry at row i and column j gives the probability of
transitioning from state i to state j during a time Δt, called the lag time. Based on this
matrix one may obtain a series of implied timescales for transitioning between various
regions of phase space and use this information to determine an appropriate number of
metastable states, as explained below. The number of metastable states to be
constructed controls the resolution of the model by determining how large a barrier
must be in order to divide phase space into multiple states.
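The transition probability matrix described above can be illustrated with a minimal sketch. It assumes that each trajectory has already been reduced to a list of integer state indices and that transitions are counted with a sliding window of `lag` frames; the function names and the toy data are hypothetical and are not part of MSMBuilder.

```python
def count_transitions(trajectories, n_states, lag):
    """Count i -> j transitions observed `lag` frames apart (sliding window)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for t in range(len(traj) - lag):
            counts[traj[t]][traj[t + lag]] += 1
    return counts


def row_normalize(counts):
    """Convert counts to a row-stochastic transition probability matrix."""
    matrix = []
    for row in counts:
        total = sum(row)
        # Assumption: a state with no outgoing counts keeps an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Two toy trajectories of state indices (made-up data).
trajs = [[0, 0, 1, 1, 0], [1, 1, 1, 0, 0]]
T = row_normalize(count_transitions(trajs, n_states=2, lag=1))
print(T)
```

Entry `T[i][j]` is then an estimate of the probability of moving from state i to state j within one lag time. Simple row normalization is only one possible estimator and does not enforce detailed balance.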
In the past, MSMs have generally been used to model kinetics and, therefore,
have been built from constant temperature data. For example, MSMs have been used
to model numerous small systems (33, 37, 38) and a few larger ones (39, 40). Since
GE simulations perform a random walk in temperature space they do not have
physical kinetics. However, GE simulations contain the desired canonical ensemble
and therefore the desired free energy barriers. These barriers may be flattened or
distorted at higher temperatures, but the barriers at the temperature of interest should
still be sufficient to provide the desired separation of timescales, that is, fast intrastate
transitions and slower interstate transitions. Thus, the pseudo-kinetics of GE
simulations are still sufficient to identify the dominant metastable states.
In the following sections we describe the use of the MSMBuilder package
(now available at https://simtk.org/home/msmbuilder/) to identify the dominant
metastable states in GE datasets, though we note the method may be applied as is to
datasets generated with other algorithms and is easily extensible to completely
different problems. There are four major steps in the procedure: 1) dividing the data
into small sets called microstates based on their structural similarity, 2) lumping
kinetically related microstates together into metastable states (also called macrostates),
3) extracting representative conformations for each state, and optionally 4) calculating
populations of each state to judge convergence. Steps 1-3 are depicted schematically
in Figure 1. The conformations extracted with this method represent the space
explored by the system and thus give insights into its dynamics. The pseudo-kinetics
of the GE simulations may give some indication of the connectivity of these states but
cannot give conclusive results due to the random walk in temperature space. However,
this method may serve as a basis for obtaining both accurate thermodynamics and
kinetics (Huang et al. in preparation).
Figure 1. Schematic of the steps required for building an MSM and obtaining
representative conformations for each state. First, GE data represented by points
are grouped into microstates represented by circles, with darker circles for more
highly populated microstates. Kinetically related microstates are then lumped
together into macrostates, or metastable states, represented by amorphous shapes.
Finally, representative conformations are obtained by extracting the most probable
conformation from each macrostate.
DESCRIPTION OF METHOD
1. DIVIDING THE DATA INTO MICROSTATES
The first step in building an MSM is to divide the data into thousands of microstates
based on their structural similarity (6). For conformational dynamics we measure
structural similarity by the RMSD for some subset of the atoms. While the RMSD
may not be very meaningful for large distances, it does have a kinetic interpretation
for small distances. That is, conformations with very small RMSDs should be able to
interconvert rapidly. Thus, if a microstate is small enough that every member has a
very small RMSD to every other member then one may assume that their structural
similarity implies a kinetic similarity.
However, one must also take care not to generate microstates that are too small
because it is important to see a sufficient number of transitions between them. For
example, if every conformation were put into its own microstate no pair of trajectories
would ever visit the same microstate. Thus, the most meaningful grouping of
microstates would be to group every conformation in the same trajectory together and
no new insight would be gained.
One method for determining an appropriate size for each microstate is to
measure the average RMSD between every pair of temporally adjacent conformations
in each trajectory and to ensure that the diameter of each microstate is no more than
this value (Sun et al. in preparation). Thus, any pair of conformations within a given
microstate will tend to be within one MD step of each other. However, this method
may be overly stringent. We have found that using microstates with an all-heavy-atom
RMSD radius of about 3.0 Å allows us to capture the true equilibrium distribution for
an 8 nucleotide RNA hairpin. Preliminary work in our lab shows that radii on the
order of 2-2.5 Å seem more appropriate for protein systems.
One can use the doFastGromacsClustering executable provided by the
Clusterer component of the MSMBuilder package to divide a dataset into microstates.
At present the Clusterer code is capable of using an approximation of the k-centers
clustering algorithm (41, 42) to divide simulation datasets generated with the Gromacs
software package (43) into some number of microstates. However, it is written in
object-oriented C++, so it is straightforward to add new clustering algorithms,
data types to cluster, distance metrics, and other components.
The approximate k-centers clustering algorithm was chosen as the default
clustering method because it is deterministic, simple, fast, and creates clusters with
approximately equal radii (42). The algorithm works as follows: 1) every point is
initially infinitely far from any cluster center, 2) choose an arbitrary point as the first
cluster center, 3) compute the distance between every point and the new cluster center,
4) assign points to this new cluster center if they are closer to it than the cluster center
they are currently assigned to, 5) declare the point that is furthest from its nearest
cluster center to be the next new cluster center, and 6) repeat steps 3-5 until the desired
number of clusters has been generated. Thus, the algorithm has complexity O(kN),
where k is the number of clusters to be generated and N is the number of data points to
be clustered. An order of magnitude speedup is also made possible by using the
triangle inequality to avoid unnecessary distance computations (Sun et al. in
preparation). This fast version of the algorithm is used by default, though the original
version described above is also available. Besides the cluster definitions, this program
also gives the radius of each microstate and the average and standard deviation of the
RMSD from every member of the microstate to the cluster center.
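The steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the Clusterer implementation: it takes an arbitrary `distance` callable in place of the RMSD, omits the triangle-inequality speedup, and uses hypothetical function and variable names.

```python
import math


def k_centers(points, k, distance, first=0):
    """Greedy approximate k-centers clustering.

    Returns (center_indices, assignments, dist_to_center).
    """
    n = len(points)
    dist_to_center = [math.inf] * n      # step 1: every point is infinitely far
    assignments = [-1] * n
    centers = []
    new_center = first                   # step 2: arbitrary first center
    while len(centers) < k:
        centers.append(new_center)
        cluster_id = len(centers) - 1
        for i, p in enumerate(points):
            d = distance(p, points[new_center])   # step 3
            if d < dist_to_center[i]:             # step 4: reassign if closer
                dist_to_center[i] = d
                assignments[i] = cluster_id
        # step 5: the point furthest from its nearest center is the next center
        new_center = max(range(n), key=dist_to_center.__getitem__)
    return centers, assignments, dist_to_center


# One-dimensional toy data with an absolute-difference metric (made-up data).
points = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
centers, assignments, dists = k_centers(points, 3, lambda a, b: abs(a - b))
print(centers, assignments, max(dists))
```

Each pass over the data costs N distance evaluations and there are k passes, which gives the O(kN) scaling noted above. The largest entry of `dists` within a cluster plays the role of that cluster's radius.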
The arbitrary starting point used by this approximate k-centers clustering
algorithm would be of some concern for small k or if the microstates were our primary
interest. However, we have found that the clustering results are insensitive to the
starting point for large k (e.g. k > 1000). In addition, we are mainly concerned with the
macrostates generated by lumping kinetically related microstates together. The
lumping algorithm described in the next section is fairly insensitive to the exact
boundaries between microstates as long as each microstate is sufficiently small, so the
arbitrary starting point is acceptable for building MSMs.
An attractive feature of this approximate k-centers algorithm is that it yields
clusters of approximately equal volume (as judged by using the maximal distance
between the cluster center and any other point in the cluster as the radius of a sphere)
(42). This property is of value because it means that the population of a cluster is
approximately proportional to its density in phase space. However, we note that
exploiting this interpretation requires some caution as it is unclear how to compute
exact volumes in a high dimensional phase space and, therefore, difficult to measure
densities in phase space precisely. Regardless, this property also allows the boundaries
between metastable states to be well-resolved. Clustering algorithms that do not have
this guarantee may create large clusters in sparse regions of phase space and small
clusters in dense regions. The large clusters in sparse regions of phase space are prone
to violate the assumption that conformations within a microstate are kinetically
related. Therefore, various conformations in the microstate may be most kinetically
related to different metastable states, in which case it will be unclear which macrostate
to group the microstate with.
2. LUMPING MICROSTATES INTO MACROSTATES
Conceivably, one could extract a representative conformation for each microstate to
get an idea of the conformational space explored by the system of interest. However,
this would only be a slight improvement upon examining the raw data itself. Instead, it
is valuable to lump kinetically related microstates together into metastable states, also
called macrostates. The tools for lumping together microstates, as well as for
extracting representative conformations and determining state populations, may be
found in the PythonTools component of the MSMBuilder package.
The first step in generating a set of macrostates is to determine how many of
them to create (6). This task may be accomplished with the
BuildMSMsAsVaryLagTime.py script. This script builds a microstate MSM for
each of a series of lag times. A microstate MSM is just a transition probability matrix
where the entry in row i and column j is the probability that a simulation will be in
microstate j at time t+Δt given that it was in state i at time t. A series of implied
timescales are then calculated and printed to a file for each microstate MSM based on
the eigenvalues of the transition probability matrix. These implied timescales
correspond to the timescales for transitioning between different sets of microstates. An
appropriate number of macrostates to build can be determined based on the location of
the major gap in the implied timescales, which should correspond to the largest
separation of timescales within the system. The implied timescales for multiple lag
times are examined because the location of this gap is normally sensitive to the lag
time. Ideally the implied timescales will level out as the lag time increases (34) and
obvious gaps that are robust with respect to the lag time will be apparent, as indicated
in Figure 2. An appropriate number of macrostates is then one more than the number
of implied timescales above the major gap (3, 6). In non-ideal cases the number of
implied timescales above the gap will not level off. In such cases we recommend
erring on the side of having too many macrostates rather than too few. If too many
macrostates are generated then some of the representative conformations may be
redundant (only separated by small barriers), whereas if too few are constructed
important regions of phase space may not be identified.
Figure 2. Implied timescales as a function of the lag time. There are two probable gaps in the implied
timescales. If gap one were selected then a macrostate MSM with four states would be constructed
whereas if gap two were selected a higher resolution MSM with six states would be constructed.
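The implied timescale calculation can be sketched as follows, using the standard relation τ_i = -Δt / ln λ_i between the eigenvalues λ_i of the transition probability matrix and the implied timescales. The sketch relies on NumPy. The function names, the toy matrix, and the automatic gap heuristic are assumptions made for illustration; in practice the gap is judged by inspecting the timescales over several lag times, as described above.

```python
import numpy as np


def implied_timescales(T, lag_time):
    """Return tau_i = -lag_time / ln(lambda_i), sorted from slowest to fastest."""
    eigenvalues = np.linalg.eigvals(np.asarray(T, dtype=float)).real
    lam = np.sort(eigenvalues)[::-1][1:]     # drop the stationary eigenvalue (1)
    lam = lam[(lam > 0.0) & (lam < 1.0)]     # the logarithm needs 0 < lambda < 1
    return -lag_time / np.log(lam)


def n_macrostates_from_gap(timescales):
    """Hypothetical heuristic: place the gap at the largest ratio between
    consecutive timescales and return (timescales above the gap) + 1."""
    ratios = timescales[:-1] / timescales[1:]
    return int(np.argmax(ratios)) + 2


# Toy two-state matrix whose second eigenvalue is 0.7 (made-up data).
T = [[0.9, 0.1],
     [0.2, 0.8]]
ts = implied_timescales(T, lag_time=1.0)
print(ts)

# Made-up timescales with one clear gap after the second entry.
print(n_macrostates_from_gap(np.array([100.0, 80.0, 5.0, 4.0])))
```

Repeating the calculation for a series of lag times and plotting the results gives a figure of the kind shown in Figure 2.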
A macrostate MSM with the appropriate number of states may then be built
using the BuildMacroMSM.py script. First, this script uses the Perron Cluster
Cluster Analysis (PCCA) algorithm (44, 45) to lump together kinetically related
microstates. The PCCA algorithm identifies kinetic relationships based on the
eigenvalue/eigenvector structure of the microstate MSM and will not be described in
detail here. This initial lumping is then refined using simulated annealing to maximize
the metastability (6), which is defined as
$$Q = \sum_{i=1}^{N} T(i,i)$$ (1)
where N is the number of macrostates and T is the macrostate MSM transition
probability matrix. In words, the metastability is the sum of the self-transition
probabilities of each macrostate. Thus, the metastability may range from 0 to N.
Maximizing the metastability is a heuristic for maximizing the separation of
timescales (6). During each simulated annealing step a randomly selected microstate is
reassigned to a randomly selected macrostate, the resulting change in metastability is
calculated, and the move is either accepted or rejected based on the Metropolis
criterion.
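A minimal sketch of this refinement step is given below. It assumes the microstate model is available as a matrix of transition counts and that a lumping is a list mapping each microstate index to a macrostate index. The geometric cooling schedule, the parameter values, and all names are illustrative assumptions rather than the behavior of BuildMacroMSM.py, and the PCCA initialization is not shown.

```python
import math
import random


def metastability(micro_counts, mapping, n_macro):
    """Q = sum_i T(i, i) for the macrostate matrix implied by `mapping`."""
    counts = [[0.0] * n_macro for _ in range(n_macro)]
    for i, row in enumerate(micro_counts):
        for j, c in enumerate(row):
            counts[mapping[i]][mapping[j]] += c
    q = 0.0
    for a in range(n_macro):
        total = sum(counts[a])
        if total > 0:
            q += counts[a][a] / total   # self-transition probability of state a
    return q


def refine_lumping(micro_counts, mapping, n_macro, n_steps=2000,
                   t_start=0.1, t_end=0.001, seed=0):
    """Monte Carlo refinement of a lumping; returns the best mapping seen."""
    rng = random.Random(seed)
    mapping = list(mapping)
    q = metastability(micro_counts, mapping, n_macro)
    best_mapping, best_q = list(mapping), q
    for step in range(n_steps):
        # Geometric cooling schedule (an assumption of this sketch).
        temperature = t_start * (t_end / t_start) ** (step / max(n_steps - 1, 1))
        micro = rng.randrange(len(mapping))
        old_macro = mapping[micro]
        mapping[micro] = rng.randrange(n_macro)
        q_new = metastability(micro_counts, mapping, n_macro)
        # Metropolis criterion for a quantity that is being maximized.
        if q_new >= q or rng.random() < math.exp((q_new - q) / temperature):
            q = q_new
            if q > best_q:
                best_mapping, best_q = list(mapping), q
        else:
            mapping[micro] = old_macro   # reject the move
    return best_mapping, best_q


# Four microstates forming two weakly coupled pairs (made-up counts).
micro_counts = [[90, 10, 0, 0],
                [10, 90, 1, 0],
                [0, 1, 90, 10],
                [0, 0, 10, 90]]
q_start = metastability(micro_counts, [0, 1, 0, 1], 2)
q_good = metastability(micro_counts, [0, 0, 1, 1], 2)
best_mapping, best_q = refine_lumping(micro_counts, [0, 1, 0, 1], 2)
print(q_start, q_good, best_mapping, best_q)
```

Recomputing Q from scratch after every move keeps the sketch short. For thousands of microstates one would instead update only the rows and columns affected by the move.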
We recommend using a lag time of one step to build the MSM to maximize the
use of all the data. The resulting state definitions and a longer lag time may then be
used to obtain populations and transition rates. A lag time within the implied timescale
gap should yield a strongly Markovian model, that is, one with a sufficiently large
separation of timescales that the assumption that the state at time t+Δt depends only on
the state at time t is valid.
The main outputs of the BuildMacroMSM.py script are a mapping from
microstates to macrostates and the metastability of this lumping. The mapping from
microstates to macrostates may be used to determine which macrostate each data point
is in using the WriteMacroAssignments.py script or the
doFastGromacsAssign program. In general, the
WriteMacroAssignments.py script should be used as it is faster. Both methods
allow the user to specify a temperature range and will only print out assignments for
conformations within this range. This feature is useful for calculating populations of
states at a given temperature. The mapping may also be used by the
getMacroStateCenters program to get information about each macrostate, such
as the most geometrically central microstate and the average and the standard
deviation of the RMSD between that microstate’s center and the center of every other
microstate in that macrostate. Such information is useful for getting an idea of the size
of each macrostate.
3. EXTRACTING REPRESENTATIVE CONFORMATIONS
There are a number of ways of extracting representative conformations for each
macrostate. A simple way of getting a single conformation is to use the
getMacroStateCenters program as discussed above. However, one must
remember that conformations selected in this manner represent the geometric center of
each macrostate and not necessarily the most probable member of each macrostate.
To understand the distribution of conformations in each macrostate one may
identify the central conformation of each microstate in a given macrostate using the
GetMicroCentersByMacroState.py script. The conformations for a given
macrostate may then be overlaid in a viewer for visual analysis. Such an approach may
be cumbersome if there are too many microstates in each macrostate. One alternative
is to randomly select a reduced number of conformations from each macrostate using
the GetRandomConfsFromEachState.py script. A major shortcoming of these
methods is that they select conformations with a more or less uniform distribution
across the macrostate.
Probably the best way of extracting representative conformations is to use the
GetDensityInfo.py script. This script outputs a list of the microstates in each
macrostate ordered from densest to sparsest, that is, from the most probable to the least
probable. Any number of the most probable structures in a given macrostate may then
be selected and overlaid in a viewer to get an idea of the distribution of conformations
within the state.
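The ordering described above can be illustrated with a short sketch. It assumes that microstate populations serve as a proxy for density, which is reasonable when the microstates have approximately equal volume, and that the lumping is a list mapping each microstate index to a macrostate index. The names and data are hypothetical, and the actual output format of GetDensityInfo.py is not reproduced here.

```python
from collections import Counter


def microstates_by_density(micro_assignments, micro_to_macro):
    """For each macrostate, list its microstates from most to least populated."""
    counts = Counter(micro_assignments)
    by_macro = {}
    for micro, macro in enumerate(micro_to_macro):
        by_macro.setdefault(macro, []).append(micro)
    # Sort by descending population; ties are broken by microstate index.
    return {macro: sorted(members, key=lambda m: (-counts[m], m))
            for macro, members in by_macro.items()}


# Microstate index of each conformation in a toy dataset (made-up data).
micro_assignments = [1, 1, 1, 0, 3, 3, 2]
micro_to_macro = [0, 0, 1, 1]
ranked = microstates_by_density(micro_assignments, micro_to_macro)
print(ranked)
```

Taking the cluster centers of the first few microstates in each list then gives a set of high-probability representatives for that macrostate.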
4. JUDGING CONVERGENCE
Unfortunately there is no analytic way of checking that a single set of simulations has
explored the entire accessible space for a given system and, therefore, yielded
representative conformations that accurately describe the conformational dynamics.
To the best of our knowledge, the most effective way to ensure that the entire space
has been explored is to run two distinct sets of simulations started from very different
initial configurations. The populations for each state may then be calculated for each
dataset. If they agree then one can be relatively sure that the entire space has been
explored because the thermodynamics found are independent of the starting
conformation.
One practical consideration is that the same state definition must be used for
both datasets because it is unclear how to compare different MSMs. A common state
definition may be obtained by building a single MSM based on both datasets. The
WriteMacroAssignments.py script or doFastGromacsAssign program
may then be used to independently assign each dataset to this common state definition,
preferably restricting the assignments to the temperature range of interest for GE
datasets, so that the population of each state may be determined. Of course, due to the
stochastic nature of conformational dynamics the two sets of populations are unlikely
to agree exactly. To make a valid comparison the GetMacroMSMPopStats.py
script may be used to obtain error bars on the populations from each dataset. This
script uses a bootstrapping algorithm to approximate the variation in the populations.
If the populations agree within error then the two simulations may be considered to
have converged to the true equilibrium distribution and one may be relatively sure that
the entire accessible space has been explored. Thus, the conformations extracted in
step 3 will provide an accurate depiction of the conformational dynamics of the
system.
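The comparison of populations can be sketched as follows. The source states only that a bootstrapping algorithm is used, so the details here are assumptions: whole trajectories are resampled with replacement, the spread is reported as a standard deviation, and two datasets are said to agree when every population differs by no more than `n_sigma` combined standard deviations. All names and data are hypothetical and do not describe GetMacroMSMPopStats.py.

```python
import math
import random


def state_populations(assignments, n_states):
    """Fraction of conformations assigned to each state."""
    counts = [0] * n_states
    for s in assignments:
        counts[s] += 1
    return [c / len(assignments) for c in counts]


def bootstrap_populations(trajectories, n_states, n_resamples=200, seed=0):
    """Resample whole trajectories with replacement; return (means, stds)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_resamples):
        resampled = [rng.choice(trajectories) for _ in trajectories]
        pooled = [s for traj in resampled for s in traj]
        samples.append(state_populations(pooled, n_states))
    means = [sum(s[k] for s in samples) / n_resamples for k in range(n_states)]
    stds = [math.sqrt(sum((s[k] - means[k]) ** 2 for s in samples) / n_resamples)
            for k in range(n_states)]
    return means, stds


def agree_within_error(mean_a, std_a, mean_b, std_b, n_sigma=2.0):
    """True if every pair of populations differs by at most n_sigma combined stds."""
    return all(abs(a - b) <= n_sigma * math.sqrt(sa ** 2 + sb ** 2)
               for a, sa, b, sb in zip(mean_a, std_a, mean_b, std_b))


# Macrostate assignments for two independent toy datasets (made-up data).
data_a = [[0, 0, 1], [0, 1, 1], [0, 0, 0], [1, 1, 0]]
data_b = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
means_a, stds_a = bootstrap_populations(data_a, n_states=2, seed=1)
means_b, stds_b = bootstrap_populations(data_b, n_states=2, seed=2)
print(means_a, stds_a)
print(means_b, stds_b)
print(agree_within_error(means_a, stds_a, means_b, stds_b))
```

Resampling whole trajectories rather than individual frames is chosen here because frames within a trajectory are correlated; other resampling units are possible.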
CONCLUSIONS
Using the MSMBuilder to analyze GE simulations and other datasets will allow
researchers to quickly map out the conformational space explored by biological
macromolecules like RNA, which is the first step to understanding conformational
dynamics. The MSMBuilder may also be used to determine the rates of transitioning
between states in microcanonical and canonical simulations, resulting in a complete
Markov state model for the system’s conformational dynamics. While more
sophisticated algorithms for building MSMs exist (6), they are not likely to provide
much improvement for analyzing GE datasets due to the distortion resulting from high
temperature data. However, the highly extensible object-oriented design of the code should
allow such algorithms to be incorporated easily for use with other datasets.
Incorporating other data types, clustering methods, distance metrics, and analysis tools
should also be straightforward. In particular, this software serves as a foundation for
automating adaptive sampling algorithms (19), which promise to allow the maximal
use of one’s computing resources by focusing sampling on regions of uncertainty.
Finally, the results of applying this method to GE datasets may be used as a basis for
determining the rates of transitioning between states (Huang et al. in preparation),
thereby giving a complete picture of a system’s dynamics.
CHAPTER 2: PROGRESS AND CHALLENGES IN THE AUTOMATED
CONSTRUCTION OF MARKOV STATE MODELS FOR FULL PROTEIN
SYSTEMS
This chapter was taken from: Bowman GR, Beauchamp KA, Boxer G, & Pande VS
(2009) Progress and challenges in the automated construction of Markov state models
for full protein systems. Journal of Chemical Physics 131:124101.
ABSTRACT
Markov State Models (MSMs) are a powerful tool for modeling both the
thermodynamics and kinetics of molecular systems. In addition, they provide a
rigorous means to combine information from multiple sources into a single model and
to direct future simulations/experiments to minimize uncertainties in the model.
However, constructing MSMs is challenging because doing so requires decomposing
the extremely high dimensional and rugged free energy landscape of a molecular
system into long-lived states, also called metastable states. Thus, their application has
generally required significant chemical intuition and hand tuning. To address this
limitation we have developed a toolkit for automating the construction of MSMs
called MSMBuilder (available at https://simtk.org/home/msmbuilder). In this work we
demonstrate the application of MSMBuilder to the villin headpiece (HP-35 NleNle),
one of the smallest and fastest folding proteins. We show that the resulting MSM
captures both the thermodynamics and kinetics of the original molecular dynamics of
the system. As a first step towards experimental validation of our methodology we
show that our model provides accurate structure prediction and that the longest
timescale events correspond to folding.
INTRODUCTION
For a molecular system, the distribution of conformations and the dynamics between
them are determined by the underlying free energy landscape. Thus, the ability to map
out a molecule’s free energy landscape would yield solutions to many outstanding
biophysical questions. For example, structure prediction could be accomplished by
identifying the free energy minimum (46), leading to insights into catalytic
mechanisms of proteins that are difficult to crystallize. Intermediate states, such as
those currently thought to be the primary toxic elements in Alzheimer’s disease (47),
could also be identified by locating local minima. As a final example, protein folding
mechanisms could be understood by examining the rates of transitioning between all
the relevant states.
Unfortunately, the free energy landscapes of solvated biomolecules are
extremely high dimensional and there is no analytical means to identify all the relevant
features, especially when one is concerned with molecules in which small molecular
changes yield significant perturbations of the system, such as amino acid mutations in
proteins. Therefore, a theoretical treatment requires sampling the potential, generally
using Monte Carlo (MC) or Molecular Dynamics (MD), and then inferring
information about the states in the free energy landscape from the sampled
configurations. Moreover, if one is interested in kinetic properties, one must go further
and sample kinetic quantities (e.g. rates) of interconversion between these
thermodynamic states.
Mapping out a molecule’s free energy landscape can be broken down into
three stages: 1) identifying the relevant states and, in particular, the native state, 2)
quantifying the thermodynamics of the system, and 3) quantifying the kinetics of
transitioning between the states. Each of these stages builds upon the preceding stages.
In fact, this hierarchy of objectives is evident in the literature. For example, in the
structure prediction community it is common to plot the free energy as a function of
the RMSD to the native state (48). Such representations allow researchers to quickly
assess whether or not their potential accurately captures the most experimentally
verifiable state, the native state. However, they provide little information on the
presence of other states, their relative probabilities, or the kinetics of moving between
them (49). Projections of the free energy landscape onto multiple order parameters, on
the other hand, may capture multiple states and their thermodynamics (30, 49). The
main limitation of these representations is that they depend heavily upon the order
parameters selected (30). If the order parameters are not good reaction coordinates,
then important features may be distorted or even completely obscured (30, 50).
Furthermore, barring the selection of a perfect set of reaction coordinates, such
projections only yield limited information about the system’s kinetics due to loss of
information about other important degrees of freedom (51).
Clustering techniques are a promising means of overcoming these limitations
as they allow the automatic identification of the relevant degrees of freedom (52).
However, most clustering techniques are based solely on geometric criteria (31, 32) so
they may fail to capture important kinetic properties. To illustrate the importance of
integrating kinetic information into the clustering of simulation trajectories, one can
imagine two people standing on either side of a wall. Geometrically these two
individuals may be very close but kinetically speaking it could be extremely difficult
for one to get to the other. Similarly, two conformations from a simulation dataset may
be geometrically close but kinetically distant and, therefore, a clustering based solely
on a geometric criterion would be inadequate for describing the system’s dynamics.
Markov State Models (MSMs) fit nicely into this progression as they provide a
natural means to achieve a complete understanding of a molecule’s free energy
landscape—a map of all the relevant states with their correct thermodynamics and
kinetics (3, 6, 9, 10, 53). The critical distinction between MSMs and other clustering
techniques is that an MSM constitutes a kinetic clustering of one’s data (3, 6, 9, 10).
That is, conformations that can interconvert rapidly are grouped into the same state
while conformations that can only interconvert slowly are grouped into separate states.
Such a kinetic clustering ensures that equilibration within a state, and therefore loss of
memory of the previous state, occurs more rapidly than transitions between states. As
a result, the model satisfies the Markov property—the identity of the next state
depends only on the identity of the current state and not any of the previous states.
MSMs are better able to capture the stochastic nature of processes like protein
folding than traditional analysis techniques, allowing more quantitative comparisons
with and predictions of experimental observables. Thus, they will allow researchers to
move beyond the traditional view of MD simulations as molecular microscopes. An
MSM also provides a natural means of varying the resolution of one’s model. For
example, consider a protein folding process that occurs on a 10 μs timescale. Using a
cutoff of 1 ns to distinguish a fast transition from a slow one would yield a high
resolution model that may be difficult to interpret by eye. Using a cutoff of 1 μs,
however, would likely yield a high-level model capturing the essence of the process in
a human readable form. MSMs provide a rigorous means to combine data from
multiple sources and can be used to extract information about long timescale events
from short simulations (11, 54, 55). Finally, there are a number of ways of exploiting
MSMs to minimize the amount of computation that must be performed to achieve a
good model for a given system (12, 19, 20).
Unfortunately, constructing MSMs is a difficult task because it requires
dividing the rugged and high dimensional free energy landscape of a system into
metastable states (6). A good set of states will tend to divide phase space along the
highest free energy barriers. More specifically, none of the states will have significant
internal barriers. Such a partitioning ensures the separation of timescales discussed
above—intrastate transitions are fast relative to interstate transitions—and, therefore,
that the model is Markovian. States with high internal barriers break the separation of
timescales and introduce memory. To illustrate this situation, imagine a state divided
in half by a single barrier that is higher than any barrier between states. Besides
breaking the separation of timescales by causing transitions within this state to be slow
relative to transitions between states, trajectories that enter the state to the left of the
internal barrier will also tend to leave to the left while trajectories that enter on the
right will tend to leave to the right. Thus, the probability of any possible new state will
depend both on the identity of the current state and the previous state, breaking the
Markov property. Avoiding such internal barriers has generally required a great deal
of chemical insight and hand tuning (33, 39); thus, the application of MSMs has been
limited.
To facilitate the more widespread use of MSMs we have developed an open
source software package called MSMBuilder that automates their construction (now
available at https://simtk.org/home/msmbuilder) (10). MSMBuilder builds on previous
automated methods (6) by incorporating new geometric and kinetic clustering
algorithms. It also provides a command-line interface built on top of an object oriented
structure that should allow for the rapid incorporation of new advances. In summary,
MSMBuilder works as follows: 1) group conformations into very small states called
microstates and assume the high degree of structural similarity within a state implies a
kinetic similarity, 2) validate that this state decomposition is Markovian, and
optionally 3) lump the microstates into some number of macrostates based on kinetic
criteria and ensure that this macrostate model is Markovian. There are also a number
of tools for analyzing and visualizing the model at both the microstate and macrostate
levels.
In this work we demonstrate that MSMBuilder is able to construct MSMs for
full protein systems in an automated fashion by applying it to the villin headpiece
(HP-35 NleNle) (56, 57). Unlike the peptides that have been studied with automated
methods in the past (6), villin has all the hallmarks of a protein, such as a hydrophobic
core and tertiary contacts. It is also fast folding, so it is possible to carry out
simulations on timescales comparable to the folding time (58).
Our hope is that this work will serve as a guide for future users of
MSMBuilder. Thus, we will discuss failed models, the insights these models gave us,
and how these insights led to the final model. We will also discuss some of the
remaining limitations in the automated construction of MSMs. In addition, we will
demonstrate that our model yields accurate structure prediction and that the longest
timescales correspond to folding. However, our main emphasis will be on the
methodology of building MSMs that faithfully represent the raw simulation data. In
particular, we will focus on the microstate level as this is the finest resolution and
bounds the performance of lower resolution models. The full biophysical implications
of the model and their relation to experimental results will be discussed more
thoroughly in a later work.
MATERIALS & METHODS
SIMULATION DETAILS
The data set used in this study was taken from Ensign et al. (58) and is described
briefly below. It consists of ~450 simulations ranging from 35 ns to 2 μs in length and
is publicly available at the SimTK website (https://simtk.org/home/foldvillin).
First, the crystal structure (PDB structure 2F4K) (56) was relaxed using a
steepest descent algorithm in GROMACS (43, 59) using the AMBER03 force field
(60). The resulting structure was placed in an octahedral box of dimensions 4.240
nm×4.969 nm×4.662 nm and solvated with 1306 TIP3P water molecules. Nine 10 ns
high temperature simulations (at 373 K), each with different initial velocities drawn
from a Maxwell–Boltzmann distribution, were run from this solvated structure. The
final structures from each of these unfolding simulations were then used as the initial
points for ~450 folding simulations at 300 K.
Folding simulations were preceded by 10 ns equilibration simulations at
constant volume with the protein coordinates fixed. For all MD simulations, the
SHAKE (61) and SETTLE (62) algorithms were used with the default GROMACS 3.3
parameters to constrain bond lengths. Periodic boundary conditions were employed.
To control temperature, protein and solvent were coupled separately to a Nosé–
Hoover thermostat (63, 64) with an oscillation period of 0.5 ps. The system was
coupled to a Parrinello–Rahman barostat (65, 66) at 1 bar, with a time constant of 10
ps, assuming a compressibility of 4.5×10^-5 bar^-1. Velocities were assigned randomly
from a Maxwell–Boltzmann distribution. The linear center-of-mass motion of the
protein and solvent groups was removed every ten steps. A cutoff of 0.8 nm was
employed for both the Coulombic and van der Waals interactions. During these
simulations, the long-range electrostatic forces were treated with a reaction field
assuming a continuum dielectric of 78, and the van der Waals interactions were treated
with a switch from 0.7 nm to 0.8 nm. The neighbor list cutoff was set to 0.7 nm for
computational performance.
MARKOV STATE MODEL CONSTRUCTION
All the MSMs used in this paper were constructed with MSMBuilder (10), the relevant
components of which are reviewed below. A significant modification of the code was
the introduction of sparse matrix types, which allows the construction of MSMs with
many more states than previously possible by making more efficient use of the
available memory. Sparse matrices will be included in the next release of
MSMBuilder.
CLUSTERING
An approximate k-centers clustering algorithm was used to generate the microstates in
all the MSMs used in this study (41, 42). The algorithm works as follows: 1) choose
an arbitrary point as the first cluster center, 2) compute the distance between every
point and the new cluster center, 3) assign points to this new cluster center if they are
closer to it than the cluster center they are currently assigned to, 4) declare the point
that is furthest from its currently assigned cluster center to be the next new cluster center, and 5)
repeat steps 2-4 until the desired number of clusters has been generated. The
computational complexity of this algorithm is O(kN) where k is the number of clusters
and N is the number of data points to be clustered. The algorithm is intended to give
clusters with approximately equal radii, where the radius of a cluster is defined as the
maximum distance between the cluster center and any other data point in the cluster.
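The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the MSMBuilder implementation: the function name k_centers and its arguments are hypothetical, and the Euclidean distance between coordinate tuples stands in for the heavy-atom RMSD used in this work.

```python
import math


def k_centers(points, k, distance=math.dist, first=0):
    """Approximate k-centers clustering (illustrative sketch, assumes k <= len(points)).

    Returns the indices of the cluster centers, the cluster index assigned to
    each point, and each point's distance to its assigned center.
    """
    n = len(points)
    centers = [first]                    # step 1: arbitrary first center
    assignments = [0] * n
    dist_to_center = [math.inf] * n
    while True:
        new_center = points[centers[-1]]
        cluster = len(centers) - 1
        for i, p in enumerate(points):
            d = distance(p, new_center)  # step 2: distance to the new center
            if d < dist_to_center[i]:    # step 3: reassign if the new center is closer
                dist_to_center[i] = d
                assignments[i] = cluster
        if len(centers) == k:            # step 5: stop once k clusters exist
            break
        # step 4: the point furthest from its assigned center becomes the next center
        centers.append(max(range(n), key=dist_to_center.__getitem__))
    return centers, assignments, dist_to_center
```

Each pass over the data costs one distance evaluation per point, which gives the O(kN) scaling quoted above. The largest value in the returned distance list is the largest cluster radius.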
Given that MD simulations are Markovian (9), it should be possible to generate
a Markov model for simulation dynamics by constructing sufficiently small (or
numerous) states. However, the size of a given data set will limit how many clusters
can be generated because reducing the number of conformations in each state will
eventually result in an unacceptable level of statistical uncertainty.
Based on the Boltzmann relationship, we can calculate the free energy of a
state as -kT ln(p), where p is the probability of being in the state. Though small
variations in the radii of microstates may imply quite large variations in their volumes
due to the high dimensionality of the phase space of biomolecules, empirically we find
that assuming the clusters have equal volume is useful. In particular, we find that
interpreting lower free energy microstates as having higher densities and evaluating
models based on the correlation between the free energy and RMSD of each
microstate agrees with other measures of the validity of an MSM, such as implied
timescales plots as discussed below. Because this relationship is not guaranteed to
hold, the correlation between microstate free energy and RMSD should never be used
as the sole assessment of a model. Nevertheless, as discussed in the Results & Discussion,
it is quite useful for identifying potential shortcomings of a given model. These issues
are not a concern at the macrostate level.
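As a rough illustration of the free energy calculation, the sketch below converts per-state conformation counts into free energies in kcal/mol at 300 K, the temperature of the folding simulations. Estimating p as the fraction of conformations assigned to each state is an assumption made here for simplicity, and the function name is hypothetical.

```python
import math

# Boltzmann constant expressed per mole, in kcal/(mol K)
K_B = 0.0019872041


def free_energies(counts, temperature=300.0):
    """Free energy -kT ln(p) of each state, with p estimated from raw counts.

    Assumes every state holds at least one conformation.
    """
    total = sum(counts)
    kT = K_B * temperature
    return [-kT * math.log(c / total) for c in counts]
```

More populated states receive lower free energies, which is the relationship plotted against RMSD in the scatter plots discussed in the Results & Discussion.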
All clustering in this work was based on the heavy-atom RMSD between pairs
of conformations. However, we note that pairs of atoms in the same side chain that are
indistinguishable with respect to symmetry operations were excluded from the RMSD
computations.
Representative conformations from some clusters are shown using VMD (67).
TRANSITION PROBABILITY MATRICES
Transition probability matrices are at the heart of MSMs (9). Row normalized
transition probability matrices are used in this study. The element in row i and column
j of such a matrix gives the probability of transitioning from state i to state j in a
certain time interval called the lag time (τ).
The transition probability matrix serves many purposes. For example, a vector
of state probabilities may be propagated forward in time by multiplying it by the
transition probability matrix.
p(t + \tau) = p(t)\, T(\tau)    (1.1)
where t is the current time, τ is the lag time, p(t) is a row vector of state probabilities at
time t, and T(τ) is the row normalized transition probability matrix with lag time τ.
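A minimal NumPy sketch of Equation 1.1 follows. It builds a row normalized transition probability matrix from trajectories of state indices and propagates a probability vector. The sliding-window counting and the function names are assumptions made for illustration and are not taken from MSMBuilder; the lag time is given in units of stored snapshots.

```python
import numpy as np


def count_matrix(state_trajs, n_states, lag):
    """Count observed i -> j transitions separated by `lag` snapshots."""
    counts = np.zeros((n_states, n_states))
    for traj in state_trajs:
        for i, j in zip(traj[:-lag], traj[lag:]):
            counts[i, j] += 1
    return counts


def transition_matrix(counts):
    """Row normalize the counts; assumes every state has at least one outgoing count."""
    return counts / counts.sum(axis=1, keepdims=True)


def propagate(p0, T, n_steps):
    """Apply p(t + tau) = p(t) T(tau) n_steps times to the row vector p0."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_steps):
        p = p @ T
    return p
```

Because p is a row vector and T is row normalized, the vector multiplies the matrix from the left and the total probability is conserved at every step.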
The eigenvalue/eigenvector spectrum of a transition probability matrix gives
information about aggregate transitions between subsets of the states in the model and
what timescales these transitions occur on (9). More specifically, the eigenvalues are
related to an implied timescale for a transition, which can be calculated as
k = -\frac{\tau}{\ln \mu}    (1.2)
where τ is the lag time and μ is an eigenvalue. The corresponding left eigenvector
specifies which states are involved in the aggregate transition. That is, states with
positive eigenvector components are transitioning with those with negative
components and the degree of participation for each state is related to the magnitude
of its eigenvector component (9).
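The implied timescales of Equation 1.2 and the left eigenvector of the slowest process can be obtained with a standard eigensolver, as in the sketch below. The function names are hypothetical. Taking the real part of the eigenvalues and discarding any that fall outside (0, 1) are simplifying assumptions, since an estimated matrix is not guaranteed to have a purely real spectrum.

```python
import numpy as np


def implied_timescales(T, lag, n_timescales=10):
    """Implied timescales k = -lag / ln(mu) for the largest non-unit eigenvalues."""
    mu = np.sort(np.linalg.eigvals(T).real)[::-1]  # descending order
    mu = mu[1:n_timescales + 1]                    # drop the stationary eigenvalue (1)
    mu = mu[(mu > 0) & (mu < 1)]                   # keep eigenvalues with a defined timescale
    return -lag / np.log(mu)


def slowest_left_eigenvector(T):
    """Left eigenvector of T belonging to the second-largest eigenvalue."""
    vals, vecs = np.linalg.eig(T.T)                # right eigenvectors of T^T = left of T
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[1]].real
```

The overall sign of an eigenvector is arbitrary, so only the relative signs of its components are meaningful: states with components of opposite sign exchange population on the corresponding timescale.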
IMPLIED TIMESCALES PLOTS
Implied timescales plots are one of the most sensitive indicators of whether or not a
model is Markovian (34). These plots are generated by graphing the implied
timescales of an MSM for a series of lag times. If the model is Markovian at a certain
lag time then the implied timescales should remain constant for any greater lag time.
The minimal lag time at which the implied timescales level off is the Markov time, or
the smallest time interval for which the model is Markovian. The implied timescales
for a non-Markovian model tend to increase with the lag time instead of leveling off.
Unfortunately, increasing the lag time decreases the amount of data and, therefore,
increases the uncertainty in the implied timescales. Thus, implied timescales plots can
be very difficult to interpret.
In this study error bars on implied timescales plots were obtained using a
bootstrapping procedure. Five subsets of the available trajectories were drawn at
random with replacement, and the averages and variances of the implied
timescales for each lag time were calculated across these subsets.
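A bootstrap over trajectories can be sketched as follows. The estimator argument is a hypothetical callable that maps a list of trajectories to a single number, for example the longest implied timescale at one lag time. Drawing each resampled set with the same number of trajectories as the original data set is an assumption, since the subset size is not specified above.

```python
import random
import statistics


def bootstrap(trajectories, estimator, n_iterations=5, seed=None):
    """Mean and variance of `estimator` over resampled sets of trajectories."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_iterations):
        # draw whole trajectories with replacement
        sample = rng.choices(trajectories, k=len(trajectories))
        estimates.append(estimator(sample))
    return statistics.mean(estimates), statistics.variance(estimates)
```

Resampling whole trajectories rather than individual conformations keeps the time ordering within each trajectory intact, which is needed to count transitions at a given lag time.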
TIME EVOLUTION OF OBSERVABLES
The time evolution of the mean and variance of any molecular observable can be
calculated from an MSM. Calculating the time evolution of an observable X requires
calculating the average of X in each state i (X_i) and the average of X^2 (X_i^2). In this
study we took averages over five randomly selected conformations from each state.
An initial state probability vector may then be propagated in time as in Equation 1.1.
At each time step the mean and variance can be calculated as
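\langle X \rangle = \sum_{i=1}^{N} p_i(t)\, X_i    (1.3)
\sigma^2 = \langle X^2 \rangle - \langle X \rangle^2    (1.4)
where N is the number of states, p_i(t) is the probability of state i at time t, σ is the
standard deviation, and
\langle X^2 \rangle = \sum_{i=1}^{N} p_i(t)\, X_i^2    (1.5)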
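Equations 1.1 and 1.3 to 1.5 combine into a short loop. The sketch below is illustrative only: the function name is hypothetical, and the per-state averages of X and X^2 are assumed to have been computed beforehand from conformations sampled from each state.

```python
import numpy as np


def observable_time_series(p0, T, x_mean, x2_mean, n_steps):
    """Mean and variance of an observable at times 0, tau, ..., n_steps * tau."""
    p = np.asarray(p0, dtype=float)
    x_mean = np.asarray(x_mean, dtype=float)    # average of X in each state
    x2_mean = np.asarray(x2_mean, dtype=float)  # average of X^2 in each state
    means, variances = [], []
    for _ in range(n_steps + 1):
        mean = p @ x_mean                       # Equation 1.3
        mean_sq = p @ x2_mean                   # Equation 1.5
        means.append(mean)
        variances.append(mean_sq - mean ** 2)   # Equation 1.4
        p = p @ T                               # Equation 1.1
    return np.array(means), np.array(variances)
```

The variance obtained this way reflects only the spread of the observable within and between states; it does not include statistical uncertainty in the transition probabilities.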
RESULTS & DISCUSSION
AN INITIAL MODEL
Given the computational cost of running extensive MD simulations, an important
consideration in constructing an MSM is to maximize one’s use of the available data.
Of course, one’s hardware always sets hard upper limits on the amount of data that
may be used at each stage of building an MSM. In particular, it may not always be
possible to fit all of the available conformations into memory for the initial clustering
phase of constructing an MSM with MSMBuilder. A convenient way of overcoming
this bottleneck is to use a subset of the available data to generate a set of clusters. Data
that was left out during the clustering phase may then be assigned to these clusters.
To maximize the use of our data while satisfying the memory constraints of
our system we first sub-sampled our dataset by a factor of 10 and clustered the
resulting conformations into 10,000 states. Snapshots were stored every 50 ps during
our MD simulations; these snapshots will henceforth be referred to as the raw data. Thus, the
effective trajectories used during our clustering consisted of snapshots separated by
500 ps. The remaining 90% of the data was subsequently assigned to this 10,000 state
model. Fortunately, it is possible to parallelize this assignment phase because the
cluster definitions are never updated after the initial clustering.
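The sub-sample, cluster, and assign workflow can be sketched as below. The function names are hypothetical, and Euclidean distance again stands in for the heavy-atom RMSD. Because the cluster centers are fixed, any chunk of conformations can be assigned independently of the others, which is what makes the assignment phase easy to parallelize.

```python
import math


def subsample(trajectory, stride):
    """Keep every `stride`-th conformation of one trajectory."""
    return trajectory[::stride]


def assign(conformations, centers, distance=math.dist):
    """Return the index of the nearest cluster center for each conformation."""
    return [
        min(range(len(centers)), key=lambda c: distance(x, centers[c]))
        for x in conformations
    ]


def assign_in_chunks(conformations, centers, chunk_size):
    """Assign chunk by chunk; each chunk could be handed to a separate worker."""
    labels = []
    for start in range(0, len(conformations), chunk_size):
        labels.extend(assign(conformations[start:start + chunk_size], centers))
    return labels
```

With snapshots stored every 50 ps, a stride of 10 corresponds to the 500 ps spacing used for the initial clustering.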
As discussed in the introduction, the first criterion for assessing the validity of
our model is whether or not it is capable of capturing the native state. The next
criterion is whether or not the thermodynamics of the model are correct. An initial
assessment of these two criteria may be obtained from a scatter plot of the free energy
of each state as a function of the RMSD of the state center from the native state.
There is some correlation between the free energy of a microstate and the
RMSD of its center from the crystal structure in this model, as shown in Figure 3A.
However, the most native-like RMSD of any of the state centers is 4.15 Å whereas the
simulations reach conformations with RMSD values as low as 0.52 Å. This
discrepancy is a first indication that there may be significant heterogeneity within the
states of this model. In particular, more near-native conformations must have been
absorbed into one or more other states. Highly heterogeneous states are likely to
violate the assumption that the degree of geometric similarity within a microstate
implies a kinetic similarity, preventing the construction of a valid MSM. This
conclusion is supported by the fact that the average distance between any
conformation and the nearest cluster center is over 4.5 Å.
Figure 3. Scatter plots of the free energy of each microstate (in kcal/mol) versus its RMSD. A) The
initial 10,000 state model, B) the 30,000 state model, C) the final 10,000 state model, and D) the
final 10,000 state model except that the average RMSD across five structures in each state is used
instead of the RMSD of the state center.
Final confirmation of the imperfections of the current 10,000 state model
comes from examining the implied timescales as a function of the lag time. If the
division into microstates were fine enough to ensure the absence of any large internal
barriers, the largest implied timescales should be invariant with respect to the lag time
for any lag time greater than the Markov time (34). Figure 4 shows that the implied
timescales for this model continue to grow monotonically as the lag time is increased.
While the growth is not too severe, it should be possible to improve upon this model
given the amount of sampling in the dataset.
Figure 4. Top ten implied timescales for the initial 10,000 state model.
Besides the structural and kinetic heterogeneity within states, the
monotonic growth of the implied timescales may also be due to the low number of
counts in some states and the resulting uncertainty in transition probabilities from
these states. For example, there are fewer than 10 data points in over 100 of the states at
the smallest lag time. Even for a state with ten data points no transition probability can
be resolved beyond a single significant digit. Increasing the lag time will reduce the
number of data points in every state, having particularly deleterious effects on
estimates of transition probabilities from states with low counts in the first place.
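The resolution limit mentioned above can be made explicit. In the sketch below, c_ij and n_i are notation introduced here for the number of observed transitions from state i to state j and the total number of transitions out of state i. The standard error uses a binomial approximation that treats the counts as independent, which is an assumption and will tend to understate the true uncertainty for correlated trajectory data.

```latex
\hat{T}_{ij}(\tau) = \frac{c_{ij}(\tau)}{n_i}, \qquad n_i = \sum_j c_{ij}(\tau)
% With n_i = 10 the estimate can only take the values 0, 0.1, 0.2, ..., 1.0.

\mathrm{SE}\left[\hat{T}_{ij}\right] \approx \sqrt{\frac{\hat{T}_{ij}\,(1 - \hat{T}_{ij})}{n_i}}
% Example: \hat{T}_{ij} = 0.5 and n_i = 10 give SE \approx 0.16.
```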
MORE STATES ARE NOT ALWAYS BETTER
As a first attempt at improving our original model we increased the number of states
from 10,000 to 30,000. Our objective in doing so was to avoid internal barriers by
dividing phase space into smaller states. In addition, we hoped to find more near
native states by pulling low RMSD conformations into their own clusters.
Clustering the data into more states did indeed result in more near-native
states, as shown in Figure 3B. The most native-like state center in the 30,000 state
model has an RMSD of 3 Å and there is still a general correlation between low free
energy and low RMSD. The average distance between any conformation and its
nearest state center was also reduced from 4.5 Å to 3.5 Å.
However, increasing the number of states also had some negative effects on the
model. In the 10,000 state model about 1% of the states had 10 or fewer conformations
in them whereas in the new 30,000 state model 6% of the states have 10 or fewer
conformations. Thus, the uncertainty in the transition probabilities from many states
will be greater. In addition, while increasing the number of states did create a handful
of more near-native states, it also more than doubled the number of states with an
RMSD over 10 Å. These phenomena are consistent with the fact that the approximate
k-centers clustering algorithm used in this work tends to create clusters with
approximately equal radii (41, 42). When adding more clusters, this property will tend
to result in most of the new clusters appearing in large sparse regions of phase space in
the tails of the distribution of conformations. As a result of these shortcomings, the
30,000 state model was found to have monotonically increasing implied timescales
similar to those for the 10,000 state model and, therefore, is not significantly more
Markovian than the previous model (data not shown).
DISREGARDING OUTLIERS DURING CLUSTERING YIELDS A MARKOVIAN MODEL
One approach to dealing with outliers would be to use all the data during the clustering
phase and then discard those clusters that behave in unphysical ways, such as clusters
that act as sinks. However, such an approach could discard legitimate trapped states.
In addition, the tendency of our approximate k-centers algorithm to select outliers as
cluster centers could easily result in a large fraction of clusters being discarded.
To deal with the limitations of our clustering algorithm we reverted to using
10,000 states and increased the amount of sub-sampling at the clustering stage from a
factor of 10 to a factor of 100, which is equivalent to using trajectories with
conformations stored at a 5 ns interval for this data set. This change compensates for
the tendency of our approximate k-centers algorithm to select outliers as cluster
centers by reducing the number of available data points in the tails of the distribution
of conformations at the clustering stage. Thus, increasing the degree of sub-sampling
at our clustering stage focuses more clusters in dense regions of phase space where
more of the relevant dynamics are occurring. The remaining data can then be assigned
to these clusters, so no data is thrown out entirely. Incorporating the remaining data in
this manner will tend to enlarge clusters on the periphery of phase space because they
will absorb data points in the tails of the distribution of conformations. More central
clusters, on the other hand, will tend to stay approximately the same size. The number
of data points in every cluster should increase though, allowing better resolution of the
transition probabilities from each state.
A very simple kinetically inspired clustering scheme could be implemented by
sub-sampling to select N evenly spaced conformations (in time) as cluster centers. In
this case a large number of clusters would appear in dense regions of phase space
while there would be very few clusters in sparse regions. Our current approach is an
intermediate between such a kinetically inspired clustering and the purely
geometrically defined clustering used in our first two models. It is intended to have
some of the strengths of both approaches—i.e. fine resolution everywhere as in the
geometric approach but even more so in dense regions of phase space as in the kinetic
approach.
In fact, sub-sampling more at the approximate k-centers clustering stage and
then assigning the remaining data to these clusters does improve the structural,
thermodynamic, and kinetic properties of the model. Based on our experience with
this data set and a few others (RNA hairpins and small peptides, data not shown) a
good starting point is to sub-sample such that 10N conformations are used to generate
N clusters and conformations used during the clustering are separated by at least 100
ps. The remaining data should then be assigned to these clusters. The degree of sub-
sampling and number of clusters may then be adjusted to improve the model as
necessary as the optimal parameters will depend on the system. In particular, the
optimal strategy may be quite different for much smaller or larger systems.
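The starting-point heuristic can be turned into a small helper that picks a sub-sampling stride. This is only one reading of the rule of thumb: the function name and arguments are hypothetical, and when the two conditions conflict this sketch favors the minimum spacing, so fewer than 10N conformations may be used.

```python
import math


def clustering_stride(n_conformations, n_clusters, snapshot_ps,
                      min_spacing_ps=100.0, per_cluster=10):
    """Stride giving roughly per_cluster * N conformations spaced >= min_spacing_ps."""
    # stride that leaves about per_cluster conformations for each cluster
    stride_for_count = n_conformations // (per_cluster * n_clusters)
    # stride that separates retained snapshots by at least min_spacing_ps
    stride_for_spacing = math.ceil(min_spacing_ps / snapshot_ps)
    return max(1, stride_for_count, stride_for_spacing)
```

For a hypothetical data set of 1,000,000 conformations stored every 50 ps and 10,000 clusters, the helper returns a stride of 10. The result is a first guess to be refined using the structural, thermodynamic, and kinetic checks described in this chapter.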
Structural agreement: Figure 3C shows that our new model has state centers
with RMSDs as low as 3.4 Å, which is somewhat higher than the 30,000 state model
but better than the original model. Examination of randomly selected structures from a
number of states revealed that the microstate center is not always a good
representative of the state. In particular, some near-native states have a dense pocket
of very low RMSD conformations and a handful of outliers. In such cases our
approximate k-centers clustering algorithm will select a conformation in between the
dense pocket of low RMSD conformations and the outliers (41) when really a structure from
the denser region would be more representative of the state. A further improvement in
the structural characterization of the model is made possible by calculating the average
RMSD over five randomly selected conformations from each state instead of just the
state center, as shown in Figure 3D. This analysis reveals that the most native-like
state has an average RMSD of about 1.8 Å. To illustrate the agreement between this
state and the crystal structure Figure 5A shows an overlay of three randomly selected
conformations from this state with the crystal structure. An interesting future direction
would be to further validate near-native states by comparing them directly with the
experimental data rather than the model thereof.
Figure 5. Three representative structures for A) the lowest RMSD state in the final model and B) the
most probable state in the final model overlaid with the crystal structure (red). The phenylalanine
core is shown explicitly for each molecule.
Thermodynamic agreement: As discussed in the introduction, we cannot
calculate the equilibrium distribution of villin analytically so we do not have an
absolute reference point to judge our model against. However, there are some
promising features of the thermodynamics of the model that lend it credibility. The
most populated state has about 4% of the total population and has an average RMSD
of 2.3 Å. Figure 5B illustrates the agreement between three random conformations
from this state and the crystal structure. The state with the lowest average RMSD also
has the fifth highest population, which is about 2% of the total population, and about
12% of the conformations are in states with average RMSD values less than 3 Å.
There is also a reasonable correlation between the RMSD and the free energy, as
shown in Figure 3D. Our results seem to be robust with respect to the method used for
calculating the equilibrium distribution as well, as discussed in Appendix A. Finally,
the populations from the MSM are consistent with those from averaging over the raw
data in successive windows of the simulation time, indicating that the MSM
thermodynamics are in agreement with the underlying potential if not experiment (data
not shown).
Here it is important to note that none of the simulations were started from the
native state. While this is not formally a blind prediction (since the crystal structure
has been previously reported (57)), it is promising that so many simulations folded
under the given potential, allowing one to not merely reach the folded state but predict
its structure ab initio. It will be interesting to see if this procedure can yield similar
results in a blind prediction, or at least when structural criteria are not used as a basis
for adjusting the model as in this work.
Kinetic agreement: Another promising feature of this model is that there are no
fewer than 12 data points in every state, indicating that this model may be able to
better resolve the transition probabilities for most states. In fact, the implied timescales
for this model do seem to level off as the lag time is increased. Figure 6A shows that
the longest timescales level off at a lag time of about 15 ns but increase moderately at
longer lag times. Figure 6B, however, shows that the implied timescales are level
within error from 15 to 60 ns. After about 35 ns there is an increase in the statistical
uncertainty in the implied timescales, explaining their apparent growth in Figure 6A.
After 60 ns the statistical uncertainty becomes enormous so implied timescales beyond
this point are not shown. Thus, this model appears to be Markovian at lag times of 15
ns and beyond.
Figure 6. Top ten implied timescales for the final model. A) The implied timescales at intervals of one
ns. B) The implied timescales with error bars obtained by doing five iterations of bootstrapping at
an interval of five ns.
The longest implied timescale for this model is about 8 μs. While this is quite
long relative to the experimentally determined folding time of 720 ns at 300 K (56), it is
consistent with previous simulation work suggesting that the experimental
measurements may be monitoring structural properties which relax faster than the
complete folding process (58). In that study, the authors found that a surrogate for the
experimental observable was consistent with the experimental measurements but that
longer timescales on the order of 4 μs were present when monitoring the relaxation of
a more global metric for folding. Ensign et al. also found timescales as high as ~50 μs
by applying a maximum likelihood estimator to a subset of the data with little folding.
While this timescale is much longer than any of the implied timescales in our MSM, it
is not inconsistent with our model because the rates for transitioning between some
states in an MSM, when fit using a two-state kinetics assumption, may be slower than
the implied timescales. Ensign et al. likely identified one of these slow rates by
focusing on a subset of the data. For a more detailed discussion of this topic with a
simple example see Appendix B.
The components of the left eigenvector corresponding to the longest timescale
give information about what is occurring on this timescale. That is, states with positive
eigenvector components are interchanging with states with negative components and
the degree of participation in this aggregate transition is given by the magnitude of the
components (9). Figure 7 demonstrates that the longest timescale in our model does
correspond to folding by showing that it corresponds to transitions between high and
low RMSD states. Numerous states do not participate strongly in this transition,
explaining the streak of points with eigenvector components near zero.
Figure 7. The average RMSD of each state in the final model versus its left eigenvector component in
the longest timescale transition showing that this transition corresponds to folding.
For further confirmation that the MSM is an accurate model of the simulation
data we compared the predicted time evolution of the population of the native state
with the raw simulation data, where the native state was defined as all microstates with
an average Cα RMSD to the crystal structure less than 3 Å. Figure 8 shows that there
is good agreement between the MSM and raw data.
Figure 8. Comparison between the time evolution of the native population in the MSM (blue) and the
raw data (black) for the entire dataset. The error bars represent the standard error.
While the time evolution of state populations is a good test of our MSM, often
we will want to compute the time evolution of some observable to make comparisons
with and predictions of experiments. As an example we compare the predicted time
evolution of the Cα RMSD to the actual time evolution of the RMSD in the raw data
for each of the nine initial configurations. The means by which we calculated the
RMSD from the MSM is described in the Methods section. Measuring the time
evolution of the RMSD from the raw data is simply a matter of measuring the average
RMSD over the simulations started from the given initial structure at every time point.
We also included a reduced representation of the raw data in this comparison. In the
reduced representation each trajectory is represented as a series of states rather than a
series of conformations. The average RMSD at a given time point is then calculated by
averaging the RMSD of the states each of the relevant trajectories is in. It is important
to note that we used the average RMSD across five randomly selected conformations
(and the variance thereof) for each state rather than the RMSD of the state centers in
these comparisons. Just using the RMSD of the state centers resulted in poor
comparisons since they are not truly representative of the state, as discussed above.
Very good agreement (i.e. within the uncertainties of the observables) was
found between all three representations for seven of the nine starting configurations,
an example of which is shown in Figure 9A. In these cases the MSM was found to
capture both the mean and variance of the time evolution of the RMSD to high
precision. The agreement was less strong for the two remaining starting
conformations, as shown in Figure 9B. In these cases the reduced representation
agreed well with the raw data, showing that our states are structurally sufficient to
capture the correct behavior. The mean RMSD from the MSM does not agree as well
with the other two representations, though the true mean is still within the variance of
the prediction from the MSM. Note that this variance, like all the other variances
shown in Figure 9, is due solely to the variance in the RMSD within each state and does
not include any of the statistical uncertainty in the model. Their large magnitude is an
indication of the heterogeneity of villin folding.
Figure 9. Comparison between the time evolution of the RMSD in the MSM (blue), the reduced
representation (yellow), and the raw data (black) for A) an example of good agreement and B) an
example of the worst case scenario. The error bars represent one standard deviation in the RMSD.
The discrepancy between the MSM predictions and the other two
representations for two of the starting structures indicates that our model still has some
subtle memory issues in a subset of the states. Interestingly, the two conformations
where the MSM agreed less well with the raw data were found to be faster folding
than the other seven initial configurations in a previous study (58). It would appear that
the slower folding trajectories are dominating the equilibrium distribution, causing all
the MSM predictions to level off at about 6 Å, which is too high for the two fast
folding initial configurations. Similar results were found with other observables, such
as the distance between the Trp23 and His27 residues that was previously used as a
surrogate for the experimental observable used to measure the folding time (58) (data
not shown).
REMAINING ISSUES
The most probable cause of any subtle memory issues in our model is the existence of
internal barriers within some states. As discussed previously, a state with a sufficiently
high internal barrier could cause transition probabilities from that state to depend on
the identity of the previous state. In particular, simulations started from one initial
configuration could tend to enter and exit a state in one way while simulations started
from a different initial configuration could tend to enter and exit the same state in a
completely different way.
To test for the existence of internal barriers we calculated independent MSMs
for each initial configuration. Each of these MSMs used the same state definitions;
however, only simulations started from the given starting conformation were used to
calculate the transition probabilities between states. All of these models agreed well
with the raw data. For example, Figure 10 shows good agreement for the starting
structure previously used as an example of the poorest agreement between the full
model and the raw data (shown in Figure 9B).
Figure 10. Improved agreement between the MSM and raw data for the example of poor agreement
from Figure 9B obtained by building the transition probability matrix from simulations started from
this starting structure alone. The error bars represent one standard deviation in the RMSD.
This improved agreement indicates that some states do indeed have internal
barriers. Moreover, the seven conformations for which the full model best reproduced
the raw data probably have the same behavior in these states while the two initial
configurations with poorer agreement between the full MSM and the raw data have a
different behavior in these states. The discrepancy then occurs because transition
probabilities for these states in the full MSM will be a weighted average of the two
types of behavior. The two starting conformations that contribute less heavily to this
weighted average are then captured less well by the full MSM.
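The weighted-average argument can be made concrete with a small sketch. The Python fragment below is not MSMBuilder code: the function names, the toy state sequences, and the lag of one frame are all assumptions chosen for illustration. It counts transitions between discrete states and checks that pooling the counts from two groups of trajectories gives, row by row, a count-weighted average of the per-group transition probabilities.

```python
import numpy as np

def count_transitions(trajs, n_states, lag=1):
    """Count i -> j transitions at the given lag over a list of state sequences."""
    counts = np.zeros((n_states, n_states))
    for traj in trajs:
        for i, j in zip(traj[:-lag], traj[lag:]):
            counts[i, j] += 1
    return counts

def row_normalize(counts):
    """Convert counts to transition probabilities (rows without counts stay zero)."""
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Hypothetical state sequences from two groups of starting conformations.
group_a = [[0, 1, 1, 2, 1, 2, 2], [0, 1, 2, 2, 1, 2, 2]]
group_b = [[0, 1, 0, 1, 0, 0, 1]]

c_a = count_transitions(group_a, 3)
c_b = count_transitions(group_b, 3)
t_a, t_b = row_normalize(c_a), row_normalize(c_b)
t_full = row_normalize(c_a + c_b)           # model built from all of the data

# Fraction of each row's counts contributed by group A.
w_a = c_a.sum(axis=1) / (c_a + c_b).sum(axis=1)
```

In this toy case a group that contributes few counts to a state has little influence on that state's pooled transition probabilities, which is the situation described above for the two fast folding starting conformations.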
In an attempt to address this problem we tried increasing the number of states
to 30,000. This model may have had some structural advantages and given a slightly
lower Markov time; however, it still suffered from the same subtle memory issues as
the 10,000 state version (data not shown). Models with even more states were not
attempted as they would greatly increase the number of states with very few counts
and, therefore, increase uncertainty in the model. These issues may be resolved by
identifying those states with internal barriers and splitting them further. However, such
hand-tuning is beyond the scope of this work, which focuses on the performance of
automated procedures for constructing MSMs.
CONCLUSIONS
Our analysis of the villin headpiece shows that the automated construction of MSMs
using MSMBuilder is now at a point where it can be applied to full protein systems, a
step beyond the small peptides that have been studied in the past(6, 68). This advance
was made possible by the proper application of our approximate k-centers clustering
algorithm. A naïve application of this algorithm to a molecular simulation dataset may
result in a mediocre state decomposition because outliers in sparse regions of phase
space are likely to be selected as cluster centers. To compensate for this tendency, one
can sub-sample at the clustering stage, effectively disregarding many of the outliers
and focusing the clusters in more relevant regions of conformational space. Data not
included in the clustering phases may then be assigned to the resulting model to
maximize the use of the available data. General guidelines for applying this result are
given in Section C of the Results & Discussion.
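The sub-sample, cluster, then assign procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the approximate k-centers implementation in MSMBuilder: it uses a plain greedy k-centers loop, Euclidean distance on random two-dimensional points in place of RMSD between conformations, and a hypothetical sub-sampling stride of 10.

```python
import numpy as np

def k_centers(points, k, seed=0):
    """Greedy k-centers: repeatedly pick the point farthest from all current centers."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(points)))]
    dist = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))               # farthest point becomes a center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[centers]

def assign(points, centers):
    """Assign every point to its nearest center."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(d, axis=1)

rng = np.random.default_rng(1)
data = rng.normal(size=(5000, 2))                # toy stand-in for conformations
stride = 10                                      # hypothetical sub-sampling interval
centers = k_centers(data[::stride], k=20)        # cluster on the sub-sample only
labels = assign(data, centers)                   # then assign all of the data
```

Because the farthest-point rule favors outliers, clustering only every tenth point reduces the chance that a rare conformation is chosen as a center, while the final assignment step still uses every data point.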
To demonstrate that our MSM is a reasonable map for villin’s underlying free
energy landscape, we showed that it is capable of accurate structure prediction and its
thermodynamics and kinetics are consistent with the raw simulation data. Thus, we
have laid a foundation for implementing an automated adaptive sampling scheme
capable of constructing models with the minimum possible computational cost. The
fact that our model captures both the mean behavior and heterogeneity of villin folding
will also allow for more accurate comparisons with experiments and predictions of
other experimental observables in a future work on the biophysics of villin folding. By
applying this methodology to multiple systems we hope to understand general
principles of protein folding. Of course, there is still room for improvement. Future
work on estimating reversible transition matrices from simulation data, clustering,
adaptive sampling, and exploring the connections between MSMs and Transition Path
Sampling (TPS)(33, 69) could extend the accuracy and applicability of MSMBuilder.
CHAPTER 3: MOLECULAR SIMULATION OF AB INITIO PROTEIN FOLDING
FOR A MILLISECOND FOLDER NTL9(1-39)
This chapter was taken from: Voelz VA, Bowman GR, Beauchamp KA, & Pande VS
(2010) Molecular simulation of ab initio protein folding for a millisecond folder
NTL9(1-39). J Am Chem Soc 132:1526-1528.
ABSTRACT
To date, the slowest-folding proteins folded ab initio by all-atom molecular dynamics
simulations have had folding times in the range of nanoseconds to microseconds. We
report simulations of several folding trajectories of NTL9(1-39), a protein which has a
folding time of ~1.5 milliseconds. Distributed molecular dynamics simulations in
implicit solvent on GPU processors were used to generate ensembles of trajectories
out to ~40 µs for several temperatures and starting states. At a temperature less than
the melting point of the forcefield, we observe a small number of productive folding
events, consistent with predictions from a model of parallel uncoupled two-state
simulations. The posterior distribution of the folding rate predicted from the data
agrees well with the experimental folding rate (~640/sec). Markov State Models
(MSMs) built from the data show a gap in the implied time scales indicative of two-
state folding, and heterogeneous pathways connecting diffuse mesoscopic substates.
Structural analysis of the 14 out of 2000 macrostates transited by the top ten folding
pathways reveals that native-like pairing between strands 1 and 2 only occurs for
macrostates with pfold > 0.5, suggesting β12 hairpin formation may be rate-limiting.
We believe that using simulation data such as these to seed adaptive resampling
simulations will be a promising new method for achieving statistically converged
descriptions of folding landscapes at longer time scales than ever before.
INTRODUCTION
A complete understanding of how proteins fold, i.e. self-assemble to their biologically
relevant “native state,” remains an unattained goal (70). Computer simulation,
validated by experiment, is a natural means to elucidate this. There is over a million-
fold range in folding rates, suggesting a possible diversity in mechanisms between
slow and fast folding proteins (71). Very fast (microsecond timescale) folding proteins
(56, 72) appear to fold via a large number of heterogeneous, parallel paths (58, 73,
74), potentially key for folding on such fast timescales. Does the folding of much
slower proteins change this picture?
To date, the slowest-folding proteins folded ab initio by all-atom molecular
dynamics simulations with fidelity to experimental kinetics have had folding times in
the range of nanoseconds to microseconds. These include the designed mini-protein
Trp-cage (~4.1 µs) (75), the villin headpiece domain (~10 µs) (76), a fast-folding
variant of villin (<1 µs) (58), and Fip35 WW domain (~13 µs) (77). In this
communication, we report simulations of several folding trajectories, each from fully
unfolded states, of the 39-residue protein NTL9(1-39), which experimentally has a
folding time of ~1.5 milliseconds (78).
MATERIALS & METHODS
Trajectories were simulated via the Folding@Home distributed computing platform
(79) at 300K, 330K, 370K and 450K from native, extended, and random-coil
configurations using an accelerated version of GROMACS written for GPU
processors (80), for an aggregate time of 1.52 ms. GPUs play a key role here, allowing
for dramatically longer trajectories than previously possible. The AMBER ff96
forcefield (60) with the GBSA solvation model (81) was used, a combination
previously shown to give good results folding Fip35 WW domain (77), and shown to
exhibit a good balance of native-like secondary structure for a set of small helical and
beta sheet peptides studied by replica exchange (82).
RESULTS & DISCUSSION
PREDICTION OF AB INITIO FOLDING AND FOLDING RATES
We find that the native state (taken from the N-terminal domain of the crystal structure
of ribosomal protein L9 (83)) is stable in this forcefield at 300K, exhibiting decreasing
stability with increasing temperature (Figure 11a). RMSD-Cα distributions after 10 µs
show well-defined native and collapsed unfolded basins near 3Å and 5Å, respectively.
Of the ~3000 trajectories started from unfolded (extended and coil) states at 370K
(Figure 11b), two reach an RMSD-Cα < 3.5Å and eight reach an RMSD-Cα < 4Å. No
productive folding trajectories were observed at lower temperatures, consistent with
the enhancement of the forward folding rate with temperature expected from Arrhenius kinetics. Higher temperature
trajectories (450K) exceed the melting temperature of NTL9 in the forcefield.
The observed number of folding events n is consistent with expectations from
a simple model of parallel uncoupled folding simulations (84) in which folding is
modeled as a two-state Poisson process: <n> = ∫M(t)k exp(-M(t)kt)dt, where M(t) is the
number of simulations that reach time t (Figure 11b) and k is the experimental folding
rate (~640/sec) (78). This theory predicts (on average) ~1.8 folding trajectories for the
amount of sampling performed, in agreement with the two folding trajectories found in
practice. Posterior distributions of folding rates given the amount of simulation time
and number of folding trajectories were computed using a Bayesian approach (85),
which yield expectation values within an order of magnitude of the experimental
folding rate.
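One simple way to obtain such a posterior, offered here only as a sketch and not necessarily the method of (85), is to treat folding as a Poisson process with rate k. Observing n events in a total unfolded-state sampling time T gives a likelihood proportional to k^n exp(-kT), and a uniform prior on k then yields a Gamma posterior with shape n+1 and rate T. The total time used below is a hypothetical placeholder, not a value taken from the simulations.

```python
import math

def rate_posterior_pdf(k, n_events, total_time):
    """Gamma(n+1, T) density for rate k: uniform prior times Poisson-process likelihood."""
    log_pdf = ((n_events + 1) * math.log(total_time) + n_events * math.log(k)
               - k * total_time - math.lgamma(n_events + 1))
    return math.exp(log_pdf)

def rate_posterior_mean(n_events, total_time):
    """Posterior mean of the rate under the uniform prior."""
    return (n_events + 1) / total_time

n_events = 2            # folding events at the 3.5 Å threshold, as in the text
total_time = 3.0e-3     # seconds of unfolded-state sampling; HYPOTHETICAL value
mean_rate = rate_posterior_mean(n_events, total_time)   # in events per second
```

A different prior changes only the shape parameter of the Gamma distribution, so with so few events the choice of prior visibly shifts the posterior.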
Figure 11. (a) Distributions of RMSD-Cα for native-state simulations of NTL9(1-39) after 10 µs. The
arrows indicate thresholds defined for the native basin at 3.5Å and 4Å. (b) The number of parallel
simulations M(t) started from unfolded states at 370K that reach time t. (c) Posterior predictions of
the folding rate given the amount of simulation time and observed folding events for 3.5Å (dashed)
and 4Å (solid) thresholds, using uniform (black) and Jeffreys (gray) priors, using methods from
(85). In red is a Gaussian distribution representing the experimental rate mean and standard
deviation.
In addition to native-like conformations, we see near-native configurations,
which show heterogeneity in hydrophobic packing, most notably in alternative side
chain arrangements in the beta-sheet structure (Figure 12). Most common of these is a
non-native hydrophobic core involving residues I4, I18 and I37 (which normally
contact the C-terminal helix in the full-length protein) with F5 solvent-exposed.
INSIGHT INTO FOLDING MECHANISMS
In order to describe the kinetics and mechanistic aspects of folding, we employ a new
paradigm for sampling the global free energy landscape of folding, using Markov
State Models (MSMs). MSM approaches, by automatically identifying a set of
kinetically metastable states (such as foldons (86)) and efficiently sampling transitions
between these states, can model long-timescale kinetics from much shorter trajectories
(3, 6, 37, 54).
Our strategy for simulating slow-folding proteins is first to generate an initial
series of kinetically connected states from both the folding and unfolding directions,
and then to use adaptive resampling techniques (12) to produce statistically converged
estimates of metastable basins and the transition rates between them. In the remainder
of this communication, we report progress toward the first goal, by constructing an
MSM from the entire set of 370K trajectory data (4, 10), which we will use to seed
future rounds of transition sampling. While additional rounds of adaptive sampling
could likely aid in increasing the quantitative power of this model, there are several
notable observations which can be made with the current data set.
Figure 12. (a) A snapshot from a folding trajectory (dark blue) achieves an RMSD-Cα of 3.1Å
compared to the native state (cyan). (b) Non-native (top) and native-like (bottom) hydrophobic core
arrangements observed in low-RMSD conformations of folding trajectories. Highlighted are
sidechains of residues F5 (magenta), V3,V9,V21 (tan), and L30,L35 (pink).
Key to accurately identifying metastable states is the clustering of trajectory
conformations into microstates fine-grained enough to be used for lumping into
groups of maximally metastable macrostates (10). 100,000 microstate clusters were
calculated using an approximate k-centers algorithm (42), each with an average radius
of 4.5Å RMSD-backbone. Lag times ranging from 1 to 32 ns were used to build a
series of MSMs. The implied time scales predicted by these models (obtained by
diagonalizing the rate matrix) show a clear spectral gap separating the slowest
relaxation time scale from the rest, indicative of single-exponential kinetics (see
Figure 52). The implied time scale of the model levels off beyond a lag time of ~10 ns
to an implied time scale of ~1 ms, close to the experimental folding time.
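For reference, implied time scales are commonly computed from the eigenvalues λ_i of a transition probability matrix estimated at lag time τ, using t_i = -τ / ln λ_i. The sketch below assumes that convention and uses a hypothetical three-state matrix; it is not the 100,000-microstate model described here.

```python
import numpy as np

def implied_timescales(transition_matrix, lag_time):
    """Implied time scales t_i = -lag / ln(lambda_i) for the non-stationary eigenvalues."""
    eigvals = np.linalg.eigvals(transition_matrix)
    eigvals = np.sort(eigvals.real)[::-1]        # descending; the first is 1
    lam = eigvals[1:]
    lam = lam[lam > 0]                           # only positive eigenvalues give a real time scale
    return -lag_time / np.log(lam)

# Hypothetical three-state transition matrix at a lag time of 10 ns.
T = np.array([[0.98, 0.02, 0.00],
              [0.05, 0.90, 0.05],
              [0.00, 0.01, 0.99]])
timescales_ns = implied_timescales(T, lag_time=10.0)
```

Repeating this calculation for matrices built at several lag times and plotting the slowest time scale against the lag time gives the kind of leveling-off test described above.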
An important strength of MSMs is their ability to gain insight at coarser scales
by “lumping” the kinetic transitions into a simpler model with fewer states. To gain a
mesoscopic view of the folding free energy landscape, we lumped our 100,000-
microstate MSM into a 2000-macrostate model. In this view, we find that the
metastable states are diffuse collections of conformations over which multiple possible
folding pathways can occur, indicating a vast heterogeneity of folding substates that
need to be understood in greater detail. At the same time, we can identify highly
populated “native” (state n) and “unfolded” (state a) macrostates that dominate the
observed relaxation rates (Figure 13 and Figure 53).
The ten pathways with the highest folding flux from macrostate a to n were
calculated by a greedy backtracking algorithm (see Appendix C) from the macrostate
transition matrix using transition path theory (5, 87) (TPT). The diversity of pathways
demonstrates the power of the MSM approach: although we observe only a few
folding trajectories directly, a network of many possible pathways can be inferred
from the overlapping sampling of local transitions.
While NTL9(1-39) folds quickly for a two-state folder, it is similar in size to
many ultrafast (sub-millisecond) folders that appear to exhibit so-called “downhill”
folding. Hence, we would like to understand the structural features that limit the
overall folding rate. As in a macroscopic two-state model, the highest-flux pathways
in our mesoscopic model are a→m→n and a→l→n, direct routes from disordered to
structured macrostates, reminiscent of nucleation-condensation. These pathways by
themselves, however, account for only ~10% of the total flux, and the structural
diversity seen in all pathways is reminiscent of more hierarchical folding models such
as diffusion-collision. Thus, we sought to more fully study the 14 macrostates
transited by the top ten folding pathways.
Figure 13. A 2000-state Markov State Model (MSM) was built using a lag time of 12 ns. Shown is the
superposition of the top 10 folding fluxes, calculated by a greedy backtracking algorithm (see
Appendix C). These pathways account for only about 25% of the total flux, and transit only 14 of
the 2000 macrostates (shown labeled a-n, for convenient discussion). The visual size of each state
is proportional to its free energy, and arrow size is proportional to the inter-state flux.
To examine structural changes along the folding reaction, we considered three
main native structural elements: the central helix (α), the pairing of strands 1 and 2
(β12), and the pairing of strands 1 and 3 (β13). To quantify the extent of native-like
structuring for each of these elements we calculated Qα, Qβ12 and Qβ13, respectively
(see Appendix C for details). The Q-value is a number between 0 and 1 that quantifies
the extent of native-like contacts. We then examined, for each macrostate, the Q-
values in relation to the pfold value (committor), a kinetic reaction coordinate. The pfold
value is computed from the macrostate transition matrix (5, 37, 87).
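As a minimal sketch of how a committor can be obtained from a transition matrix, the fragment below solves the standard linear system q_i = Σ_j T_ij q_j with q fixed to 0 on the unfolded state and 1 on the native state. The four-state chain and the function name are hypothetical; the macrostate model in this work has 2000 states.

```python
import numpy as np

def committor(transition_matrix, source, sink):
    """Solve q_i = sum_j T_ij q_j with q = 0 on the source states and q = 1 on the sink states."""
    n = transition_matrix.shape[0]
    A = np.eye(n) - transition_matrix
    b = np.zeros(n)
    for i in source:                 # boundary condition q = 0
        A[i, :] = 0.0
        A[i, i] = 1.0
        b[i] = 0.0
    for i in sink:                   # boundary condition q = 1
        A[i, :] = 0.0
        A[i, i] = 1.0
        b[i] = 1.0
    return np.linalg.solve(A, b)

# Hypothetical four-state chain: state 0 is "unfolded" and state 3 is "native".
T = np.array([[0.8, 0.2, 0.0, 0.0],
              [0.1, 0.8, 0.1, 0.0],
              [0.0, 0.1, 0.8, 0.1],
              [0.0, 0.0, 0.2, 0.8]])
pfold = committor(T, source=[0], sink=[3])
```

For this symmetric chain the committor rises linearly from the unfolded to the native state, so a state with pfold > 0.5 is one that is more likely to fold than to unfold.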
This analysis yields several key insights into the folding mechanism of
NTL9(1-39) on the mesoscale. We find the “unfolded” state a is compact, and
contains a baseline level of residual native-like structure, with Qα near 0.5, and Qβ12
and Qβ13 near 0.2. In general, across the 14 macrostates studied, Q-values increase as
pfold values increase, although the relative balance of Qα, Qβ12 and Qβ13 varies,
indicating pathway heterogeneity: i.e. native-like structures can form in different
orders (Figure 14, Figure 55, Figure 56). An exception to this, however, is observed
for β12 strand pairing. Only for macrostates with pfold > 0.5 (states g-n) does
appreciable β12 strand pairing occur (Figure 15). This suggests that the formation of a
local strand pair (β12), rather than a nonlocal strand pair (β13), is rate-limiting. This
effect is not predicted by strictly topological models of folding in which loop closure
entropy loss dominates (88), but instead may result from sequence-specific details.
Unlike the β13 strand pair, which has a small interaction surface stabilized by
hydrophobic contacts, the β12 hairpin contains seven of the protein’s eight lysine
residues, and three of its five glycine residues in a flexible loop region, features which
may imbue β12 with larger barriers to folding. This proposed role of β12 is also
consistent with the large changes in kinetics and stability seen experimentally for
mutations in the β12 hairpin (78).
Figure 14. The 14 macrostates involved in the top ten folding pathways, plotted along structural and
kinetic reaction coordinates. The balance between native-like helix and sheet structure is quantified
by Qα – (Qβ12 + Qβ13)/2 (vertical axis), and progress along the folding reaction is quantified by the
pfold (committor) value (horizontal axis). It can be seen that the “unfolded” state (a) contains
residual native-like helical propensity, and that pathways involving various ordering of native-like
helix and sheet formation are possible.
Figure 15. Q-values, which capture the extent of native-like structures, plotted versus pfold (committor)
values. The lines are to guide the eye.
It is natural to compare our results with previous unfolding simulations of
NTL9(1-39) K12M by Snow et al. (89). In that work, a detailed characterization of the
transition state ensemble required the definition of strand-pairing reaction coordinates
corresponding to β12 and β13 formation. In our MSM analysis, no such pre-definition
is required. Snow et al. also note the difficulty in resolving kinetic intermediates not
captured by the chosen order parameters. Indeed, our structural analysis can resolve
subtle kinetic intermediates within the native basin, corresponding to alternative
rearrangements of the β12 hairpin loop (Figure 57).
CONCLUSIONS
The above results suggest that existing forcefield models using implicit solvent are
indeed accurate enough to fold proteins ab initio at long time scales (milliseconds),
opening the door to simulating more structurally complex proteins. Moreover, our
work demonstrates that there need not be a single pathway or single, dominant
mechanism for the folding of a given protein: since the theories proposed for how
proteins fold are based on broadly relevant physical principles, it is natural to imagine
that multiple mechanisms could be simultaneously present, but that the sequence of
the protein, coupled with the chemical environment, would control the extent to
which each mechanistic pathway is seen.
CHAPTER 4: PROTEIN FOLDED STATES ARE KINETIC HUBS
This chapter was taken from: Bowman GR & Pande VS (2010) Protein folded states
are kinetic hubs. Proc Natl Acad Sci U S A 107:10890-10895.
ABSTRACT
Understanding molecular kinetics, and particularly protein folding, is a classic grand
challenge in molecular biophysics. Network models, such as Markov State Models
(MSMs), are one potential solution to this problem. MSMs have recently yielded
quantitative agreement with experimentally derived structures and folding rates for
specific systems, leaving them positioned to potentially provide a deeper
understanding of molecular kinetics that can lead to experimentally testable
hypotheses. Here we use existing MSMs for the villin headpiece and NTL9, which
were constructed from atomistic simulations, to accomplish this goal. In addition, we
provide simpler, humanly comprehensible networks that capture the essence of
molecular kinetics and reproduce qualitative phenomena like the apparent two-state
folding often seen in experiments. Together, these models show that protein dynamics
are dominated by stochastic jumps between numerous metastable states and that
proteins have heterogeneous unfolded states (many unfolded basins that interconvert
more rapidly with the native state than with one another) yet often still appear two-
state. Most importantly, we find that protein native states are hubs that can be reached
quickly from any other state. However, metastability and a web of non-native states
slow the average folding rate. Experimental tests for these findings and their
implications for other fields, like protein design, are also discussed.
INTRODUCTION
Molecular kinetics has fascinated biophysicists and biochemists for decades. From a
biophysical point of view, it remains a mystery how systems with so many possible
configurations can self-organize with such specificity and rapidity, carry out catalysis,
and trigger signaling cascades. From a biomedical standpoint, protein misfolding
causes many debilitating diseases, including Alzheimer’s, Huntington’s, and
Parkinson’s diseases (90). Understanding how proteins fold is a logical first step in
understanding how they misfold and, more importantly, how to prevent or recover
from misfolding; indeed, this approach is already proving valuable (40). Furthermore,
a better understanding of protein folding mechanisms could lead to more efficient
structure prediction (91, 92), for use in high throughput proteomics and studies of
systems that defy experimental characterization, and better models for molecular
kinetics could aid in computational drug and protein design.
What would the ultimate theory of molecular kinetics look like though? A
natural way of answering this question is by analogy to well established theories, such
as Schrödinger’s equation in the successful field of quantum mechanics. On the one
hand, computational solutions to Schrödinger’s equation have yielded quantitative
agreement with and prediction of experimental observables. However, equally
important is this theory’s ability to yield insight into simple systems, such as the
particle in a box, for the purposes of gaining an intuition for fundamental principles,
like the quantization of energy and the role of molecular orbitals. Likewise, the
ultimate theory of molecular kinetics should be capable of scaling from sophisticated
models capable of quantitatively predicting experiments to simple models which yield
mechanistic insight. At even the most fundamental levels of this hierarchy, such a
theory ought to be at least qualitatively consistent with experimental observations and
be capable of generating experimentally testable hypotheses. In particular, such a
theory ought to provide insight into protein folding as success in describing such
drastic conformational changes would be evidence for the theory’s ability to describe
less extreme ones.
We propose that networks of metastable, or long-lived, states (4, 9, 33, 55)
could fulfill this role because they are implicit in even the simplest protein folding
models; examples include U↔N and U↔I↔N where U is the unfolded state, I is an
intermediate, and N is the native state. Networks called Markov State Models (MSMs)
make these implicitly considered properties explicit and have the potential to provide
complete maps of a protein’s free energy landscape, with nodes corresponding to
metastable states (or free energy basins) and edges representing the probabilities of
transitioning between pairs of these states (3, 4, 6, 9, 33, 55).
A number of recent works have provided validation for these networks by
showing that they can yield quantitative agreement with experimentally derived
structures and folding rates (4, 5, 12, 93). In particular, the predicted native state from
our villin model (based on calculated free energies) had an RMSD to the crystal
structure of ~1.8 Å (4). The model also correctly predicted quantitative details of the
kinetics, such as the absolute folding rate (to logarithmic accuracy). This degree of
accuracy in predicted free energies, structures, and rates is crucial as all experimental
measurements are functions of these properties. In all, the agreement between theory
and experiment leads us to the conclusion that our models provide a sufficiently
accurate reflection of reality.
To further flesh out this potential theory of molecular kinetics, we have delved
into the nature of the free energy landscapes of the villin headpiece (HP-35 NleNle)
(56) and a 39 residue fragment of NTL9 (78). Furthermore, because complex networks
for real systems are difficult to comprehend, we construct simple, generic models that
capture qualitative phenomena like apparent two-state folding and provide an intuition
for molecular kinetics. Together, these models allow us to assess existing theories,
which describe folding as a two-state process characterized by cooperative transitions
across a dominant free energy barrier separating a rapidly mixing unfolded ensemble
from the native state (94, 95).
The remainder of this paper will be organized around three key results. First,
protein free energy landscapes can yield apparent two-state behavior even in the
absence of a single dominant barrier. Second, protein unfolded states are
heterogeneous, having multiple basins that interconvert more rapidly with the native
state than one another. Third, protein native states are kinetic hubs: it is possible to
reach them relatively quickly from anywhere in a network but it is also possible to get
stuck in a web of non-native states.
RESULTS & DISCUSSION
APPARENT TWO-STATE BEHAVIOR CAN OCCUR IN THE ABSENCE OF A KINETICALLY
RELEVANT TWO-STATE DECOMPOSITION.
Many proteins appear to fold via a single cooperative transition from a rapidly mixing
ensemble of unfolded conformations to a well defined native structure (94, 96).
However, based on chemical intuition, one would expect to find many more
metastable states, corresponding to the numerous favorable interactions that could
form in the absence of the full native structure as well as dynamics within the native
state. To reconcile these points, one typically assumes a single dominant free energy
barrier that serves as the rate limiting step for folding. Other barriers are often
assumed to be small relative to the thermal energy (or at least to the dominant barrier)
and the equilibrium probability of any intermediate is assumed to be too small to
detect.
However, in some cases modeling experimental data requires the use of at least
three states (97-99) and simple toy models have shown that even three-state systems
can yield apparent two-state behavior (100). Thus, it is natural to hypothesize that
many systems may have more complex arrangements of metastable states (9, 10, 101)
yet still exhibit apparent two-state behavior.
To test this hypothesis, we first turn to an MSM for the villin headpiece. This
MSM was recently built from atomistic simulations and, by assuming stochastic jumps
between its states, was shown to give quantitative agreement with experimental
structures and folding rates in addition to recapitulating the raw simulation data (4).
Thus, the presence of numerous metastable states in this model would be strong
evidence for their actual existence and the stochastic nature of transitions between
them. Indeed, with a lag time on the order of 10 ns, analysis of this MSM reveals the
existence of at least 500 metastable states. At least 2,000 are found for NTL9 (93).
The free energy barriers between our villin states have an average height of about 5.9
(+/- 2.5) kT (see Appendix D for details), indicating that they are non-trivial and
potentially detectable. Moreover, no single dominant barrier is apparent.
To better understand the system specific results from our all-atom models, we
now consider three simple models for dynamics capable of providing insight into
protein folding in general. Each of these networks has six metastable states and is
depicted in Figure 16. These models have a single folding pathway (S), parallel
folding pathways (P), and a heterogeneous unfolded state (H, with multiple unfolded
basins that each interconvert more rapidly with the native state than with one another)
as discussed in the Materials & Methods section.
Figure 16. Three representative networks each having unfolded state(s) (U and Ui), intermediates (Ii),
and a native state (N). S has a single pathway, P has parallel pathways, and H has a heterogeneous
unfolded state.
One may be tempted to associate the states in these models with folding nuclei
(102), pre-organized secondary structure (103), foldons (104), or the elements of some
other model of protein folding (53). However, we simply require that they all be
metastable. That is, a system within one state is more likely to stay there than to
transition to a different state. Moreover, we propose that the concept of metastability
unifies many of the previously proposed folding mechanisms, each of which describes
some systems better than others, as all consist of basic units that are stable on some
timescale.
We can now imagine monitoring stochastic transitions within each of these
representative systems (or ensembles thereof) with a device that can only detect the
native state. This hypothetical setup is equivalent to experiments wherein unfolded
molecules are allowed to relax to an observable folded state where they are trapped to
prevent unfolding and refolding. Figure 17 shows that such an experiment yields the
exponential behavior typical of an ideal two-state system. In fact, exponential fits to
the data after the initial lag phase only give slight underestimates of the true Mean
First Passage Times (MFPTs) between the unfolded and folded states (Table 1). Thus,
even these simple systems are qualitatively consistent with both stochastic jumps
between numerous metastable states and apparent two-state behavior. This is
particularly surprising for model H since it cannot be divided into a single, rapidly
mixing unfolded basin separated from the native state by one dominant barrier (i.e. it
is not two-state).
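The "true" MFPTs referred to above can be computed directly from a transition matrix rather than estimated from simulated jumps. The sketch below solves the standard first-passage equations m_i = τ + Σ_{j≠target} T_ij m_j with m_target = 0. The three-state single-pathway network is a hypothetical stand-in for the six-state models S, P, and H, whose actual transition probabilities are not reproduced here.

```python
import numpy as np

def mfpt_to_target(transition_matrix, target, lag_time=1.0):
    """Mean first passage time from every state to `target`.

    Solves m_i = lag + sum_{j != target} T_ij m_j with m_target = 0.
    """
    n = transition_matrix.shape[0]
    keep = [i for i in range(n) if i != target]
    T_sub = transition_matrix[np.ix_(keep, keep)]    # drop the target row and column
    rhs = lag_time * np.ones(len(keep))
    m_sub = np.linalg.solve(np.eye(len(keep)) - T_sub, rhs)
    m = np.zeros(n)
    m[keep] = m_sub
    return m

# Hypothetical three-state single-pathway network U -> I -> N (indices 0, 1, 2).
T = np.array([[0.90, 0.10, 0.00],
              [0.05, 0.85, 0.10],
              [0.00, 0.05, 0.95]])
mfpt = mfpt_to_target(T, target=2)                   # in units of the lag time
```

Comparing such exact values with the time constant of an exponential fit to simulated first folding times is one way to quantify how closely a multi-state network mimics two-state kinetics.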
Figure 17. Distributions of the first folding times for the simple networks S, P, and H are shown in
panels A, B, and C respectively. The blue lines are exponential fits to the data after the initial lag
phase.
A kinetic perspective on our simple networks helps to explain why two-state
behavior is often observed even when there are many large barriers. As discussed
previously, when there is a single dominant rate then faster transitions will tend to be
lost in the noise. Multiple slow rates will also be lost in the noise if they are too
similar. Moreover, this same logic applies even when there are multiple folding routes
from different starting points (and thus no kinetically relevant two-state
decomposition). Thus, observing anything other than mainly single exponential
kinetics requires a delicate balance wherein the slowest rates differ sufficiently to
distinguish them but not so much that one dominates the rest, not to mention
extremely precise measurements.
Fortunately, there is ample evidence that achieving this balance and the
precision necessary to detect it are possible. Multi-exponential behavior is often
consistent with the experimental data, but the data are instead fit to stretched exponentials (105, 106).
Increasing the temporal resolution of single molecule pulling experiments has also
steadily revealed more metastable states, and kinetic measurements can be probe
dependent (107, 108). We propose that the ability to simultaneously monitor multiple
degrees of freedom (such as extension and FRET) in single molecule experiments
would reveal even more metastable states, particularly if MSMs were used to choose
the number of probes employed and their placement.
PROTEINS HAVE HETEROGENEOUS UNFOLDED STATES WITH MULTIPLE BASINS THAT
INTERCONVERT MORE RAPIDLY WITH THE NATIVE STATE THAN EACH OTHER.
We now investigate which of the simple network topologies is most representative of
real protein free energy landscapes. As a first step, we have calculated that every state
can reach the native basin of our villin model in one or two steps. This eliminates the
possibility of a single pathway since states with that topology could require up to 499
steps to reach the native basin.
Determining whether the parallel pathway model (95, 109, 110) or the
heterogeneous unfolded state model is more representative of villin requires a
definition of the unfolded state(s). Since every non-native state can reach the native
basin in one or two steps it is natural to label every state that is not directly connected
to the native state (332 in all) as unfolded and all other non-native states (167 in all) as
intermediates.
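The step counts quoted above can be reproduced on toy graphs with a breadth-first search. The sketch below compares a 500-state single pathway with a hub-like graph that reuses the state counts given in the text (1 native, 167 intermediate, and 332 unfolded states); the specific wiring of the hub graph is a hypothetical stand-in, not the villin network itself.

```python
from collections import deque

def steps_to_target(adjacency, target):
    """Breadth-first search: minimum number of jumps from every state to `target`.

    Assumes an undirected (symmetric) adjacency list.
    """
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def chain(n):
    """Single-pathway topology: 0-1-2-...-(n-1), with state 0 as the native state."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def hub(n_intermediate, n_unfolded):
    """Hub topology: native (0) touches every intermediate; each unfolded
    state touches exactly one intermediate (hypothetical wiring)."""
    adj = {0: list(range(1, n_intermediate + 1))}
    for i in range(1, n_intermediate + 1):
        adj[i] = [0]
    for k in range(n_unfolded):
        u = n_intermediate + 1 + k
        i = 1 + k % n_intermediate
        adj[u] = [i]
        adj[i].append(u)
    return adj

chain_max = max(steps_to_target(chain(500), 0).values())
hub_max = max(steps_to_target(hub(167, 332), 0).values())
print(chain_max, hub_max)  # prints: 499 2
```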
Taking this definition, we can now examine the distribution of MFPTs from
each unfolded state to the native state as well as the distribution of MFPTs between all
pairs of unfolded states. Doing so reveals that the average MFPT to the native state is
880 (+/-270) nanoseconds, in reasonable agreement with the experimentally predicted
folding time of 720 nanoseconds (56). Moreover, this value is much lower than the
average MFPT between pairs of unfolded states (~370 microseconds), as shown in
Figure 18A and 18B. Considering every non-native state as part of the unfolded
ensemble also gives similar distributions (Figure 59), implying that these results are
robust to the exact definition of the unfolded state. Similar results are found for NTL9
as well (Figure 60). Thus, we can conclude that the heterogeneous unfolded state
model is most representative of our villin and NTL9 models and possibly proteins in
general. This result is in contrast to existing theories of protein folding, which assume
rapid equilibration within the unfolded ensemble (95, 111, 112).
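For readers unfamiliar with how MFPTs follow from a transition matrix, the sketch below solves the standard linear system m_i = τ + Σ_{k≠target} T_ik m_k on a hypothetical three-state model with two unfolded basins and one native state. The transition probabilities are invented for illustration; only the 15 ns lag time is taken from the villin model.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def mfpt(T, target, lag_time=1.0):
    """Mean first passage time from every state to `target`.

    Solves m_i = 1 + sum_{k != target} T[i][k] * m_k (in lag-time units),
    with m_target = 0, then converts to physical time.
    """
    others = [i for i in range(len(T)) if i != target]
    a = [[(1.0 if i == k else 0.0) - T[i][k] for k in others] for i in others]
    m = solve(a, [1.0] * len(others))
    out = [0.0] * len(T)
    for i, v in zip(others, m):
        out[i] = v * lag_time
    return out

# Hypothetical model: unfolded basins 0 and 1, native state 2.
T = [[0.949, 0.001, 0.050],
     [0.001, 0.949, 0.050],
     [0.010, 0.010, 0.980]]
to_native = mfpt(T, target=2, lag_time=15.0)[0]   # ns, basin 0 -> native
between = mfpt(T, target=1, lag_time=15.0)[0]     # ns, basin 0 -> basin 1
print(f"{to_native:.0f} ns to native, {between:.0f} ns between unfolded basins")
```

In this toy model, as in the heterogeneous unfolded state picture, the passage time to the native state is shorter than the passage time between unfolded basins, because most routes between basins pass through the native state.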
Figure 18. Relaxation of villin from the 500-state model. Distributions of the MFPTs from (A) unfolded
states to the native state and (B) between unfolded states. (C) Relaxation kinetics with a 10:1
signal-to-noise ratio (black curve with Gaussian noise) and a single exponential fit (blue curve with
τ≈810 ns).
Examination of representative structures suggests that non-native interactions
(often in the context of relatively compact conformations) and the enormity of
conformational space are responsible for slow transitions between unfolded basins
(Figure 61). Non-native contacts can easily have free energies on the order of native
contacts, making non-native states reasonably metastable. Once a set of non-native
contacts is broken, the probability of forming a particular set of other non-native
contacts is quite small due to the large number of other possibilities. This small
probability is equivalent to a slow rate. In contrast, evolutionary pressure to fold
makes transitioning to the native state reasonably probable, which equates to fast
folding relative to slower transitions between unfolded basins.
The tight distribution of MFPTs to the native state is also consistent with our
explanation of apparent two-state behavior. Due to experimental noise, it is difficult to
justify using more than one or two exponentials to fit the relaxation of our coarse-
grained villin model with 500 states, as shown in Figure 18C. Only with an extremely
high signal-to-noise ratio can one accurately identify the deviations from single
exponential relaxation shown in Figure 62. We also note that more fine-grained
models for villin can capture the burst phase in its relaxation (Figure 63) but here we
emphasize the ability of our coarse-grained model to capture the apparent two-state
behavior that dominates this system’s relaxation (56).
Our ability to reconcile our model with existing experimental data on the
nature of the unfolded ensemble (specifically under native conditions, as opposed to
the more rapidly mixing denatured state) indicates that more experiments will be
required to definitively falsify or support our conclusions. For example, Nettels et al.
have reported a 50 ns global relaxation time within the unfolded ensemble (113). Our
model, however, would suggest that this may be due to relaxation within individual
unfolded basins, not between them. This hypothesis is consistent with recent
measurements of slow dynamics in the unfolded ensemble from the Lapidus lab (114,
115). Therefore, we suggest that this may be an interesting direction for future
experimental work. In addition to existing methodologies for probing the unfolded
ensemble, single molecule experiments monitoring multiple degrees of freedom could
help to falsify or support our conclusions.
If our heterogeneous unfolded state model is indeed generally true then protein
folding kinetics cannot be accurately described by two states separated by a single
barrier. Instead, folding must be understood in terms of multiple pathways starting
from a number of distinct states. Mixing between pathways adds another layer of
complexity to the folding process. Modeling the effects of mutations will thus require
considering changes in the relative free energies of numerous states and barrier
heights. Understanding the global effects of small changes on networks will likely also
be important for protein design.
A NATIVE HUB ALLOWS RAPID FOLDING BUT PROTEINS CAN STILL GET STUCK IN A WEB
OF NON-NATIVE STATES.
The accessibility of villin’s native state implies the hub-like connectivity characteristic
of small-world and scale-free networks (116, 117). We can test this hypothesis by
counting the number of connections observed between states because only those
transitions with probabilities above some threshold are observed with our finite
sampling (all transitions would be observed with infinite sampling). Examining
subsets of the states independently, one finds that the average degree (or number of
connections) increases as one moves from the unfolded states to the native basin. The
unfolded states have an average degree of 12 while the intermediate states have an
average degree of 25. The native state acts as a hub, connected to 167 other states.
Similar results are found for a small β-sheet peptide (17) and NTL9.
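The degree analysis described above amounts to counting, for each state, how many other states it was observed to exchange with. A minimal sketch follows, assuming a hypothetical five-state count matrix; the numbers and the state labels are invented for illustration and are not the villin data.

```python
def degrees(counts):
    """Number of distinct states each state exchanges with in a transition-count matrix.

    A connection is counted if a transition was observed in either direction.
    """
    n = len(counts)
    return [sum(1 for j in range(n)
                if j != i and (counts[i][j] > 0 or counts[j][i] > 0))
            for i in range(n)]

def average_degree(deg, members):
    """Mean degree over a subset of states."""
    return sum(deg[i] for i in members) / len(members)

# Hypothetical counts: state 0 is native, 1-2 are intermediates, 3-4 are unfolded.
counts = [[90, 4, 3, 1, 1],
          [5, 80, 2, 1, 0],
          [3, 1, 70, 0, 2],
          [1, 1, 0, 60, 0],
          [1, 0, 2, 0, 50]]
deg = degrees(counts)
print("native:", deg[0])
print("intermediates:", average_degree(deg, [1, 2]))
print("unfolded:", average_degree(deg, [3, 4]))
```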
Reduced connectivity between non-native states results in slow dynamics
within the unfolded ensemble. This connectivity contradicts other models, which
predict bottlenecks close to the native state and high connectivity in non-native
regions (95, 110, 112, 118), as depicted in Figure 19A. A more thorough discussion of
the similarities and differences between our model and those proposed previously is
given in the next section.
Figure 19. Schematic diagrams of funnel and native hub models having unfolded states (U),
intermediates (I), and native states (N). (A) A network description of a folding funnel with nodes
corresponding to individual conformations and a bottleneck near the native state. (B) A native hub
model with metastable nodes. The size of each node in (B) is correlated with its equilibrium
probability and the connectivity falls off as one moves away from the native state.
The native hub explains how villin folds so quickly. Just as there are only
about six degrees of separation between people in the US (119), it is possible to reach
villin’s native state in one or two jumps (each 15 ns). Therefore, it is possible to fold
from anywhere in the landscape in 30 ns or less. This result is consistent with recent
experimental work showing that the transition path time between the unfolded and
native ensembles can be as much as four orders of magnitude faster than the average
folding time (120) and likely results from evolutionary pressure to fold quickly.
Due to the kinetic proximity of the native state with a 15 ns lag time, we see
that villin can fold in just 30 ns; however, such trajectories are rare because the
metastability and connectivity of non-native states make taking a direct route to the
native state improbable. Instead, villin will often spend considerable time in a web of
non-native states before finally folding, resulting in an average folding time on the
microsecond timescale. In the future, it will be interesting to test whether slower
folding proteins have unfolded states further from the native one or just more strongly
metastable states, which equates to higher barriers and slower transitions between
states. Preliminary analysis of NTL9 suggests every basin can reach the native state in
5 steps (~100 nanoseconds) or less.
We have also found a rough correlation between the connectivity of states and
their equilibrium probabilities. The average probabilities of unfolded and intermediate
states are ~0.0005 and ~0.004, respectively. The native state has an equilibrium
probability of ~0.2. Figure 19B shows a schematic of a protein folding network that
attempts to capture all of these observations in a humanly comprehensible manner. All
of these observations are in qualitative agreement regardless of the degree of lumping;
that is, whether one uses smaller and more numerous states to capture more local
minima in the landscape or fewer and more voluminous states to obtain an even more
coarse-grained model. While one may be tempted to consider Figure 19B merely an
alternative depiction of a funnel, we emphasize that the kinetic connectivity of the
native state and lack of connectivity within the unfolded ensemble are important
qualitative deviations from traditional funnel theory (95).
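Equilibrium probabilities such as those quoted above are the stationary distribution of the transition matrix. The sketch below obtains it by power iteration for a hypothetical three-state model (two unfolded basins and a native hub); the matrix entries are invented, and production MSM software typically uses an eigensolver instead.

```python
def stationary_distribution(T, n_iter=10000, tol=1e-12):
    """Power iteration for pi satisfying pi = pi T (T row-stochastic and ergodic)."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        new = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

# Hypothetical model: unfolded basins 0 and 1, native hub 2.
T = [[0.949, 0.001, 0.050],
     [0.001, 0.949, 0.050],
     [0.010, 0.010, 0.980]]
pi = stationary_distribution(T)
print([round(p, 3) for p in pi])
```

Here the best-connected state (the native hub) also carries the largest equilibrium probability, mirroring the rough correlation noted above.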
An important methodological consequence of the network topology found here
is that many short, parallel simulations (or experiments) started from arbitrary initial
points are an excellent way of exploring the entire free energy landscape. In the
extreme case of using a single starting point, one could still reach every free energy
basin despite the presence of numerous metastable states so long as each simulation
was longer than the diameter of the network (the minimal time that allows one to reach
any state from an arbitrary starting point). However, reaching every state would be
impossible with simulations that were shorter than the diameter of the network. Thus,
our network theory provides an alternate explanation for the previously noted need to
have simulations longer than some minimal lag phase, which was then attributed to the
need to equilibrate within the unfolded state before folding in two-state systems (121).
Another simple but more efficient strategy would be to start simulations from
multiple conformations dispersed throughout phase space and run them long enough to
ensure mixing between them and coverage of the entire space. In fact, Figure 20 and
Figure 64 show that such a scheme is actually more valuable than a few long
trajectories, using a relative entropy metric for MSMs from Ref (18) to measure the
information content of different datasets relative to our validated villin model.
However, this trend can be seen to break down for simulations that are insufficiently
long or too few as they are unlikely to reach every state or traverse every possible
pathway between pairs of states. The simulation length at which this breakdown
occurs decreases as the number of simulations increases though. Even better
performance can be obtained using adaptive sampling algorithms (18, 19), which
direct sampling to where it is needed most to improve a model.
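To make the idea of a relative entropy between two MSMs concrete, the sketch below uses one common form, D(P||Q) = Σ_i w_i Σ_j P_ij ln(P_ij / Q_ij), where P is the reference model and w_i are its state probabilities. This form is an assumption made for illustration; the exact definition used here (including any pseudocounts) is the one given in Appendix D and Ref (18). The matrices are hypothetical.

```python
import math

def relative_entropy(weights, P, Q):
    """D(P||Q) = sum_i w_i sum_j P_ij ln(P_ij / Q_ij), in nats.

    `weights` are per-state probabilities for the reference model P.
    Terms with P_ij == 0 contribute nothing; Q_ij must be > 0 wherever P_ij > 0.
    """
    total = 0.0
    for w, p_row, q_row in zip(weights, P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0.0:
                total += w * p * math.log(p / q)
    return total

reference = [[0.90, 0.10], [0.20, 0.80]]
weights = [2.0 / 3.0, 1.0 / 3.0]        # stationary distribution of `reference`
poor = [[0.50, 0.50], [0.50, 0.50]]     # e.g. a model built from too little data
good = [[0.88, 0.12], [0.22, 0.78]]     # e.g. a model built from more data

d_poor = relative_entropy(weights, reference, poor)
d_good = relative_entropy(weights, reference, good)
print(f"poor model: {d_poor:.4f} nats, good model: {d_good:.4f} nats")
```

A smaller value indicates that the trial model reproduces the transition probabilities of the reference model more closely.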
Figure 20. Distance between the final villin MSM and MSMs constructed from subsets of the data
(varying trajectory length and number of trajectories). Distance is measured by a relative entropy
metric (see Appendix D for details). Black lines are contours of equal amounts of data. No data was
available for the upper-right portion of the graph.
COMPARISON TO PREVIOUS THEORIES FOR PROTEIN FOLDING.
There is a long history of theoretical models for protein folding (53) so it is important
to put our work in the context of these previous theoretical approaches. In particular,
folding funnel models (95, 112, 118) have dominated much of how the field currently
conceptualizes protein folding and hence it is natural to compare our model to such
theories. One of the most similar funnel categories is type 0B, which is characterized
by overall downhill folding interrupted by a glass transition along the reaction
coordinate (95). While this regime does include slow dynamics between compact
states, it also results in a small number of folding pathways relative to higher
connectivity in the unfolded ensemble. In addition, this and other previous funnel-
based models have explicitly described rapidly interconverting unfolded states, as
reflected in the “bottleneck” discussed in previous works (110, 111), as well as the
choice of structurally-based reaction coordinates like the number of native contacts
(Q) (95, 111), which directly requires that dynamics along orthogonal degrees of
freedom, such as interconversion between unfolded conformations, are rapid compared
to folding. In contrast, we find a large number of folding pathways, slow dynamics
between unfolded states relative to folding, and no glass transition. Our folding rates
are also quite similar, rather than the different rates characteristic of the folding
pathways in type 0B folding.
Other funnel models have recognized the possibility of a large number of
folding pathways (95, 109, 118), but still in the context of fast dynamics within the
unfolded basin relative to slower transitions to the folded state. Some have even gone
so far as to assume global connectivity (122, 123); however, even these emphasize that
local connectivity would dominate in the full dimensional conformational space and
global connectivity only arises when projecting onto a few order parameters.
Furthermore, they argue global connectivity will not give an activation barrier and,
therefore, these models are primarily intended for studies of downhill folding or the
early activationless stages of folding. Our model, on the other hand, has a native hub
and slow dynamics in the unfolded state relative to faster folding regardless of the
degree of coarse-graining one employs. We also demonstrate that this can result in
apparent two-state folding (i.e. activated kinetics) and that this occurs in non-downhill
folding proteins, such as the millisecond folding NTL9.
CONCLUSIONS
Many biological systems, ranging from signaling pathways to social networks, can be
most naturally described as networks. As a field, we have now established a new level
to this hierarchy: a network theory for molecular kinetics that is able to map out the
free energy landscapes of proteins and other macromolecules in their entirety.
Previous work has demonstrated that this network theory is capable of
quantitative agreement with experiments (4, 5, 12, 93) and we have now shown that it
can also scale down to simple, generic models. Using this theory at both the
quantitative and qualitative levels, we have provided an intuition for conformational
changes as drastic as protein folding and this intuition has led to experimentally
testable insights into the nature of protein free energy landscapes.
We have focused on three new insights from these network models, which
appear to hold regardless of the degree of coarse-graining one employs and can be
reconciled with current experiments. First, even models that defy a kinetic
decomposition into two states often give rise to apparent two-state behavior. Second,
proteins have heterogeneous unfolded states (multiple basins that each interconvert
more rapidly with the native state than with one another, preventing a kinetic
decomposition into two states). Third, proteins have a native hub. Thus, it is possible
to fold quickly from anywhere in the landscape but proteins often get stuck in a web of
non-native states before finally folding, greatly increasing the average folding time.
These properties are a natural result of reasonably strong non-native
interactions and the enormous number of non-native conformations a protein can
adopt, in combination with evolutionary pressure to fold quickly (for example, to
avoid aggregation). Therefore, we suggest that these conclusions are likely true of
proteins in general. Our approach also unifies other models for protein folding by
recognizing that each of them builds upon elements, whether they are called folding
nuclei (102) or foldons (104), which correspond to different types of metastable states.
We look forward to a fruitful future of drawing on network theory to better
understand molecular kinetics and guide experiments probing both general properties
and system specific details. In particular, can one reinterpret the many experiments
that have been analyzed under a two-state assumption? If so, that could shed light on
the chemistry of the underlying structures that leads to the network topology and
dynamics described here. Moreover, can further experiments be designed to directly
probe the unfolded state under native conditions (rather than with denaturant or high
temperature, where mixing is more rapid) to directly test the predictions made here?
We also hope to explore how the methodologies developed for building and
understanding biomolecular networks may be applicable to other types of networks,
especially as network theorists attempt to develop a general framework for
understanding network dynamics.
MATERIALS & METHODS
ATOMIC RESOLUTION PROTEIN FOLDING SIMULATIONS AND NETWORKS.
Ref (4) describes the use of the MSMBuilder package
(https://simtk.org/home/msmbuilder/) (10) to construct an MSM with 10,000
microstates for the villin headpiece (HP-35 NleNle). This model was based on ~450
all-atom, explicit solvent simulations, each up to 2 μs in length, for a total simulation
time of 354 μs (58). While the longest timescale transitions in the model from Ref (4)
were found to be Markovian, implying memory-less transitions between metastable
states, not every state was metastable. We used MSMBuilder to lump kinetically
related microstates into 500 metastable macrostates to ensure a direct correlation
between states in the MSM and free energy basins, as described in the SI. This is
equivalent to common experimental analyses in which the potential is smoothed and
the friction is rescaled. We note, however, that the free energy landscape for this
system is actually a hierarchy of basins so it is possible to build many valid MSMs
with different numbers of states. As a result, one would not expect there to be exactly
500 experimentally detectable states. Regardless of the resolution at which one
examines this hierarchy, however, requiring that each state is metastable ensures that
they are directly related to a free energy basin. Thus, our networks of metastable states
are an important step beyond previously described networks, which often used simpler
approximations to define state boundaries and the transition rates between states (17,
95, 110, 124, 125). An additional 40,000 simulations, each up to 400 ns in length (for
a total simulation time of 14 milliseconds), were also assigned to this MSM to explore
the effect of using more simulations.
Preliminary results for a 39 residue fragment of NTL9 are based on an MSM
built from ~1.5 milliseconds of simulation in implicit solvent with a different force
field (93). Similarities between these two systems thus suggest our results are not a
force field artifact.
SIMPLE MODELS.
We have designed three simple networks, depicted in Figure 16, that capture the
essence of various protein folding mechanisms. Each of these models has six
metastable states with approximately the same equilibrium and transition probabilities
so that differences between their behaviors may be attributed to differences in their
topologies (see Appendix D for details).
The first model (S) has a single folding pathway. This model is a natural
extension of the common U↔I↔N model (97, 126) and is often used to justify the
expense of running long simulations as shorter ones could fail to reach every state.
The second model (P) has parallel folding pathways. Parallel folding pathways
have been proposed for a number of systems (58, 98, 99, 109). In addition, this model
emphasizes the need to observe numerous folding and unfolding transitions to obtain
sufficient statistics on the entire process. The increased connectivity relative to S also
results in faster timescales.
The third model (H) has a heterogeneous unfolded state—multiple unfolded
basins that each interconvert more rapidly with the native state than with one another.
Thus, there is no kinetic decomposition of this model into two states, one folded and
one unfolded. This model was inspired by a growing body of work on the presence of
deep minima and gutters in unfolded regions of conformational space (114, 115, 127-
129).
CHAPTER 5: ATOMISTIC FOLDING SIMULATIONS OF THE FIVE HELIX
BUNDLE PROTEIN LAMBDA6-85
This chapter is in preparation as: Bowman GR, Voelz VA, Ensign DL, & Pande VS
(2010) Atomistic folding simulations of the five helix bundle protein λ6-85.
ABSTRACT
Understanding protein folding is a long-standing problem with important medical
applications, such as elucidating the role of protein misfolding in diseases like
Alzheimer’s. Solving the folding problem will ultimately require a combination of
theory and experiment, with theoretical models providing an atomically-detailed
picture of both the thermodynamics and kinetics of folding and experimental tests
grounding these models in reality. However, modeling long timescale dynamics (e.g.
microseconds, milliseconds, and beyond) with sufficient statistical accuracy and
chemical detail to make a quantitative connection with experiments is extremely
challenging. Here we report significant progress in this direction: an atomistic model
of the folding of an 80-residue fragment of the λ repressor protein with explicit solvent
that captures dynamics on 10 millisecond timescales. This advance greatly increases
the common ground accessible to both theory and experiment (both in terms of system
size and long timescales) and leads to a number of predictions that warrant further
experimental tests. For example, our model’s native state is a kinetic hub and
biexponential kinetics arise from the presence of many free energy basins separated by
barriers of different heights rather than a lack of barriers (the previously proposed
downhill scenario).
INTRODUCTION
Understanding protein folding is a long-standing problem with important medical
applications, such as elucidating the role of protein misfolding in diseases like
Alzheimer’s. Solving the folding problem will ultimately require a combination of
theory and experiment, with theoretical models providing an atomically-detailed
picture of both the thermodynamics and kinetics of folding and experimental tests
grounding these models in reality. However, modeling long timescale dynamics (e.g.
microseconds, milliseconds, and beyond) with sufficient statistical accuracy and
chemical detail to make a quantitative connection with experiments is extremely
challenging. Much progress has been made with small, fast-folding proteins but can
the methods used scale to larger, slower systems? Here we report significant progress
in this direction: an atomistic model of the folding of an 80-residue fragment of the λ
repressor protein with explicit solvent that captures dynamics on a 10 millisecond
timescale.
This advance builds on a growing body of work on describing molecular
kinetics with Markov State Models (MSMs). MSMs are essentially maps of a
molecule’s conformational space (1-3, 6). However, instead of having towns
connected by roads labeled with speed limits, MSMs have metastable states (sets of
rapidly interconverting conformations) connected by edges giving the probability of
going from one state to another. One can exploit the kinetic definition of states in an
MSM to scale from high-resolution models capable of quantitative agreement with
experiments to low-resolution models that provide an intuition for the system. In
addition, one can break up slow processes like protein folding into many small steps
that can be studied with short, parallel simulations.
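The way short, parallel trajectories combine into one model can be sketched as follows: transitions observed at a fixed lag are pooled into a single count matrix, which is then row-normalized. The function names and toy trajectories are hypothetical, and the simple row normalization shown here is only one possible estimator (it does not enforce detailed balance); it is not a description of the MSMBuilder implementation.

```python
def count_transitions(trajectories, n_states, lag):
    """Count i -> j transitions separated by `lag` frames (sliding window, lag >= 1)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for a, b in zip(traj[:-lag], traj[lag:]):
            counts[a][b] += 1
    return counts

def transition_matrix(counts):
    """Row-normalize counts into transition probabilities."""
    T = []
    for row in counts:
        total = sum(row)
        T.append([c / total if total else 0.0 for c in row])
    return T

# Two short trajectories of state indices; neither alone samples every transition.
trajs = [[0, 0, 1, 1, 2, 2, 2],
         [2, 2, 1, 0, 0, 1, 2]]
counts = count_transitions(trajs, n_states=3, lag=1)
T = transition_matrix(counts)
for row in T:
    print(row)
```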
The proteins studied with MSMs to date have generally been small and fast
folding (see Refs (3) and (2) for reviews). For example, we have built a model for a
35-residue mutant of the villin headpiece (4) that folds on the μs timescale. The native
state of this model (i.e. lowest free energy state) was within 1.8 Å of the crystal
structure, an important achievement given that all the simulations used to build the
model started from unfolded conformations. Noé et al. have built an MSM for a Pin
WW domain (5) (34 residues, μs folding time) and Voelz et al. have built an MSM for
a 39-residue fragment of NTL9 (93) (the first millisecond folder to be modeled with
MSMs). The ability of these models to predict structures, thermodynamics, and rates
indicates they should be capable of predicting any experimental observable, since all
are functions of these properties.
To test whether the MSM approach can scale to larger systems, we have
applied it to the D14A mutant of an 80-residue fragment of the λ repressor protein
(72). Full length λ repressor is a 236-residue protein capable of dimerizing and
binding to DNA, maintaining the λ phage in the lysogenic state and regulating its own
expression. Figure 21A shows the crystal structure of a 92-residue fragment that can
still dimerize and bind to DNA (130, 131). Based on this structure, Huang and Oas
selected an 80-residue fragment (λ6-85) that favors the monomeric state (Figure 21B),
making it appropriate for folding studies (132). This fragment was one of the first sub-
millisecond timescale folders to be discovered. Subsequently, a number of mutants of
λ6-85 have been found to fold on faster timescales (72, 133-135). The D14A mutant is
one of the fastest folders, having an approximately 2 μs molecular phase and an
approximately 10 μs activated phase (72). These timescales have been attributed to
downhill (or barrierless) and two-state folding, respectively.
Figure 21. (A) The crystal structure of the λ1-92 dimer bound to DNA (PDB code 1LMB). (B) A model
of λ6-85 with the Trp22-Tyr33 pair monitored in T-jump experiments space-filled.
The fast timescales reported for D14A make it a prime candidate for atomistic
molecular dynamics simulations combined with MSMs, which can now capture
millisecond timescales (93). We have run 3,265 trajectories with explicit solvent at
370 K. Each one is up to 1 μs in length, for an aggregate of 1.3 milliseconds of
simulation. These simulations were started from six initial configurations drawn from
replica exchange simulations in implicit solvent (136). One is native-like, three are
partially unfolded, and two have β-sheets. A more detailed description of our
simulations is given in Appendix E. We then constructed a high-resolution MSM with
30,000 microstates that is appropriate for making quantitative connections with
experiments. A low-resolution model with 5,000 macrostates was created from the
high-resolution MSM to facilitate interpretation of the model. More details on our use
of the MSMBuilder package (10) to construct these models are given in Appendix E.
While no single trajectory visits every state, these MSMs are able to capture long
timescale dynamics by exploiting overlap between our simulations to stitch them
together in a physically and statistically meaningful way. Examination of the implied
timescales of the microstate MSM shows that a 5 ns lag time yields Markovian
behavior (Figure 65).
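The implied timescale test rests on the relation t_i = -τ / ln λ_i(τ), where λ_i(τ) is an eigenvalue of the transition matrix estimated at lag time τ. For a Markovian process t_i does not depend on τ. The sketch below checks this on a hypothetical two-state matrix; as a simplification it squares the matrix to mimic a doubled lag time, whereas in practice the matrix is re-estimated from the data at each lag.

```python
import math

def matmul(A, B):
    """Product of two small square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def implied_timescale_2state(T, lag_time):
    """Implied timescale -lag / ln(lambda_2) for a 2-state row-stochastic matrix.

    For a 2x2 stochastic matrix the eigenvalues are 1 and (trace - 1).
    """
    lam2 = T[0][0] + T[1][1] - 1.0
    return -lag_time / math.log(lam2)

T5 = [[0.90, 0.10], [0.20, 0.80]]   # hypothetical model at a 5 ns lag time
T10 = matmul(T5, T5)                # Markovian prediction at a 10 ns lag time
t5 = implied_timescale_2state(T5, 5.0)
t10 = implied_timescale_2state(T10, 10.0)
print(f"{t5:.3f} ns at 5 ns lag, {t10:.3f} ns at 10 ns lag")
```

With real data, an implied timescale that keeps growing with the lag time signals that the chosen lag is still too short for Markovian behavior.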
RESULTS & DISCUSSION
Analysis of our high-resolution MSM reveals the presence of 10 millisecond
timescales. These timescales are preserved in an independent dataset run at 300 K and
subsamples of the 370 K dataset (Figure 66 and Figure 67), indicating that they are a
robust feature of the simulated system. Do these slow timescales reflect inadequacies
in the simulation parameters (the force field)? For example, λ repressor’s folding time
is known to be sensitive to solvent viscosity (137), so small errors in our
parameterization could easily affect our predicted rates. Or could the experimental
probes and techniques used to date be insensitive to these long timescales? One might
expect D14A, with its sizeable hydrophobic core, to fold on slower timescales given
that the wild-type villin headpiece (which is less than half the size of D14A and barely
has a hydrophobic core) is also reported to fold in just under ten μs (73).
To explore these possibilities we mapped out the 10 millisecond
timescale conformational rearrangement. Analysis of our coarse-grained MSM reveals
that this slow timescale corresponds to exchange between a compact β-sheet structure
and the crystal structure through multiple parallel pathways (Figure 68 and Figure 69).
Figure 22 shows a representative pathway between these states from our high-
resolution MSM. First, the compact β-sheet structure expands, breaking apart the β-
sheets. Then helices 1 and 4 begin to form, followed by collapse into a native-like
topology. Finally, the remaining helices form. As in a previous study (138), more
conventional projections of the free energy landscape were less informative (Figure 70
and Figure 71).
Figure 22. One of the 10 millisecond timescale pathways labeled with pfold values (the probability of
reaching state H before state A).
The prediction of β-sheet states in the unfolded ensemble under folding
conditions is somewhat surprising for a helical protein, especially since they are well
populated (Figure 72). However, experiments have shown that the unfolded and
denatured states of many systems can have significant populations of compact, β-sheet
structures yet still display the random coil statistics characteristic of expanded
conformations (139, 140). Thus, our prediction of compact, β-sheet structures is not
unreasonable.
As a further test we used our MSM to model the relaxation of a surrogate for
the Trp22-Tyr33 quenching interaction measured in T-jump experiments and a more
global metric, the Cα RMSD to the crystal structure (Figure 73). Both have
biexponential relaxation—a characteristic of D14A that has been used to argue that it
is a downhill folder—but the molecular phase is about two orders of magnitude slower
than in experiment (1 millisecond versus 10 μs). However, ignoring simulations
started from β-sheet structures yields better agreement (Figure 74). First, the Trp22-
Tyr33 surrogate has a 1 μs molecular phase and a 4.3 μs activated phase, in reasonable
agreement with the experimental values of 2 and 10 μs. Secondly, the RMSD now
relaxes on different timescales, consistent with observed probe dependent kinetics
(141, 142). Projections of the free energy onto a kinetically meaningful reaction
coordinate (pfold (51)) are not purely downhill, but could be consistent with incipient
downhill folding along parallel pathways (Figure 75). Incipient downhill folding is a
scenario in which a barrier is present but is sufficiently low that its peak is well
populated; therefore, one observes downhill folding (a molecular phase) from the
barrier top and two-state folding (an activated phase) across the barrier.
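For reference, pfold (the committor) can be computed from a transition matrix by fixing its value at 0 in the starting state and 1 in the final state and requiring q_i = Σ_j T_ij q_j everywhere else. The sketch below solves this by simple fixed-point iteration on a hypothetical four-state linear chain; the matrix is invented for illustration, and a direct linear solve would normally be preferred for large models.

```python
def pfold(T, source, sink, n_iter=5000):
    """Committor: probability of reaching `sink` before `source` from each state.

    Iterates q_i = sum_j T[i][j] * q_j, holding q = 0 in `source` and q = 1 in `sink`.
    """
    n = len(T)
    q = [0.0] * n
    for s in sink:
        q[s] = 1.0
    for _ in range(n_iter):
        new = q[:]
        for i in range(n):
            if i not in source and i not in sink:
                new[i] = sum(T[i][j] * q[j] for j in range(n))
        q = new
    return q

# Hypothetical chain: state 0 (start) - 1 - 2 - state 3 (end).
T = [[0.9, 0.1, 0.0, 0.0],
     [0.1, 0.8, 0.1, 0.0],
     [0.0, 0.1, 0.8, 0.1],
     [0.0, 0.0, 0.1, 0.9]]
q = pfold(T, source={0}, sink={3})
print([round(x, 3) for x in q])
```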
Based on these results, we cannot conclusively determine whether the stability
of the β-sheet states is a force field artifact or a feature of D14A not yet detected by
experiments. It is possible that short T-jumps simply cannot reach the β-sheet states.
Fully resolving this issue will likely require more experiments and more points of
comparison between theory and experiment. Regardless of the outcome, it is exciting
that MSMs built from atomistic simulations can now capture 10 millisecond
timescales.
The crystallographic state (Figure 22H, probability ~0.09) is not the native
(most stable) state in our model. The native state in our model (Figure 22G,
probability ~0.44) differs from the crystallographic state in that helix five is unraveled
and packed against the side of the protein. This observation is consistent with both the
negligible helical propensity in helix five reported by Agadir (143) (Figure 76) and the
context of this helix in the original crystal structure (Figure 21A), where it is extended
by seven residues. These extra residues form important contacts between the two
members of the dimer that could stabilize helix five. Truncating the sequence to
favor the monomer could lead to a lack of structure in the remaining residues of helix
five, resulting in a strong propensity to fill in the hydrophobic cavity normally
occupied by the corresponding helix in the other member of the dimer or adopt one
of a number of other well-populated, unstructured conformations (Figure 72). Further
support for this observation comes from the fact that a crystal structure of λ6-85 has
high B-factors in helix five (135) and the stability of this system seems to be
insensitive to mutations in this helix (136). Similar results were also found in a Gō
model study, where helix five tended to un-dock from the rest of the protein (138).
However, Gō models do not include non-native interactions, so helix five was not
found to unravel or pack against the protein. The behaviors of a variational model (144)
and a diffusion-collision model (145) also differ from that found here because they lack
non-native interactions. However, the diffusion-collision model is similar in nature
to our MSM approach in its use of states and rates. Helix five was also found to be
unstable in replica exchange simulations with implicit solvent (136).
Our MSM for D14A is also consistent with previous reports of native hubs (16,
146). A first hint of this comes from the large number of connections to our native
state (Figure 23). The native state in our model makes direct connections to 98% of
the non-native states while non-native states only connect to 0.1% of the other states
on average. Moreover, the MFPTs to the native state are typically ~10 times faster
than the MFPTs between non-native states, as shown in Figure 24. Therefore,
molecules in non-native states can generally fold faster than they can transition to
other non-native states. The fastest way to transition between two randomly selected
non-native states is then to fold and unfold.
Figure 23. The 500 most populated macrostates with sizes proportional to their free energies and
connections between states if transitions between them occurred in our simulations. The native
state (green state with green connections) is a hub. The crystallographic state from Figure 22H is
blue, the compact β-sheet state from Figure 22A is red, and the remaining states are yellow. All of
these states have smaller equilibrium populations and fewer connections than the native state.
Figure 24. Distributions of mean first passage times (MFPTs) between sets of microstates (A) without
weighting the distribution and (B) weighting each MFPT by the equilibrium probability of the
starting state. The solid line is the distribution of MFPTs from non-native to native microstates and
the dashed line is the distribution of MFPTs between non-native states. The average MFPT from
non-native states to native ones is about 10 times faster than that between non-native states in (A)
and the difference is even greater in (B). Native microstates were defined as those in the most
populated macrostate. All other microstates were considered non-native.
This hub model presents an alternative to the two-state and downhill models
often used to describe protein folding and interpret experiments. Rather than having a
single dominant barrier or no barrier at all, the hub model has many metastable states
separated by barriers of different heights and numerous unfolded basins that
interconvert more rapidly with the native state than one another. Therefore, there are
many parallel folding pathways. We have already shown that MSMs with native hubs
can predict the dominant two-state behavior and burst phase kinetics of other systems
(16). Here we show that MSMs with native hubs can also predict the biexponential
relaxation of D14A that has previously been attributed to downhill (or barrierless)
folding (72, 147). Our previous work proposed that the native hub results from non-
negligible non-native contacts, which must be broken in order to fold (16, 146). Figure
22 demonstrates this behavior in our model of D14A. Testing the hub model will
require more experiments on the unfolded state under native conditions (rather than at
high temperature or in the presence of denaturant, where the unfolded ensemble is
likely more diffuse).
CONCLUSIONS
The combination of simulations and MSMs can now access ~10 millisecond
timescales for moderately large (~80 residue) systems, greatly increasing the common
ground between theory and experiment. The ability of our MSMs to capture
biexponential kinetics also indicates that proteins previously designated as downhill
folders may actually have many barriers of differing heights. In addition, our model
leads to a number of predictions for D14A: 1) current experiments may be failing to
detect processes on 10 millisecond timescales, 2) there may be significant β-sheet
structure in the unfolded ensemble under native conditions, 3) helix five may unfold
and fill a hydrophobic pocket in the native state and lack structure in other well
populated states, and 4) the native state may act as a kinetic hub. Our ability to
reconcile these observations with existing experiments suggests that more
experimental data will be necessary to provide a detailed description of how D14A
folds. We suggest that MSMs could be used to help design such experiments and lead
to important new insights into folding or, at the very least, provide more data for
refining existing force fields and improving the agreement between theory and
experiment.
CHAPTER 6: ENHANCED MODELING VIA NETWORK THEORY: ADAPTIVE
SAMPLING OF MARKOV STATE MODELS
This chapter was taken from: Bowman GR, Ensign DL, & Pande VS (2010) Enhanced
modeling via network theory: adaptive sampling of Markov state models. J Chem
Theory Comput 6:787-794.
ABSTRACT
Computer simulations can complement experiments by providing insight into
molecular kinetics with atomic resolution. Unfortunately, even the most powerful
supercomputers can only simulate small systems for short timescales, leaving
modeling of most biologically relevant systems and timescales intractable. In this
work, however, we show that molecular simulations driven by adaptive sampling of
networks called Markov State Models (MSMs) can yield tremendous time and
resource savings, allowing previously intractable calculations to be performed on a
routine basis on existing hardware. We also introduce a distance metric (based on the
relative entropy) for comparing MSMs. We primarily employ this metric to judge the
convergence of various sampling schemes but it could also be employed to assess the
effects of perturbations to a system (e.g. determining how changing the temperature or
making a mutation changes a system’s dynamics).
INTRODUCTION
Molecular dynamics simulations are a powerful means of understanding both the
thermodynamics and kinetics of molecular processes like protein folding and
conformational changes. Unfortunately, such processes are highly sensitive to the
underlying chemical details. For example, point mutations in the amino acid sequence
of a protein may have significant effects on its kinetics (147) and a small number of
point mutations can even drastically change the native structure (148). Thus, atomistic
simulations are required to make quantitative connections with experiments (149,
150).
Advances in computing have made it possible to rapidly generate huge data
sets even at this level of chemical detail (79, 151); however, these data sets are still
insufficient. A typical computer can only simulate ~5 nanoseconds/day of protein
folding and would thus take over 500 years to simulate one millisecond, a typical
protein folding time. Whether one is interested in dynamics or merely
equilibrium probabilities, a kinetic perspective on this problem that explicitly
considers the rate of equilibration reveals that metastability, or the presence of long-
lived states that act as “traps”, is a common source of inefficiency.
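The 500-year figure follows directly from the quoted throughput. The short check below is only a back-of-the-envelope sketch; the function name and the 365.25 days/year conversion are assumptions introduced here for illustration.

```python
# Back-of-the-envelope check of the serial simulation cost quoted in the text.
NS_PER_DAY = 5.0          # single-machine throughput quoted in the text
TARGET_NS = 1.0e6         # one millisecond expressed in nanoseconds
DAYS_PER_YEAR = 365.25    # assumption used only for this unit conversion


def years_to_simulate(target_ns, ns_per_day=NS_PER_DAY):
    """Wall-clock years needed to reach target_ns on one machine."""
    return target_ns / ns_per_day / DAYS_PER_YEAR


years = years_to_simulate(TARGET_NS)
print(f"{years:.0f} years of wall-clock time for one millisecond")
```

Dividing the same workload among N independent machines reduces the wall-clock time by at most a factor of N, which is the motivation for the statistical approaches discussed next.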
One approach to dealing with this issue is to make tremendous investments in
specialized software and hardware for generating long simulations (152). While
theoretically sound (153), this serial approach often only results in simulations that are
long relative to standard trajectories. However, a truly-long simulation must be orders
of magnitude longer than the slowest relaxation time so that the probabilities of all
states and pathways can be estimated accurately. Even if such a simulation were
possible, the task of analyzing the data would still remain (152, 154). Moreover, serial
approaches are inherently inefficient, both due to parallelization overhead and, more
importantly, the fact that they waste hundreds of years of computing time waiting for
rare events.
A statistical approach provides a fundamentally different perspective on model
construction. Rather than attempting to generate one realization of an entire process,
one instead aims to generate an ensemble of events in parallel. For example, a number
of methods have been developed for exploiting statistical mechanics to simulate
protein folding more efficiently (69, 84, 155, 156). Most of these approaches rely on
the fact that in two-state protein folding, the waiting time for observing a transition is
exponentially distributed but the actual transition times are quite rapid (120). Thus,
proteins often fold much faster or slower than the average folding time. Such
approaches are amenable to commodity hardware and take far less wall-clock time
than a serial approach with an equivalent amount of sampling, particularly when
combined with grid computing (79). Unfortunately, these methods are generally only
applicable to two-state systems and may require simulations of an unknown minimum
length (121). Some multi-state generalizations exist (157) but quickly become
computationally intractable.
Markov State Models (MSMs) extend this work by allowing for a tractable,
multi-state scheme that allows efficient modeling of any system exhibiting
metastability (9). An MSM is a network with nodes corresponding to metastable states
and edges describing the rates of transitioning between pairs of states, akin to a map
with cities connected by roads labeled with speed-limits. Rather than attempting to
generate one realization of an entire process, one can exploit the decomposition of
conformational space into multiple metastable states to gather statistics on each step of
the process independently, allowing a problem to be broken up into more manageable
and trivially parallelizable pieces.
Mathematically, MSMs are represented as transition probability matrices, with
the entry in row i and column j giving the probability of transitioning from state i to
state j within a time interval called the lag time of the model. Building MSMs is a
challenging task but significant progress has been made over the past few years (3, 4,
6, 10), leading to freely available software for automatically constructing these models
(10). While MSMs could be used to analyze truly long simulations, their ultimate
value lies in their ability to facilitate efficient model construction by allowing precise,
parallel determination of the transition rates between states by running many short
simulations from each of them.
Adaptive sampling algorithms for MSM construction take this statistical
approach a step further (12, 19, 20). In adaptive sampling, one first obtains an initial
model of the entire process of interest by any means possible. One then iteratively
calculates the contribution of each step of the process to uncertainties in some
observable of interest via Bayesian statistics and runs numerous parallel simulations of
the steps that can lead to the greatest increases in precision until the desired level of
statistical certainty is achieved. Such an approach was recently shown to lead to
dramatic reductions in the statistical uncertainty in the observable of interest relative
to other refinement schemes (19).
However, a number of important questions remain to be answered. First, does
adaptive sampling improve the global model quality or just local components that are
important for the observable of interest? Exactly how much more efficient is adaptive
sampling? And finally, is adaptive sampling capable of discovering previously
unknown components of a model, or is it only able to refine the initial model it is
given?
In this work, we address these questions using an MSM for the villin headpiece
(HP-35 NleNle) that was recently constructed from atomistic simulations with explicit
solvent (4). We then move on to simple models, where the role of the network is clear,
to gain an intuition for our results and test whether such methods could be more
broadly applicable to a wide class of different types of systems. These analyses rely on
a new distance metric for MSMs developed in Section 2.2, which should prove
generally useful for evaluating various sampling schemes and even assessing the
effects of perturbations to a system (like changes in temperature or even mutations).
THEORETICAL UNDERPINNINGS
ADAPTIVE SAMPLING.
In adaptive sampling approaches to MSM construction, simulations are run iteratively
to minimize uncertainties in some property of a model (12, 19, 20). In this work,
adaptive sampling is performed as follows:
1. perform N simulations of L steps starting from a particular starting state(s)
2. build an MSM only including those states identified so far
3. calculate the contribution of each state to uncertainty in the slowest kinetic rate
following Ref (19)
4. start N new simulations of L steps distributed amongst the states in proportion to
their contribution to uncertainty in the slowest rate
5. repeat steps 2-4 for some number of iterations
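Step 4 of this procedure can be sketched as follows. This is a minimal illustration, not the code used in this work: the per-state uncertainty contributions are taken as given (computing them following Ref (19) is not shown), and the function name and the largest-remainder rounding rule are assumptions made here so that the integer allocations sum to exactly N.

```python
def allocate_simulations(contributions, n_sims):
    """Split n_sims across states in proportion to their uncertainty contributions.

    contributions: dict mapping state -> non-negative contribution (taken as given).
    Returns a dict mapping state -> integer number of simulations; values sum to n_sims.
    """
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("at least one contribution must be positive")
    # Ideal (fractional) share of simulations for each state.
    shares = {s: n_sims * c / total for s, c in contributions.items()}
    # Start from the integer part of each share ...
    counts = {s: int(share) for s, share in shares.items()}
    # ... then hand out the leftover simulations by largest fractional remainder.
    leftover = n_sims - sum(counts.values())
    by_remainder = sorted(shares, key=lambda s: shares[s] - counts[s], reverse=True)
    for s in by_remainder[:leftover]:
        counts[s] += 1
    return counts


# Hypothetical contributions for three states, with N = 10 new simulations.
example = allocate_simulations({0: 6, 1: 3, 2: 1}, 10)
```

With these hypothetical inputs, most of the new simulations start from state 0, but the remaining states still receive a share.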
All the MSMs in this work were constructed and analyzed with the
MSMBuilder package (which is freely available at https://simtk.org/home/msmbuilder/)
(10) modified such that transition count matrices were not symmetrized by counting
the transitions that would have been observed if one watched each simulation
backwards.
We note that, in the past, simulations in each round of adaptive sampling were
all started from the same initial state (the one contributing most to uncertainty in the
quantity of interest) (19). The intuition behind our alteration was that as the number of
simulations (N) becomes large, starting all the simulations from one state would be
excessive as fewer would be sufficient to drastically reduce the uncertainty. Instead, it
would be preferable to allocate some of these excess simulations to reduce
uncertainties in other states’ transition probabilities. Indeed, we have found that our
modified procedure yields better results for sufficiently large N on reasonably
complex networks and gives equivalent results for simple networks and small N.
To demonstrate the utility of this algorithm, we carried out adaptive sampling
with synthetic trajectories generated from transition count matrices. To generate
synthetic simulations from a transition count matrix we first normalize each row to
obtain a transition probability matrix. At each time step (or each lag time), the next
state is chosen according to the distribution of transition probabilities for the current
state. The prior described below is not used for these calculations, so the matrices used
to generate trajectories tend to be sparse.
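The procedure for generating synthetic trajectories can be sketched in a few lines. This is an illustrative sketch using only the Python standard library; the function names and the toy count matrix are assumptions introduced here, and no prior is added, as in the text.

```python
import random


def row_normalize(counts):
    """Turn a transition count matrix (list of rows) into transition probabilities."""
    probs = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("state with no observed transitions")
        probs.append([c / total for c in row])
    return probs


def synthetic_trajectory(counts, start, n_steps, rng=None):
    """Sample a sequence of states (n_steps transitions) starting from `start`."""
    rng = rng or random.Random()
    probs = row_normalize(counts)
    states = list(range(len(probs)))
    traj = [start]
    for _ in range(n_steps):
        # Draw the next state from the current state's row of probabilities.
        traj.append(rng.choices(states, weights=probs[traj[-1]])[0])
    return traj


# Hypothetical three-state count matrix with no direct 0 <-> 2 transitions.
toy_counts = [[90, 10, 0],
              [10, 80, 10],
              [0, 10, 90]]
traj = synthetic_trajectory(toy_counts, start=0, n_steps=20, rng=random.Random(0))
```

Because the count matrix is sparse, transitions with zero counts can never appear in a synthetic trajectory.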
QUANTIFYING THE SIMILARITY BETWEEN MSMS.
In order to monitor the convergence of any sampling scheme, it is important to first
develop a similarity metric that is capable of measuring the global quality of a test
model relative to some reference model. Such a metric would also have broad
usefulness, as there are several reasons for comparing MSMs quantitatively. For
example, this metric could be used to compare MSMs generated by two different
simulation methods, allowing one to directly compare the resulting dynamics.
Alternatively, one could compare MSMs built for two different but related systems,
such as two point mutants of a given protein.
We have developed such a distance metric for MSMs that is based on the
relative entropy, which is a common measure of the distance between two probability
distributions in information theory (158) with important physical implications (159).
The relative entropy between two normalized distributions P and Q, over a common
set of outcomes, is
$$D(P\|Q) = \sum_{i} P_i \log \frac{P_i}{Q_i}$$
where $P_i$ is the probability of outcome i, P is a reference distribution, and Q is some
test distribution.
An MSM consists of one normalized distribution per state, which gives the
probability of transitioning to each other state within one lag time. We define the
relative entropy between a reference and test MSM, with transition matrices P and Q
respectively, as
$$D(P\|Q) = \sum_{i,j}^{N} P_i P_{ij} \log \frac{P_{ij}}{Q_{ij}} \tag{6.1}$$
where $P_i$ is the equilibrium probability of state i, $P_{ij}$ is the probability of transitioning
from state i to state j during one lag time, and N is the number of states. Intuitively,
our relative entropy metric is the sum of the relative entropies between the transition
probability distributions for each state weighted by their stationary probabilities.
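As a concrete illustration of Eq. 6.1, the sketch below evaluates the metric for two small transition matrices. It is not the MSMBuilder implementation: the function names, the use of power iteration to obtain the equilibrium probabilities, the natural logarithm, and the two-state example matrices are all assumptions made here for illustration.

```python
import math


def stationary_distribution(T, n_iter=10000, tol=1e-12):
    """Power iteration for the stationary distribution of a row-stochastic matrix T."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        new = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def msm_relative_entropy(P, Q, pi=None):
    """Eq. 6.1: sum over i, j of pi_i * P_ij * log(P_ij / Q_ij).

    P is the reference transition matrix, Q the test matrix, and pi the
    equilibrium distribution of P (estimated from P if not supplied).
    """
    if pi is None:
        pi = stationary_distribution(P)
    d = 0.0
    for i, (p_row, q_row) in enumerate(zip(P, Q)):
        for p, q in zip(p_row, q_row):
            if p == 0.0:
                continue            # terms with P_ij = 0 contribute nothing
            if q == 0.0:
                return math.inf     # the test model forbids an observed transition
            d += pi[i] * p * math.log(p / q)
    return d


# Hypothetical two-state reference (P) and test (Q) models.
P = [[0.9, 0.1], [0.2, 0.8]]
Q = [[0.8, 0.2], [0.3, 0.7]]
distance = msm_relative_entropy(P, Q)
```

The metric is zero when the test model equals the reference and is not symmetric in P and Q, so the choice of reference matters. Power iteration converges slowly for strongly metastable models, where an eigenvector solver would be a better choice.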
One may derive our relative entropy metric for MSMs more formally by
considering that the entropy (H) of a sample path of a stochastic process, normalized
by its length, is also called the entropy rate. An important theorem in information
theory is the following:
Theorem. For an ergodic stochastic process $X_1, \ldots, X_n$
$$\lim_{n\to\infty} \frac{1}{n} H(X_1, \ldots, X_n) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1)$$
For a Markov Chain, the right hand side takes a very simple form, because the
conditional entropy only depends on the previous step, which converges to the
stationary distribution.
In the following, we prove a similar statement for the relative entropy between
the paths of two Markov chains as n goes to infinity. For two Markov chains p and q
with state space Ω, we would like to compute:
$$\lim_{n\to\infty} \frac{1}{n} D\big(p(X_1, \ldots, X_n)\,\|\,q(X_1, \ldots, X_n)\big)$$
For simplicity, let us define lowercase $x_n = (X_1, \ldots, X_n)$. Then, by the
chain rule for the relative entropy, we get:
$$\lim_{n\to\infty} \frac{1}{n} \Big[ D\big(p(x_{n-1})\,\|\,q(x_{n-1})\big) + D\big(p(X_n \mid x_{n-1})\,\|\,q(X_n \mid x_{n-1})\big) \Big] \tag{6.2}$$
Eq. 2.65 in Cover & Thomas (160) defines the conditional relative entropy
above as the expectation of the relative entropy between the conditional distributions
of $X_n$ given $x_{n-1}$, with respect to the distribution of $x_{n-1}$. This means that:
$$\begin{aligned}
D\big(p(X_n \mid x_{n-1})\,\|\,q(X_n \mid x_{n-1})\big) &= \sum_{y} p(x_{n-1} = y)\, D\big(p(X_n \mid y)\,\|\,q(X_n \mid y)\big) \\
&= \sum_{Y} p(X_{n-1} = Y)\, D\big(p(X_n \mid Y)\,\|\,q(X_n \mid Y)\big)
\end{aligned}$$
where we have grouped terms with the same final state in the "history" y, which have
the same relative entropy factor, and summed their probabilities to obtain the marginal
probability over $X_{n-1}$.
Repeating the step that led to Eq. 6.2 many times yields:
$$\lim_{n\to\infty} \frac{1}{n} \Big[ \sum_{m=2}^{n} D\big(p(X_m \mid x_{m-1})\,\|\,q(X_m \mid x_{m-1})\big) + D\big(p(X_1)\,\|\,q(X_1)\big) \Big]$$
If the initial state is deterministic, the last term is just zero. As for the first
term, as n goes to infinity, the distribution of $X_{m-1}$ goes to the stationary distribution of
p, which we call μ. Then, using the equation for the conditional relative entropy,
$$\lim_{n\to\infty} D\big(p(X_n \mid x_{n-1})\,\|\,q(X_n \mid x_{n-1})\big) = \sum_{Z} \mu(Z) \sum_{Y} p(Y \mid Z) \log\left[\frac{p(Y \mid Z)}{q(Y \mid Z)}\right]$$
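Since the terms in the series converge to a limit, their Cesàro means
converge to the same limit, so:
$$\lim_{n\to\infty} \frac{1}{n} D\big(p(X_1, \ldots, X_n)\,\|\,q(X_1, \ldots, X_n)\big) = \sum_{Z} \mu(Z) \sum_{Y} p(Y \mid Z) \log\left[\frac{p(Y \mid Z)}{q(Y \mid Z)}\right]$$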
The terms p(Y|Z) and q(Y|Z) are just the elements of the transition matrices of
p and q respectively, so this is equivalent to Eq. 6.1.
PRIOR FOR RELATIVE ENTROPY AND ADAPTIVE SAMPLING.
There is always some probability of transitioning between every pair of states, though
these probabilities may be low enough that no actual transitions are observed. To
account for this, as well as to reflect our lack of prior knowledge about the transition
probabilities, we add a pseudo-count of 1/N to every element of the transition count
matrix, where N is the number of states, before normalizing each row to find the
transition probability matrix, as in Refs (19, 161). The intuition behind this choice is
that for a state to exist we must observe at least one count in that state but before
observing any real data the probability of this count leading to any other state is equal.
From a Bayesian perspective, these pseudo-counts equate to a uniform prior. These
pseudo-counts also prevent the relative entropy metric from becoming infinite
whenever a zero is encountered in an MSM’s transition probability matrix. It is often
the case that certain transitions are not observed, so this correction is of great practical
importance.
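The pseudo-count correction described above amounts to a one-line change before row normalization. The sketch below is illustrative only; the function name and the example counts are assumptions introduced here.

```python
def transition_matrix_with_prior(counts):
    """Add a pseudo-count of 1/N to every element, then normalize each row.

    counts: square transition count matrix as a list of rows; N is the number of states.
    """
    n = len(counts)
    pseudo = 1.0 / n
    T = []
    for row in counts:
        padded = [c + pseudo for c in row]
        total = sum(padded)          # observed row total plus one pseudo-count in all
        T.append([c / total for c in padded])
    return T


# Hypothetical counts: state 1 has been discovered but has no observed transitions yet.
observed = [[8, 2, 0],
            [0, 0, 0],
            [1, 0, 9]]
T = transition_matrix_with_prior(observed)
```

A state with no observed transitions receives a uniform row, and no entry of the resulting matrix is exactly zero, so the relative entropy to any reference model stays finite.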
VILLIN SIMULATIONS AND MSM.
The original ~450 villin simulations are described in detail in Ref (58). In short,
~450 constant temperature molecular dynamics simulations with
explicit solvent and up to 2 μs in length were run from nine initial configurations
drawn from high temperature unfolding simulations at 373 K. Ref (4) describes the
construction of a 10,000 microstate MSM that faithfully reproduces the raw simulation
data. For the purposes of this work, we lumped these 10,000 microstates into 500
macrostates exhibiting metastability and having an equivalent Markov time (15 ns).
This lumping was done with the MSMBuilder package (10). The macrostates
containing the nine initial configurations used during the real simulations were used as
the starting points for adaptive sampling. Simulations of just 30 ns were used for
adaptive sampling.
SIMPLE MODELS.
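The transition count matrices for simple models S and P ($C_S$ and $C_P$ respectively) are
$$C_S = \begin{pmatrix}
6000 & 3 & 0 & 0 & 0 & 0 \\
3 & 1000 & 3 & 0 & 0 & 0 \\
0 & 3 & 1000 & 3 & 0 & 0 \\
0 & 0 & 3 & 1000 & 3 & 0 \\
0 & 0 & 0 & 3 & 1000 & 3 \\
0 & 0 & 0 & 0 & 3 & 90000
\end{pmatrix}$$
and
$$C_P = \begin{pmatrix}
6000 & 2 & 2 & 0 & 0 & 0 \\
2 & 1000 & 0 & 2 & 2 & 0 \\
2 & 0 & 1000 & 2 & 2 & 0 \\
0 & 2 & 2 & 1000 & 0 & 2 \\
0 & 2 & 2 & 0 & 1000 & 2 \\
0 & 0 & 0 & 2 & 2 & 90000
\end{pmatrix}$$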
where the entry in row i and column j gives the number of transitions observed from
state i to state j.
Mean first passage times were calculated following Ref (161). The mean first
passage times from state 1 to state 6 for S and P are ~13,000 and ~5,000 steps respectively. Other
equilibrium properties can be obtained by normalizing each row to obtain a transition
probability matrix and then solving for the eigenvalues and eigenvectors of this
matrix. For example, normalizing the first eigenvector (i.e. the one corresponding to
an eigenvalue of 1) gives the equilibrium probabilities of each state. Subsequent
eigenvalue/eigenvector pairs give kinetic rates and the states involved in these
transitions respectively (9). Once again, the MSMBuilder package (10) was used for
analysis of these models.
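The quoted mean first passage times can be checked independently of MSMBuilder with a small linear solve. The sketch below is an illustration under stated assumptions: the count matrices are entered as read from the definitions of S and P above, the function name is hypothetical, and the MFPTs are obtained by solving m_i = 1 + Σ_j T_ij m_j (with m = 0 at the target) by Gauss-Jordan elimination rather than by the method of Ref (161).

```python
def mfpt_to_target(counts, target):
    """Mean first passage time, in steps, from every state to `target`."""
    n = len(counts)
    T = [[c / sum(row) for c in row] for row in counts]   # row-normalized counts
    idx = [i for i in range(n) if i != target]
    k = len(idx)
    # Augmented system (I - T) m = 1, restricted to the non-target states.
    A = [[(1.0 if i == j else 0.0) - T[i][j] for j in idx] + [1.0] for i in idx]
    for col in range(k):
        # Partial pivoting, then eliminate this column from every other row.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    m = [0.0] * n
    for col, i in enumerate(idx):
        m[i] = A[col][-1] / A[col][col]
    return m


# Count matrices for S (single pathway) and P (parallel pathways), zero-indexed.
C_S = [[6000, 3, 0, 0, 0, 0],
       [3, 1000, 3, 0, 0, 0],
       [0, 3, 1000, 3, 0, 0],
       [0, 0, 3, 1000, 3, 0],
       [0, 0, 0, 3, 1000, 3],
       [0, 0, 0, 0, 3, 90000]]
C_P = [[6000, 2, 2, 0, 0, 0],
       [2, 1000, 0, 2, 2, 0],
       [2, 0, 1000, 2, 2, 0],
       [0, 2, 2, 1000, 0, 2],
       [0, 2, 2, 0, 1000, 2],
       [0, 0, 0, 2, 2, 90000]]

mfpt_S = mfpt_to_target(C_S, target=5)[0]   # state 1 -> state 6 in the text's numbering
mfpt_P = mfpt_to_target(C_P, target=5)[0]
```

Under these assumptions the two values come out near 13,400 and 5,000 steps, in line with the figures quoted above. Because both count matrices are symmetric, the equilibrium probabilities are simply proportional to the row sums.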
Plots of the average relative entropy as a function of simulation number and
length were generated by running 600 simulations of 5,000 steps for each model.
Average relative entropies over 10 random samples of N trajectories from this pool
were then calculated and plotted. Similar plots for our adaptive sampling scheme were
also generated by averaging over 10 independent runs.
RESULTS & DISCUSSION
APPLICATION TO VILLIN MSM.
With these tools in place, we are now in a position to assess the efficacy of adaptive
sampling using a previously calculated MSM for the villin headpiece (4) as a model
system. In particular, we would like to assess two types of efficiency. First, given our
desire to push the envelope of what is possible in a reasonable amount of time, can
adaptive sampling reduce the wall-clock time necessary to achieve a given model
quality? Second, given our desire to mitigate negative impacts on the environment,
can adaptive sampling reduce the amount of resources (in this case computer time)
necessary to achieve a given model quality?
To address these questions we have performed adaptive sampling with a
variable number of simulations per iteration generated from our villin MSM. We then
assume each simulation progresses at a rate of 5 ns/day, a typical value for modern
personal computers, and compare the convergence of our adaptive simulations to the
gold-standard model from Ref (4) (that was validated by comparison to both the raw
simulation data and experiments) with the convergence of a single long reference
simulation to the same gold-standard. Convergence to the gold-standard model is
measured with our relative entropy metric for MSMs (described in Section 2.2).
Figure 25A shows that the wall-clock time efficiency of adaptive sampling
scales linearly up to 5,000 simulations per iteration. That is, adaptive sampling with N
simulations per iteration can reduce the wall-clock time necessary to achieve a given
model quality by a factor of N for N as high as 5,000. Using more simulations will
help but will only reduce the wall-clock time by a factor of αN, where α<1. The
crucial result, however, is that one can reduce a calculation that would take decades to
run with traditional methods to a calculation that can be run in a matter of days with
adaptive sampling.
Figure 25. Scaling for adaptive sampling of villin as the number of parallel simulations (N) used during
each round is varied. (A) Wall-clock time scaling as N is varied. The black line is a best fit to the
linear portion of the data (circles), which extends up to 5,000 simulations per iteration. (B)
Computer time required to achieve a given model quality (relative entropy) for various sampling
schemes. L refers to one long trajectory and the numbers refer to the number of parallel simulations
used in each iteration of adaptive sampling. All results come from averaging over ten independent
runs. Each step equates to 15 ns.
Adaptive sampling can also greatly reduce the resource requirements for
achieving a given model quality. For example, Figure 25B shows the computer time
necessary to achieve a given model quality for one long simulation and adaptive
sampling with a varying number of simulations per iteration. This figure shows that
adaptive sampling requires about half as much computer time to achieve the same
model quality as one long simulation. Once again, the relative efficiency of adaptive
sampling begins to fall off beyond some optimal number of simulations per iteration.
APPLICATION TO SIMPLE MODELS.
To gain an intuition for the applicability of adaptive sampling to other systems, we
have also applied it to two classic network topologies, shown in Figure 26A and
defined more thoroughly in Section 2.5. These models are representative of problems
with metastability, their equilibrium properties can be derived analytically and used as
an unambiguous reference, and truly-long simulations are feasible.
Figure 26. (A) The two models, S and P. (B) Distance from the true model (measured via the relative
entropy) as a function of wall-clock time for adaptive sampling versus one long simulation of S
(assuming 5 steps/day to mimic 5 nanoseconds/day in protein folding simulations). The lines are
one long simulation (dashed line) and adaptive sampling with 10 simulations of 20 steps (solid
line), 10 simulations of 200 steps (dotted line), 100 simulations of 20 steps (dash-dot line), and
1000 simulations of 20 steps (black squares) per iteration.
Both models have states with approximately the same equilibrium and
transition probabilities, such that differences between their behaviors can be attributed
to differences between their topologies. More specifically, states 1-6 have equilibrium
populations of 6%, 1%, 1%, 1%, 1%, and 90% respectively. Drawing an analogy to
protein folding, state 1 is the unfolded state, state 6 is the folded state, and the
remaining states are intermediates. Thus, S has a single folding pathway and P has
parallel folding pathways.
The reduced connectivity in S results in longer timescale transitions relative to
P. In fact, the mean first passage time (MFPT) between states 1 and 6 is about three
times longer in S than in P, making S considerably harder to sample. In addition, such
linear models are often cited as a case where the holistic, long-trajectory approach is
absolutely necessary; nevertheless, adaptive sampling is able to learn the network
more efficiently than traditional approaches, as shown in Figure 26B. This figure
shows how closely various schemes can approach the true model for S given a set
amount of wall-clock time and starting from state 1 to mimic the practice of starting
protein folding simulations from an arbitrary conformation in the unfolded state.
To provide some intuition for our distance metric, Figure 27 shows the
evolution of the relative entropy and the estimated free energy of each state in S
during adaptive sampling. Adaptive sampling was carried out by running 10
simulations from state 1 and then repeatedly building an MSM and starting 10 new
simulations from the state contributing most to uncertainty in the slowest process.
Small jumps in the relative entropy are found each time a state with a low population
is discovered (or, equivalently, when a new path is discovered for this model) and a
very large jump is evident when the most populated state, state 6, is discovered. Slow
decay occurs between these jumps. Thus, our metric is most sensitive to state and path
discovery but still captures improvements in estimates of the transition probabilities
along known paths. Such behavior is desirable as models that miss important states or
paths should be penalized more than ones with imperfect transition probabilities.
Figure 27. Relative entropy (top) and free energy of each state in kcal/mol (bottom) as a function of the
adaptive sampling iteration on model S.
Figure 28 shows a more thorough comparison of adaptive sampling and
reference simulations with an equal amount of sampling for various numbers and
lengths of simulations. Evaluation of the reference simulations for both S and P
demonstrates that achieving a reasonable model quality by naively starting simulations
from state 1 requires simulations of some minimal length, though this minimal length
is shorter for P than S in terms of the absolute number of steps. Moreover, adaptive
sampling is able to gain valuable information from much shorter and fewer
simulations regardless of the topology of the network; that is, whether there is a single
folding pathway or multiple pathways. This figure also shows that adaptive sampling
generally benefits from using more parallel simulations but not longer ones. An
important point is that each data point in Figure 28B and Figure 28D depends on the
data points to its left. For example, to fill in the row corresponding to simulations of
length 100, ten independent adaptive sampling runs of 50 iterations were performed.
The first round of each adaptive sampling run was used to compute average relative
entropies for 1-10 simulations, the first and second round of each run (which depends
on the first round) for 11-20 simulations, and so forth. As a result, there is some
horizontal streakiness in these figures. We also note that adaptive sampling results in
smaller uncertainties in the relative entropies shown in Figure 28 (see Figure 77 and
Figure 78).
Figure 28. Distance from the true model (measured via the relative entropy) as a function of the number
and length of simulations averaged over 10 independent samples. (A) Reference distribution for S,
(B) adaptive sampling of S, (C) reference distribution for P, and (D) adaptive sampling of P. All
simulations for the reference distributions started from state 1. The first 10 simulations for adaptive
sampling started from state 1 and subsequent batches of simulations started from the state
contributing most to uncertainty in the slowest process. Black lines are contours of equal amounts
of data.
Finally, we find that the scaling of adaptive sampling of our simple networks is
similar to that found for villin, as shown in Figure 29. One noteworthy difference is
that our simple models saturate (i.e. fall short of linear scaling as additional parallel
simulations are run) earlier than villin. Comparison of the two simple models also
shows that S saturates before P. For S, adaptive sampling scales linearly up to 150
parallel simulations. For P, adaptive sampling scales linearly up to 500 simulations.
The improved scaling for P is the result of the increased complexity of the network
topology of P compared to S. Each node in P has more connections to learn and the
algorithm benefits from doing this in parallel. Indeed, the complexity of our villin
model is much greater than either of these simple networks and, as discussed
previously, villin scales linearly up to 5,000 simulations per iteration. Thus, we expect
that we can achieve linear scaling well beyond 5,000 simulations per iteration for
systems that are more complex than the villin MSM that we sampled from.
Figure 29. Scaling for adaptive sampling of our simple models as the number of parallel simulations (N)
used during each round is varied. (A) and (B) Wall-clock time scaling as N is varied for simple
models S and P respectively. The black line is a best fit to the linear portion of the data (circles).
(C) and (D) Computer time required to achieve a given model quality (relative entropy) for various
sampling schemes applied to S and P respectively. L refers to one long trajectory and the numbers
refer to the number of parallel simulations used in each iteration of adaptive sampling. All results
come from averaging over ten independent runs.
APPLICABILITY.
The adaptive sampling algorithm employed here was developed for application to
MSMs with metastable states. That is, it assumes that every state has a self-transition
probability greater than 0.5 such that a simulation in one state is more likely to stay
there than to transition to a new state. This property helps to ensure a separation of
timescales (fast intrastate transitions, slow interstate transitions) and, therefore, that
the model is Markovian because a simulation can lose memory of its previous state
before transitioning to a new one. Thus, the procedure for ab initio adaptive sampling
is: 1) run some initial simulations, 2) cluster all the simulation data into microstates, 3)
lump these microstates into metastable macrostates, 4) calculate the contribution of
each macrostate to uncertainties in the slowest rate (or some other observable), 5) start
new simulations from each state in proportion to its contribution to the overall
uncertainty, and 6) repeat steps 2-5 until the desired level of statistical certainty is
achieved. In the future it will be interesting to explore whether this adaptive sampling
algorithm is equally applicable to more fine grained divisions of conformational space
(e.g. at the microstate level) as the lumping stage would no longer be necessary. In
addition, recent work has shown that more fine grained MSMs are better for obtaining
quantitative predictions of experimental observables (4, 5, 15), so it could be
advantageous to do refinement at this level.
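Step 5 of the procedure above (seeding new simulations in proportion to each state's contribution to the uncertainty) can be illustrated with a minimal sketch. The function name `allocate_simulations` and the largest-remainder rounding are assumptions made for illustration; the text does not specify how fractional allocations are rounded.

```python
import numpy as np

def allocate_simulations(contributions, n_sims):
    """Split n_sims new simulations across states in proportion to each
    state's non-negative contribution to the overall uncertainty."""
    c = np.asarray(contributions, dtype=float)
    if np.any(c < 0) or c.sum() <= 0:
        raise ValueError("contributions must be non-negative with a positive sum")
    ideal = n_sims * c / c.sum()           # fractional simulations per state
    counts = np.floor(ideal).astype(int)   # whole simulations each state is owed
    leftover = n_sims - counts.sum()       # simulations lost to flooring
    # Assumption: leftovers go to the states with the largest fractional
    # remainders (ties broken by state index).
    order = np.argsort(-(ideal - counts), kind="stable")
    counts[order[:leftover]] += 1
    return counts
```

For example, `allocate_simulations([1, 1, 2], 8)` assigns two simulations to each of the first two states and four to the third. A state with zero contribution receives no new simulations under this scheme.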
The relative entropy metric assumes that the two models being compared have
the same state-space. Comparing two simulation data sets therefore requires the
following steps: 1) define a state space common to both datasets (i.e. by using both
data sets for clustering to define microstates and, optionally, lumping to define
macrostates), 2) compute a transition probability matrix for each data set
independently, and 3) compute the relative entropy between these matrices.
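Steps 2 and 3 can be sketched as follows, under the assumption that the relative entropy takes the stationary-weighted form D(P||Q) = sum_i pi_i sum_j P_ij ln(P_ij / Q_ij), where pi is the equilibrium distribution of the reference model P. That form, the natural logarithm, the small pseudocount, and all function names are assumptions for illustration; the definition is not restated in this section.

```python
import numpy as np

def transition_matrix(counts):
    """Row-normalize a matrix of observed transition counts."""
    c = np.asarray(counts, dtype=float)
    rows = c.sum(axis=1, keepdims=True)
    if np.any(rows == 0):
        raise ValueError("every state needs at least one observed transition")
    return c / rows

def stationary_distribution(T):
    """Left eigenvector of T with eigenvalue closest to 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(T.T)
    k = np.argmin(np.abs(vals - 1.0))
    pi = np.real(vecs[:, k])
    return pi / pi.sum()

def relative_entropy(P, Q, pseudocount=1e-12):
    """Assumed form: D(P||Q) = sum_i pi_i sum_j P_ij ln(P_ij / Q_ij)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape:
        raise ValueError("both models must share the same state space")
    pi = stationary_distribution(P)
    Qs = Q + pseudocount                    # assumption: guard against log(0)
    Qs = Qs / Qs.sum(axis=1, keepdims=True)
    ratio = np.ones_like(P)
    mask = P > 0                            # terms with P_ij = 0 contribute nothing
    ratio[mask] = P[mask] / Qs[mask]
    return float(np.sum(pi[:, None] * P * np.log(ratio)))
```

With this form the relative entropy of a model with itself is zero, and it grows as the test model Q departs from the reference P.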
CONCLUSIONS
Together, our results with villin and fundamental model systems demonstrate the
tremendous value of adaptive sampling. Since model quality has been assessed with a
global metric and shows strong agreement between adaptive sampling results and the
true model, we can conclude that adaptive sampling to minimize uncertainties in the
slowest kinetic rate improves the global quality of a model. Moreover, adaptive
sampling is significantly more efficient than a single long simulation, both in terms of
the wall-clock time and resources required to achieve a given model quality, up to
some saturation point. In fact, adaptive sampling with N parallel simulations requires
about a factor of two less computer-time and a factor of N less wall-clock time.
Considering that N can easily be as large as 10,000 (or more) (79), this can be a truly
dramatic advantage in wall-clock time, turning calculations normally requiring
decades into routine calculations on the timescale of days. Finally, since our
simulations started from just a couple of states, we can conclude that adaptive
sampling is capable of discovering new model components given no prior knowledge
of the system, and is thus useful for model construction in addition to model
refinement.
The adaptive sampling method described here may be directly applied to learn
models from simulations of metastable phenomena, leading to significant resource and
time savings in fields like molecular and quantum mechanics, but is not limited to
these applications. Given a means to prepare samples within a given state, it could be
applied equally well to experimental techniques, such as single molecule FRET and
force extension experiments. More broadly, minimizing uncertainties in a model is
likely to prove valuable even when metastability is not present. Similar methods may
also be useful for understanding other complex network dynamics, as in signaling
pathways.
CHAPTER 7: SIMULATED TEMPERING YIELDS INSIGHT INTO THE LOW-
RESOLUTION ROSETTA SCORING FUNCTIONS
This chapter was taken from: Bowman GR & Pande VS (2009) Simulated tempering
yields insight into the low-resolution Rosetta scoring functions. Proteins 74:777-788.
ABSTRACT
Rosetta is a structure prediction package that has been employed successfully in
numerous protein design and other applications (162). Previous reports have
attributed the current limitations of the Rosetta de novo structure prediction
algorithm to inadequate sampling, particularly during the low-resolution phase (150,
151, 163, 164). Here, we implement the Simulated Tempering (ST) sampling
algorithm (24, 25) in Rosetta to address this issue. ST is intended to yield canonical
sampling by inducing a random walk in temperature space such that broad sampling
is achieved at high temperatures and detailed exploration of local free energy minima
is achieved at low temperatures. ST should therefore visit basins in accordance with
their free energies rather than their energies and achieve more global sampling than
the localized scheme currently implemented in Rosetta. However, we find that ST
does not improve structure prediction with Rosetta. To understand why, we carried out
a detailed analysis of the low-resolution scoring functions and find that they do not
provide a strong bias towards the native state. In addition, we find that both ST and
standard Rosetta runs started from the native state are biased away from the native
state. Although the low-resolution scoring functions could be improved, we propose
that working entirely at full-atom resolution is now possible and may be a better
option due to superior native-state discrimination at full-atom resolution. Such an
approach will require more attention to the kinetics of convergence, however, as
functions capable of native state discrimination are not necessarily capable of rapidly
guiding non-native conformations to the native state.
INTRODUCTION
Since the discovery that a protein’s structure is determined by its sequence (46), a
great deal of effort has been poured into trying to predict structure from sequence.
Thus far, knowledge-based approaches have proved promising, though more purely
physics-based structure prediction has potential (92). The Rosetta suite is one of the
most successful approaches, and employs a combination of knowledge-based
strategies and physical insight. Some of the more prominent achievements of this
software package are the design of a protein with novel topology (165), the redesign of
protein-protein interfaces (166), the redesign of protein-nucleic acid interfaces (167),
the redesign of a folding pathway (168, 169), aid in solving the crystallographic phase
problem (170), and, most recently, the design of new enzymes for reactions without
known biological catalysts (171).
It has been suggested that the success of Rosetta is in large part due to its
accurate scoring functions (151, 172). In the sense that many of the terms are based on
energetic principles derived from physical chemistry, one can think of the scoring
functions as energy functions. On the other hand, many of the terms are based on
statistics from the PDB databank. Because they are based on native protein structures,
which are assumed to represent the lowest free energy structures for a given sequence,
these terms implicitly consider entropic contributions. Thus, the scoring functions can
be thought of as free energy functions. In addition, the practice of clustering the lowest
scoring structures is, in a sense, taking into account entropy by considering the relative
populations of various states (173). To avoid confusion we will use the term “scoring
function.” This is probably the most precise term as the scoring functions are
primarily designed to discriminate native structures from non-native ones rather than
to reproduce physical behavior. Furthermore, it allows us to more clearly discuss the
conformational free energy under a given scoring function.
Rosetta uses a number of scoring functions in two distinct phases: low-
resolution and full-atom. This “hierarchical” approach (174) was incorporated into
Rosetta for CASP6 (175). The low-resolution phase assumes that the conformational
search of a protein is biased by local structural preferences and that the free energy
minimum is selected by nonlocal interactions (162, 176). This is captured by building
the protein structure from fragments drawn from native protein crystal structures.
Thus, local interactions may be assumed to be at free energy minima and a coarse-
grained sampling of the nonlocal free energy landscape may be carried out (176).
During this phase, sidechains are represented by single atoms called centroids, thus
sacrificing atomic resolution for rapid sampling. All of the scoring functions employed
in this phase are dominated by the hydrophobic effect (162, 164) and are intended to
give the correct topology (162, 176). Full-atom refinement employing a single scoring
function is then carried out on each low-resolution model (151). This phase is intended
to give atomic resolution with correct packing (162). However, the full-atom scoring
function only tends to give accurate results when the starting low-resolution model is
within 3 Å of the native state, the “radius of convergence” (163, 174). Thus, the full-
atom phase is highly dependent on the success of the low-resolution phase. Together,
these two phases represent the belief that the native state lies at the bottom of a deep
minimum at the center of a broader basin (162, 173).
A number of recent works have claimed that the main challenge preventing
better structure prediction with Rosetta is sampling, particularly in the low-resolution
phase (150, 151, 163, 164). They suggest that improved sampling at low-resolution
would give more structures within the radius of convergence, and thus better full-atom
structures.
To address this issue, we have implemented the Simulated Tempering (ST)
sampling algorithm in the low-resolution phase of Rosetta. This algorithm is intended
to allow rapid barrier crossing by performing a random walk in temperature space. At
high temperatures broad sampling may be achieved, while at low temperature various
free energy minima may be explored. ST is a serial algorithm so it is amenable to an
automated distributed computing effort like Rosetta@home (151), whereas related
parallel algorithms like the Replica Exchange Method (REM) are not (27, 177).
METHODS
OVERVIEW OF ROSETTA
The standard Rosetta de novo structure prediction protocol (RSP) is designed to
predict the structure of a protein given its sequence. The algorithm begins with a fully
extended chain. First, a low-resolution phase is carried out in which side chains are
represented with centroids, single atoms which recapitulate the properties of the
sidechain. The centroids are located at the center of mass of the sidechain obtained
from averaging over all the conformations found in the PDB databank. A Monte Carlo
approach is used to substitute in segments from fragment libraries provided by the
user.
The fragment libraries consist of possible three- and nine-residue segments
from the PDB databank that match portions of the sequence. By default, 200 three-
residue and 200 nine-residue fragments are included for each overlapping segment of
the protein (176). Secondary structure predictions from PSIPRED (178), JUFO (179),
SAM (180), and PROF (181), are used to guide the selection of these segments (151).
These fragments are chosen such that the proportion of possible helix, strand, and
other configurations is equal to the average prediction of all the secondary structure
prediction programs used (176). One may install the software for generating these
segments locally or, as in the case of this work, use the Robetta server
(http://robetta.bakerlab.org/). Three- and nine-residue sequences are used as they have
the most significant correlations in local structure (182).
Only the torsion angles are modified when a fragment is inserted. Bond lengths
and angles are held constant. The values for the bond lengths and angles are taken
from CHARMM19 (183, 184). When generating the fragment libraries, the torsion
angles are modified from those in the PDB databank to maintain consistency with
these ideal bond lengths and angles (176).
Three major factors are intended to guide the algorithm to the native structure:
1) a series of scoring functions based on distributions from the PDB Databank and
Bayesian inference (48, 172), 2) returning to the lowest scoring structure found thus
far at regular intervals, and 3) a temperature schedule called quenching that is
designed to detect and escape local minima.
The possible components of each scoring function are described in detail
elsewhere (176). Since each bit of local structure comes directly from native proteins,
it is assumed to be at a free energy minimum. Thus, the low-resolution scoring
functions focus on giving a rapid coarse-grained approximation of the free-energy
landscape for nonlocal interactions and are meant to find the global topology (162).
One of the major driving forces is hydrophobic burial (151, 164).
Figure 30 shows the order in which the scoring functions are used. The final
low-resolution de novo structure prediction scoring function, score4, is supposed to be
able to distinguish native structures. The other scoring functions are mainly variants of
score4 meant to help bias the structure towards the native state as quickly as possible.
Rosetta begins with one or two cycles of 2000 Monte Carlo steps with score0, which
only has a Van der Waals term. This scoring function serves to insert a fragment in
each position of the extended chain in order to provide a more or less random starting
point for the subsequent scoring functions. Next, a single cycle of 2000 steps is carried
out with score1, which is meant to accumulate secondary structure (176). Rosetta
then performs five repetitions of a 2000 step cycle with score2 followed by a 2000
step cycle with score5. Score2 includes terms to favor collapse and beta strand pairing
while score5 is similar but lacks these two terms to allow some relaxation. Three
cycles of 4000 steps each are then carried out using score3. Score3 includes all of the
possible low-resolution (or centroid) terms, except for hydrogen bonding. The first
cycle of score3 uses the normal fragment insertion scheme. The remaining cycles use
smoothing steps as described by Rohl et al. (176) to make small perturbations that
relax the structure. Finally, score4, which does not have any compaction or beta-strand
pairing terms, is used to rank the lowest scoring structure seen so far.
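The cycle counts and step counts given above can be collected into a small data structure, which makes the total length of a default low-resolution run easy to check. This is only a bookkeeping sketch: the names `Stage`, `low_resolution_schedule`, and `total_steps` are hypothetical and are not part of Rosetta, and the sketch assumes that score2 and score5 strictly alternate for five repetitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    score: str   # name of the scoring function used in this cycle
    steps: int   # Monte Carlo steps in this cycle

def low_resolution_schedule(score0_cycles=1):
    """Ordered list of cycles in the low-resolution phase as described in the text."""
    if score0_cycles not in (1, 2):
        raise ValueError("the text describes one or two score0 cycles")
    stages = [Stage("score0", 2000)] * score0_cycles
    stages.append(Stage("score1", 2000))
    for _ in range(5):                       # five score2/score5 repetitions
        stages.append(Stage("score2", 2000))
        stages.append(Stage("score5", 2000))
    stages.extend([Stage("score3", 4000)] * 3)
    # score4 only ranks the lowest scoring structure, so it adds no steps.
    return stages

def total_steps(stages):
    return sum(stage.steps for stage in stages)
```

Under these assumptions a default run comprises 36,000 Monte Carlo steps with one score0 cycle and 38,000 with two.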
Figure 30. Flow chart showing the order the scoring functions are used in and giving brief descriptions
of each. After score5, Rosetta returns to score2 five times before progressing to score3. The first six
scoring functions constitute the low-resolution de novo structure prediction phase.
Beginning with score2 Rosetta returns to the lowest scoring structure seen so
far at the end of each cycle (approximately every 2000 steps). The temperature is
implicitly in units of kT. By default the temperature is initially set to 2 kT and is
updated using a quenching scheme. If 150 steps are performed without any being
accepted then it is assumed that a local minimum has been reached and the
temperature is increased by 1 kT, thus increasing the probability of accepting
subsequent moves (176). As soon as a step is accepted, the temperature is quenched:
that is, the temperature is immediately reset to 2 kT.
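The quenching rule described above can be written as a small state machine. The class below is an illustrative sketch, not Rosetta's implementation: the name `QuenchingSchedule` is hypothetical, and it assumes that the rejection counter restarts after each temperature increase, so that a further 150 consecutive rejections raise the temperature by another 1 kT.

```python
class QuenchingSchedule:
    """Temperature schedule: heat after repeated rejections, quench on acceptance."""

    def __init__(self, base_temp=2.0, increment=1.0, patience=150):
        self.base_temp = base_temp    # default starting temperature (kT)
        self.increment = increment    # temperature added when trapped (kT)
        self.patience = patience      # consecutive rejections taken to signal a local minimum
        self.temperature = base_temp
        self.rejections = 0

    def record(self, accepted):
        """Update the temperature after one Monte Carlo step and return it."""
        if accepted:
            # Quench: reset to the base temperature as soon as a move is accepted.
            self.temperature = self.base_temp
            self.rejections = 0
        else:
            self.rejections += 1
            if self.rejections >= self.patience:
                self.temperature += self.increment
                self.rejections = 0   # assumption: counter restarts after heating
        return self.temperature
```

A Monte Carlo driver would call `record` once per attempted move and use the returned temperature in the acceptance criterion for the next move.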
At present, it is standard practice to perform full-atom refinement on each of
the low-resolution models (151). An example command-line for generating a low-
resolution model of protein G and then refining it is as follows:
rosetta.gcc64 aa 1igd A -verbose -silent
-increase_cycles 10 -new_centroid_packing
-abrelax -output_chi_silent -stringent_relax
-vary_omega -omega_weight 0.5 -farlx
-ex1 -ex2 -termini -short_range_hb_weight 0.50
-long_range_hb_weight 1.0 -no_filters
-rg_reweight 0.5 -rsd_wt_helix 0.5
-rsd_wt_loop 0.5 -output_all -accept_all
-do_farlx_checkpointing -relax_score_filter
-record_irms_before_relax -acceptance_rate 1.0
-filter1a 10000 -filter1b 10000 -nstruct 1
-constant_seed -jran 1918492
The purpose of refinement is to get atomic level accuracy with correct packing
of sidechains (162). Exploring the full-atom free energy landscape is considerably
more expensive than exploring the low-resolution one because of the atomic resolution
and the inclusion of local interaction terms. To minimize the computational expense, it
is assumed that the low-resolution starting model has the correct topology and only
conservative backbone moves are made (176). It is hoped that these conservative
moves will also help to get adequate exploration in the context of a compact chain
where large moves are likely to cause clashes. The backbone moves include small
random perturbations to single torsion angles, alterations of a series of torsion angles
such that the global structure is preserved, and gradient descent. The torsion angle
potential is based on distributions from the PDB databank (151). Correct packing is
achieved by rotamer optimization (162, 163, 185). Solvation effects are captured by
employing the EEF1 implicit solvent (186). Even with the assumption that the starting
model has the correct topology, refining every model is still very expensive and is
only made possible through the use of distributed computing on Rosetta@home (151).
An important addition to the full-atom scoring function is a direction-
dependent hydrogen bonding term (187). This potential term is based on distributions
from the PDB and has been shown to provide better native state discrimination than
Coulomb-based hydrogen bonding terms like those found in standard molecular
dynamics packages (187) and to agree with quantum calculations (188). Both
backbone-backbone and sidechain hydrogen bonds are included but the backbone-
backbone terms have been found to provide the best native state discrimination (187).
Hydrogen bonds are short range interactions, so it is not surprising that this potential
gives the best discrimination for decoys within 1-3 Å of the native state.
A similar hydrogen bonding term is also included in a brief low-resolution
relaxation performed before beginning the full-atom refinement. This low-resolution
scoring function is called score6. Relaxation in this scoring function uses a
conservative move set similar to that in full-atom relaxation.
The final prediction is made by clustering the 100 to 1000 lowest scoring full-atom
models (151) by RMSD over Cα atoms and selecting those with the greatest number of
neighbors within a cutoff that depends on the size of the protein (173).
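The neighbor-counting idea behind this selection can be sketched as follows. The sketch assumes that a symmetric matrix of pairwise Cα RMSD values has already been computed; the function name `rank_by_neighbors` is hypothetical, and the actual Rosetta clustering procedure may differ in its details (for example, in how successive clusters are extracted).

```python
import numpy as np

def rank_by_neighbors(rmsd, cutoff):
    """Rank models by how many other models lie within `cutoff` RMSD of them.

    rmsd   : square symmetric array of pairwise RMSD values (same units as cutoff)
    cutoff : neighbor radius; in the text this depends on protein size
    Returns (order, neighbors): model indices from most to fewest neighbors,
    and the neighbor count for every model.
    """
    rmsd = np.asarray(rmsd, dtype=float)
    if rmsd.ndim != 2 or rmsd.shape[0] != rmsd.shape[1]:
        raise ValueError("rmsd must be a square matrix")
    within = rmsd <= cutoff
    np.fill_diagonal(within, False)        # a model is not its own neighbor
    neighbors = within.sum(axis=1)
    order = np.argsort(-neighbors, kind="stable")
    return order, neighbors
```

The first entries of `order` are the models that sit in the most densely populated regions of the low-score ensemble.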
MODIFICATIONS TO ROSETTA
SIMULATED TEMPERING
In this work, the Simulated Tempering (ST) sampling algorithm (24, 25) was
implemented in place of the default quenching temperature schedule. ST allows the
system to perform a random walk in temperature space with canonical sampling at
each temperature. At high temperatures the free energy landscape is flattened,
allowing broad sampling of conformation space. At low temperatures, barriers are
present and tend to confine the system to exploring a single free energy minimum. By
performing a random walk in temperature space a single run is able to explore
multiple minima, thus speeding convergence. For a detailed derivation of ST refer to
Huang et al. (177) and the original works (24, 25).
ST requires an initial temperature, a list of possible temperatures, and a list of
weights for each temperature as inputs. For this work the possible temperatures are
0.1, 0.25, 0.5, 0.75, 1, 2, 3, 5, 10, and 20 kT. At regular intervals an attempt is made to
change temperatures. For each attempt, the algorithm randomly decides to go either up
or down in temperature. The probability of accepting the attempt is
$$P(i \to j) = \min\left\{1,\; e^{-(\beta_j - \beta_i)\,U(X) + (g_j - g_i)}\right\} \tag{7.1}$$
where $P(i \to j)$ is the probability of transitioning from temperature i to temperature j,
$\beta_i = 1/(k_B T_i)$, $U(X)$ is the potential energy of conformation X (or in this case the
score), and $g_i$ is the weight of temperature i. Assuming the weights are properly
selected, this probabilistic temperature changing ensures that the detailed balance
condition for equilibrium is satisfied. That is,
$$P_i(X)\,P(i \to j) = P_j(X)\,P(j \to i) \tag{7.2}$$
where $P_i(X)$ is the probability of conformation X at temperature i and $P(i \to j)$ is the
probability of transitioning from temperature i to temperature j. In addition, it can also
be shown that for a correct set of weights
$$P(i \to j) = P(j \to i) \tag{7.3}$$
where $P(i \to j)$ is the probability of transitioning from temperature i to temperature j.
Furthermore, ensuring that Eq. (7.3) is satisfied is sufficient to yield correct weights
(189).
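A temperature-change attempt can be sketched as below, reading Eq. (7.1) as P(i→j) = min{1, exp[-(β_j - β_i) U(X) + (g_j - g_i)]} with β = 1/T for temperatures expressed in units of kT. The function names, the treatment of moves off either end of the temperature list (rejected here), and the use of Python's `random` module are assumptions for illustration and do not describe the Rosetta implementation.

```python
import math
import random

# Temperature list from the text, in units of kT.
TEMPERATURES = [0.1, 0.25, 0.5, 0.75, 1, 2, 3, 5, 10, 20]

def acceptance_probability(temp_i, temp_j, score, g_i, g_j):
    """Probability of accepting a move from temperature i to temperature j (Eq. 7.1)."""
    beta_i, beta_j = 1.0 / temp_i, 1.0 / temp_j
    exponent = -(beta_j - beta_i) * score + (g_j - g_i)
    if exponent >= 0:
        return 1.0                # min{1, e^x} = 1 whenever x >= 0; also avoids overflow
    return math.exp(exponent)

def attempt_temperature_move(index, score, weights, rng=random):
    """Randomly propose the neighboring temperature above or below and
    return the index of the temperature after the attempt."""
    proposed = index + rng.choice((-1, 1))
    if proposed < 0 or proposed >= len(TEMPERATURES):
        return index              # assumption: moves off the ladder are rejected
    p = acceptance_probability(TEMPERATURES[index], TEMPERATURES[proposed],
                               score, weights[index], weights[proposed])
    return proposed if rng.random() < p else index
```

Because only the difference g_j - g_i enters the exponent, adding a constant to every weight leaves the acceptance probabilities unchanged.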
From Eq. (7.1) it is evident that the probability of making a temperature
change is controlled by the energy distribution (or in this case score distribution),
temperature spacing, and the difference between the weights of a pair of neighboring
temperatures. To choose the temperature list and an initial set of weights, constant
temperature runs were carried out at a variety of temperatures ranging from 0.1 to 20
kT. Temperatures were selected such that weights could be found yielding
$P(i \to j) = P(j \to i) \approx 0.5$. Twenty iterations of 100 runs at 10 times the
default length were then carried out, updating the weights after each iteration to satisfy
Eq. (7.3). This protocol yielded converged weights for all of the systems studied and is
thus a plausible candidate for a fully automated system compatible with a distributed
computing environment like Rosetta@home.
OTHER OPTIONS
The user may change the frequency of temperature change attempts, the frequency of
outputting structures throughout a given run, and whether the program recovers the
lowest scoring structure seen thus far between cycles. Another option allows the
exploration of the final scoring function alone, thus removing any bias from the other
scoring functions or the regular returns to the lowest scoring structure.
STRUCTURE PREDICTION PROTOCOLS
Structure prediction is carried out employing the same procedure used by the Baker
lab in CASP7 (151), though full-atom refinement was only used in a subset of cases.
For each structure 10,000 independent runs are carried out using 10 times the default
number of steps for each cycle. Each low-resolution run took about 2 min on an Intel
E5345 quad-core 2.33 GHz processor. Full-atom refinement of a model took an
additional 4 min. The lowest scoring structure from each run is stored. All of these
structures are clustered by RMSD and the top five cluster centers are selected as the
best predictions.
A similar procedure was used for comparing the various scoring functions. In
this case, 1000 independent runs each with 100 times the default number of steps were
performed to ensure adequate exploration of the whole space. Each RSP run using the
full sequence of scoring functions took about 15 min on an Intel E5345 quad-core 2.33
GHz processor while using just score4 took 30 min. The ST variant took 30 min and 1
hr when using all the scoring functions and just score4 respectively. For ST, an
independent set of weights is used for each scoring function to ensure canonical
sampling of each of them.
To characterize the native state, the crystal structure was idealized and relaxed
50 times. Idealization consists of setting the bond lengths and angles to ideal values.
Relaxation is carried out to compensate for any deleterious effects resulting from the
idealization process. Relaxation may be carried out in either the low-resolution score6
space or the full-atom scoring function using the conservative move set described in
the “Overview of Rosetta” section. The native structure used as the starting point for
many of the runs described below was the one with the lowest RMSD out of the 50
idealization/relaxation runs carried out in the low-resolution score6 space.
For protein G, this structure has an RMSD of 0.88 Å. Projections of the free
energy landscapes from ST runs were generated using the Multistate Bennett
Acceptance Ratio (MBAR) estimator (190), a variant of the Weighted Histogram
Analysis Method (WHAM) (191), to make use of data from all of the temperatures.
Projections from RSP were generated using all the data. These plots are analogous to
free energy landscapes but we note that RSP does not guarantee a canonical
distribution.
RESULTS
COMPARISON OF ST AND STANDARD ROSETTA
The structure prediction protocol was carried out for four systems of varying size: an
SH3 domain (58 residues, PDB code 1shf (192)), protein G (61 residues, PDB code
1igd (193)), ubiquitin (76 residues, PDB code 1d3z (194)), and a zinc finger (136
residues, PDB code 2j6a (195)). All are X-ray structures with a resolution less than 2
Å with the exception of ubiquitin, which is the sole NMR structure evaluated in this
work.
SH3 DOMAIN
In a previous work Bradley et al. found poor results for the SH3 domain studied here
(150). They attributed these poor results to inadequate sampling; thus, this target
seemed like a good test for enhanced sampling. The ST results proved to be
qualitatively similar to those from RSP, as shown in Figure 31. Both algorithms
identified a pronounced score minimum at high RMSD, which may explain the poor
results from the previous study. Despite this similarity, ST did yield slightly broader
sampling of the low score space. Figure 31 also shows black plus-signs corresponding
to the idealized native structure relaxed with the score6 low-resolution scoring
function, which includes a direction dependent hydrogen bonding term, and then
scored with the final scoring function, score4. This data indicates that the native
structure is stable in score6 but is not recognized as native by the final scoring
function.
Figure 31. Score versus RMSD (Å) for an SH3 domain (PDB code 1shf). Each diamond represents the
lowest scoring structure for a single run. Data for ST is shown in blue while data for standard
Rosetta is shown in red. The black “+” symbols represent models obtained by idealizing and
relaxing the crystal structure in low-resolution mode.
PROTEIN G
Protein G was chosen as a small and tractable target. Figure 32(A) shows the results of
low-resolution de novo structure prediction while panel (B) shows the results for full-
atom refinement of the low-resolution models. Both RSP and the ST variant perform
well on this system, finding low scoring structures with RMSD values as low as 1.5 Å.
The lowest scoring de novo structures had RMSD values of about 2 Å but there is a
clear correlation between low score and low RMSD. The ST and RSP results were
qualitatively identical for both low-resolution runs and full-atom refinement. Full-
atom refinement does not appear to greatly change the RMSD. On average the RMSD
changed by -0.03 Å with a standard deviation of 0.3 Å between the low-resolution and
full-atom phases. The black plus-signs in Figure 32(A) demonstrate that the native
structure is stable in score6 but not recognized by score4, as was the case for 1shf.
Figure 32(B), however, shows that the full-atom scoring function assigns low scores to
the idealized native structure when relaxed in the full-atom scoring function (yellow
circles). Furthermore, the low-resolution relaxed native structures that were not
recognized by score4 are assigned low scores after relaxation with the full-atom
scoring function (black asterisks). In fact, these structures are closer to the full-atom
idealized/relaxed structures than any of the de novo structures.
Figure 32. Score versus RMSD (Å) for protein G (PDB code 1igd). Each diamond represents the
lowest scoring structure for a single run. Data for ST is shown in blue while data for standard
Rosetta is shown in red. Panel (A) shows results from the low-resolution phase. The black “+”
symbols represent models obtained by idealizing and relaxing the crystal structure in low-
resolution mode. Panel (B) shows results from the full-atom phase. The yellow circles represent
models obtained by idealizing and relaxing the crystal structure in full-atom mode. The black “*”
symbols are full-atom models obtained by relaxing the low-resolution structures depicted by “+”
symbols in (A) using the full-atom scoring functions.
UBIQUITIN
Ubiquitin was selected due to its larger size and to evaluate the accuracy on NMR
structures. Both Rosetta variants gave equivalently good results on this system (data
not shown). Like protein G, there was a general correlation between score and RMSD
and structures with as low as 2.5 Å RMSD were reached. Once again, the idealized
and relaxed native structure was not assigned a low score by score4.
ZINC FINGER
Finally, the 136 residue zinc finger was a CASP7 target selected to push the limits of
the algorithms. Rosetta tends to perform well on proteins with less than 100 residues
(151). If ST is indeed giving enhanced sampling then one would expect for it to
outperform RSP on larger systems. However, once again both Rosetta variants gave
equivalent and poor results, with RMSD values no less than 10 Å, and the idealized
and relaxed native structures were not recognized by score4 (data not shown).
The close agreement between the ST results and those of RSP in all cases
indicates that ST is not giving enhanced sampling. The most probable explanations are
that ST is not capable of escaping free energy minima in the Rosetta scoring functions
or that RSP is correctly identifying all the accessible minima of the scoring functions.
To explore these alternatives a more extensive analysis of the protein G results is
conducted as this small system allows many trials to be run. Results presented by
Alena Shmygelska and Michael Levitt at our Structural Biology Retreat on Nov. 15,
2007 showed that both Temperature replica exchange and Hamiltonian replica
exchange do sample significantly better than the original Rosetta Monte Carlo method.
VALIDATION OF ST
Figure 33 shows an example of the evolution of the weights throughout the weight
determination protocol, demonstrating that it yields converged weights. The difference
in weights is plotted as it is this difference, and not the absolute value of the weights,
that determines the acceptance probability. The weight differences at high
temperatures converge very quickly, consistent with a more or less flat free energy
surface allowing broad sampling. The weight differences at low temperatures converge
more slowly, consistent with a more rugged landscape and restricted sampling.
Independent runs of the weight determination procedure also produced more or less
equivalent results (data not shown). The convergence of the weights is good evidence
that the algorithm is working properly. These converged weights yield equal sampling
of temperature space and multiple visits to both high and low temperatures in each
run.
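The role of the weight difference in the acceptance probability can be illustrated with a minimal sketch. It assumes the textbook simulated tempering form, in which the joint distribution over conformations and temperatures is proportional to exp(−β_i E + g_i); the function and variable names are hypothetical and are not taken from the Rosetta code.

```python
import math
import random

def st_accept_prob(score, beta_i, beta_j, g_i, g_j):
    """Metropolis probability of moving from temperature i to temperature j.

    score  : current score (energy) of the conformation
    beta_* : inverse temperatures 1/kT (assumed convention)
    g_*    : dimensionless simulated tempering weights
    Only the difference g_j - g_i enters, never the absolute weights.
    """
    log_ratio = -(beta_j - beta_i) * score + (g_j - g_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def attempt_temperature_move(score, i, j, betas, weights, rng=random):
    """Return the temperature index after one attempted move i -> j."""
    p = st_accept_prob(score, betas[i], betas[j], weights[i], weights[j])
    return j if rng.random() < p else i
```

Adding the same constant to every weight leaves each acceptance probability unchanged, which is why Figure 33 tracks the differences rather than the absolute values.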
Figure 33. Evolution of the score4 weights for protein G. The dashed line is the difference between the
weights of the highest two temperatures: 10 and 20 kT. The solid line is the difference between the
weights of the lowest two temperatures: 0.1 and 0.25 kT. The first points come from constant
temperature runs and subsequent points represent each iteration of refining the weights. Δg = g_j − g_i,
where j > i.
Figure 34(D)–(F) show projections of the free energy surface onto score versus
RMSD for protein G runs started from the native structure at a range of temperatures
(0.1, 2, and 20 kT, respectively). Figure 34(D) shows that at low temperature, the
system spends considerably more time at low score and low RMSD. The higher scores
and RMSDs at high temperature show that high enough temperatures are being
reached for the system to escape local minima and achieve broad sampling.
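A projection of this kind can be sketched as a two-dimensional histogram converted to free energies via F = −ln p. The sketch below is an illustration only: the bin widths and function name are assumptions, and it presumes that all samples passed in were collected at a single temperature.

```python
import math
from collections import Counter

def free_energy_projection(rmsds, scores, rmsd_width=1.0, score_width=5.0):
    """Project samples onto (RMSD, score) bins and return F = -ln(p) per bin.

    Free energies are in units of kT at the sampling temperature and are
    shifted so that the most populated bin has F = 0. Empty bins are omitted.
    """
    if len(rmsds) != len(scores):
        raise ValueError("rmsds and scores must have the same length")
    counts = Counter(
        (math.floor(r / rmsd_width), math.floor(s / score_width))
        for r, s in zip(rmsds, scores)
    )
    total = sum(counts.values())
    free = {b: -math.log(n / total) for b, n in counts.items()}
    f_min = min(free.values())
    return {b: f - f_min for b, f in free.items()}
```

Each key is a pair of integer bin indices; multiplying by the bin widths recovers the lower edge of the bin in Å and score units.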
Figure 34. Projections of the free energy landscape onto score versus RMSD (Å) for protein G in
score4 using: (A) standard Rosetta runs starting from an extended chain, (B) standard Rosetta runs
starting from the native state, (C) ST runs at 0.1 kT starting from an extended chain, (D) ST runs at
0.1 kT starting from the native state, (E) ST runs at 2 kT starting from the native state, (F) ST runs
at 20 kT starting from the native state. Each white plus-sign corresponds to the lowest scoring
structure for a single run. The lowest scoring structures from each run were sorted by RMSD and
only every twentieth point is shown so as to give the entire range without obscuring the underlying
plot.
THE FINAL SCORING FUNCTION
In theory, the sequence of scoring functions employed by Rosetta was chosen to bias
the system to low score and low RMSD as quickly as possible. The final low-
resolution scoring function (score4) is only applied to the lowest scoring structure
found while exploring the previous scoring function (score3). Thus, it is difficult to
judge whether or not the system is truly being biased towards the global free energy
minimum of score4. To test this, both ST and RSP were applied to the final scoring
function in isolation.
Figure 34 shows that this analysis yields qualitatively similar results for both
ST and RSP, particularly at low score and low RMSD. The agreement between the
results generated with the same algorithm but different starting states demonstrates
that the landscapes are converged and, therefore, represent the entire accessible space.
The agreement between the ST and RSP results suggests that both algorithms are
identifying the free energy minimum. Furthermore, the ST results show that running at
lower temperatures does not significantly shift the free energy minimum towards the
native state. Limiting the temperature range used by ST to 0.1–3 kT to more closely
parallel the temperatures explored by RSP gave similar results (data not shown).
A rough time course of the evolution of the RSP landscapes was found by
plotting projections of the free energy landscape for the first third of each run, the
second third, and the final third. All three plots were identical to Figure 34(A),(B)
(data not shown), further supporting the conclusion that the landscape is converged
and indicating that the algorithm is capable of crossing any barriers present on short
timescales. Analyzing the temperature throughout the RSP runs shows that 90% of
the runs increased their temperature to 3 kT at some point but that only about 10%
increased their temperature to 4 kT and none reached temperatures greater than 6 kT.
Moreover, less than 10% of the time was spent at temperatures greater than 2 kT.
The global minimum in this projection covers a range of about 5–20 Å. The
differences present between the RSP and ST results are due to the greater temperature
range and approximately equal sampling at each temperature in ST. Visits to lower
RMSD values rapidly become less likely below about 5 Å. Conformational clustering
was carried out to confirm that the projection onto score versus RMSD is not hiding a
highly populated minimum closer to the native state. The centers of the top 10 most
populated clusters were found to fall within the minimum of the landscape, confirming
the validity of this projection. Thus, the final scoring function provides only a small
bias towards the native state and has only small barriers.
Figure 34 also includes white plus-signs for the lowest scoring structure visited
for each run. Once again, there is strong agreement between RSP and ST started from
both native and extended structures. These points fall in a range of about 2.5–15 Å.
The spread is slightly larger for ST runs, which is to be expected given that more time
is spent at higher temperatures.
The large variation of the score with RMSD is also of interest. It seems that
while low scores are correlated with low RMSD values, high scores are not necessarily
indicative of high RMSD values.
THE OTHER SCORING FUNCTIONS
Given this analysis of the final scoring function, it is interesting to ask where the bias
towards the native state seen in the RSP results comes from. Two possible answers
are: (1) the sequence of other scoring functions and (2) the frequent returns to the
lowest scoring structure found so far. To explore these possibilities 1000 runs at 100
times the default length were carried out with both ST and RSP with and without the
regular returns to the lowest scoring structure.
Figure 35 shows projections of the free energy landscape onto score versus
RMSD for each of the Rosetta scoring functions. The first two columns show general
agreement between RSP runs with and without returns to the lowest scoring structure.
Strong agreement between ST runs with and without returns was also found (data not
shown). General agreement is found between the RSP and ST results as well. Though
ST has a stronger bias towards lower scoring structures due to the lower temperatures
reached, there is no apparent bias towards lower RMSD structures. Figure 35 also
includes white plus-signs corresponding to the lowest scoring structure for a single run
in the given scoring function. Although these points are slightly more localized to low
RMSD when the low scoring structure is regularly recovered, lower RMSD values are
not reached. Thus, it would seem any bias comes from the sequence of scoring
functions employed, though returning to the lowest scoring structure between cycles
may speed up the process slightly.
Figure 35. Projections of the free energy landscape onto score versus RMSD (Å) for protein G. Each
white plus-sign corresponds to the lowest scoring structure for a single run. The lowest scoring
structures from each run were sorted by RMSD and only every twentieth point is shown so as to
give the entire range without obscuring the underlying plot. (A), (D), (G), and (J) show data from
standard Rosetta runs with frequent recovery of the lowest scoring structure in score1, score2,
score5, and score3 respectively. (B), (E), (H), and (K) show data from standard Rosetta runs
without frequent recovery of the lowest scoring structure in score1, score2, score5, and score3
respectively. (C), (F), (I), and (L) show data from ST runs at 0.1 kT without frequent recovery of
the lowest scoring structure in score1, score2, score5, and score3, respectively.
In fact, Figure 35 indicates that this is so. Score1 shows the broadest sampling
and the least bias towards low score and RMSD. Score2, on the other hand, has a
global free energy minimum at about 5 Å. Score5 once again allows slightly broader
sampling. Relatively broad sampling is achieved by score3 as well; however, this
scoring function reaches the lowest RMSD values and is the only one to have a global
minimum extending well below 5 Å. Based on these results it appears that score2 and
score3 provide the greatest biasing force towards the native state. The main
distinguishing feature of these scoring functions is the inclusion of compaction and
beta-strand pairing terms.
DISCUSSION
In previous works, many of the limitations of Rosetta have been attributed to
inadequate sampling, particularly at low-resolution (150, 151, 163, 164). However, the
failure of the Simulated Tempering (ST) sampling algorithm to give any improvement
on a range of targets indicates that this may not be the case. The fact that ST does not
yield any improvement on a larger target where any increase in sampling would be
most beneficial is particularly noteworthy. Plots of the free energy landscape at
varying temperatures demonstrate that ST is working and that these results are not just
due to a failure to reach sufficiently high temperatures.
The results presented in this work indicate that the low-resolution scoring
functions are the main limitations to Rosetta’s performance, and not the sampling
methods employed. This conclusion is supported by the fact that the final low-
resolution scoring function (score4) fails to recognize the native structure for any of
the targets examined. Furthermore, the free energy landscape for the score4 function
has a broad minimum ranging from 5–20 Å RMSD. This landscape is judged to be
converged based on agreement between data generated from two different starting
conformations and a rough time course of the landscape. If the low-resolution scoring
functions were capable of recognizing the native state but it were just very difficult to
get there, one might expect ST runs starting from the native state to give worse results
than standard Rosetta runs started from the native state, because increasing the
temperature would cause unfolding without subsequent refolding. The fact that
standard Rosetta runs in score4 started from the native state are no more likely to
identify near-native states highlights the bias of the low-resolution scoring functions
away from the native state. Two of the other low-resolution scoring functions are
found to have minima closer to the native state, presumably due to the inclusion of
compaction and beta-strand pairing terms. However, this minimum at around 5 Å is
still insufficient to satisfy the 3 Å radius of convergence required by full-atom refinement
(163, 174). In principle, lower temperatures may promote more sampling at
lower scores but the ST results at 0.1 kT show that in practice this doesn’t provide
much, if any, improvement over the standard 2 kT quenching runs. Finally, plotting
the free energy landscape of structures throughout numerous runs rather than just a
scatter plot of the lowest scoring structure from each run demonstrates that there is
only weak correlation between low score and low RMSD. Although low scores may
be indicative of low RMSD values, high scores are not necessarily correlated with
high RMSD values. Together, these results indicate that the low-resolution scoring
functions do indeed allow rapid and complete exploration of a coarse grained
landscape as intended but this landscape does not have the desired near-native free
energy minimum.
The conclusion that the low-resolution scoring functions are the main limiting
factor in Rosetta is also supported by a number of recent works. For example, Misura
et al. note that generating more low-resolution models and then refining them did not
improve the accuracy of de novo structure prediction while performing more
independent refinement runs of each low-resolution model did (163). This observation
seems to indicate that sampling is a problem at high-resolution but not low-resolution.
Furthermore, the presence of false minima in the low-resolution scoring function has
been acknowledged (150, 151).
One approach to addressing these issues would be to improve the low-
resolution scoring function. This could be accomplished through improving either the
thermodynamics, kinetics, or both. On the thermodynamic front, the ideal case would
be to have a scoring function free energy minimum near the native state. This would
ensure that each run would be likely to find a near-native state and preferably spend a
significant amount of time sampling near-native regions. One way to achieve this goal
would be to make native and near-native states have even lower scores to compensate for
the apparent entropic benefits of higher RMSD structures.
An alternative approach would be to think more about kinetics. At present
most runs seem to find a lower scoring structure than those in the free energy
minimum but do not appear to spend a great deal of time there. Assigning even lower
scores to these structures could bias the sampling towards these regions. However, it
may be the case that it is just too difficult to get to these structures. To illustrate, one
can imagine that a map showing the locations of cities but no roads would give an
accurate indication of the distance between two points but no indication of the fastest
route between them. Likewise, assigning a structure a low score may accurately
identify it as a native-like structure but not solve the problem of getting to that
structure from an extended conformation.
One final way of improving the low-resolution scoring functions would be to
improve the move set. In one recent work, it is acknowledged that the ‘‘single-
fragment insertion approach makes many global conformers dynamically
inaccessible’’ (176). Incorporating more conservative moves could improve results.
However, making smaller moves would slow down exploration of the space, defeating
the purpose of having a low-resolution phase in the first place.
Alternatively, one could forego the low-resolution phase altogether in favor of
increased sampling at full-atom resolution. The hierarchical approach was developed
in recognition of the fact that using more detail from the beginning was too expensive
but that low-resolution models in isolation make prohibitive simplifications (174).
However, creating an adequate and generalizable low-resolution model may not be
worth the cost and effort. If the low-resolution scoring functions are not accurate
enough then they will tend to bias the structure away from the native state. At present,
the low-resolution phase gives structures in the range of 3–6 Å starting from an
extended chain (176) and this work shows that starting from the native state gives
equivalent results.
Furthermore, the full-atom refinement carried out in this work gave negligible
changes in RMSD, though other works have claimed 1.5–4 Å changes (163). In either
case, the low-resolution phase is unlikely to give many structures close enough to the
native backbone structure. And, without the correct backbone structure it is nearly
impossible to get the correct packing (162). In addition, the problems found in the
low-resolution phase are compounded by inaccurate secondary structure prediction
(151). This dependency could be removed by working solely at full-atom resolution.
The sampling required for such an endeavor is daunting, but recent distributed
computing efforts such as Folding@home (79) and Rosetta@home (151) may make it
feasible. Furthermore, the fact that the Rosetta full-atom scoring function is starting to
yield improvements in homology modeling shows that its accuracy is promising
(151). Finally, some recent work in purely full-atom structure prediction without the
sampling power of a distributed computing platform shows that this approach may be
viable (196, 197). One can even imagine a new hierarchical approach in which a large
move set full-atom phase is followed by a second phase with more conservative
moves. Of course, developing such a method would still require careful attention to
the difference between native state discrimination and a function capable of guiding a
non-native state to the native state.
CONCLUSIONS
We have implemented the Simulated Tempering (ST) sampling algorithm in Rosetta
to test whether improved sampling in the low-resolution phase can improve Rosetta
structure prediction. The low-resolution Rosetta scoring functions are shown to be
adequately sampled by both standard Rosetta and a Simulated Tempering variant.
Agreement between data generated from an extended and a native conformation
supports the conclusion that the entire space is being sampled. Similar agreement
between the results from both algorithms indicates that the scoring functions do not
have near-native free energy minima. Thus, the low-resolution scoring functions, and
not sampling at low-resolution, are the main limitation to accurate Rosetta structure
prediction.
Structure prediction with Rosetta may be improved by correcting the low-
resolution scoring functions. However, given current computational resources, such as
Folding@home and Rosetta@home, it may be time to work at full-atom resolution
from the beginning. Such an endeavor would require careful consideration of kinetics.
Although functions designed for native-state discrimination may be able to correctly
distinguish between native and nonnative conformations, that does not necessarily
indicate that they are well suited for guiding nonnative conformations to the native
state. Even if one does not care about predicting physical kinetics (e.g. the rate of
folding in simulation compared with experiment), rapid kinetics of reaching the native
state is crucial for the convergence of the simulation results and the general efficiency
of the method.
CHAPTER 8: THE ROLES OF ENTROPY AND KINETICS IN STRUCTURE
PREDICTION
This chapter was taken from: Bowman GR & Pande VS (2009) The roles of entropy
and kinetics in structure prediction. PLoS One 4:e5840.
ABSTRACT
Here we continue our efforts to use methods developed in the folding mechanism
community to both better understand and improve structure prediction. Our previous
work demonstrated that Rosetta’s coarse-grained potentials may actually impede
accurate structure prediction at full-atom resolution. Based on this work we postulated
that it may be time to work completely at full-atom resolution but that doing so may
require more careful attention to the kinetics of convergence. To explore the
possibility of working entirely at full-atom resolution, we apply enhanced sampling
algorithms and the free energy theory developed in the folding mechanism community
to full-atom protein structure prediction with the prominent Rosetta package. We find
that Rosetta’s full-atom scoring function is indeed able to recognize diverse protein
native states and that there is a strong correlation between score and Cα RMSD to the
native state. However, we also show that there is a huge entropic barrier to folding
under this potential and the kinetics of folding are extremely slow. We then exploit
this new understanding to suggest ways to improve structure prediction. Based on this
work we hypothesize that structure prediction may be improved by taking a more
physical approach, i.e. considering the nature of the model thermodynamics and
kinetics which result from structure prediction simulations.
INTRODUCTION
In 1961 Anfinsen demonstrated that the native state of a protein is encoded in its
amino acid sequence and hypothesized that the native state is the lowest free energy
state (46). Since then, many researchers have dedicated their careers to understanding
the driving forces underlying protein folding in order to 1) predict the native states of
proteins from their amino acid sequences and 2) understand the mechanisms and
pathways by which proteins fold. Collectively, these components constitute the protein
folding problem (53, 70).
The protein structure prediction community has generally focused on finding a
protein’s native state based on its sequence. A typical approach is to develop a
knowledge-based scoring function to discriminate native structures from non-native
ones and to sample this potential in search of the global minimum (198). For example,
the Rosetta structure prediction package uses a Monte Carlo (MC) scheme to sample a
series of scoring functions with increasing levels of chemical detail in order to identify
protein native states (48, 49, 150). In Rosetta and many other structure prediction
schemes, the problem of finding the free energy minimum is simplified by focusing on
the energetic (or score) term (199). We note that Rosetta includes a simple implicit
solvent and some implicit accounting for entropy by using information from known
structures but stress that it does not explicitly account for conformational entropy. This
simplification is justified by arguing that the conformational entropy of the native state
is negligible and, therefore, the energetic term must be the dominant factor favoring
the native state and the energy minimum should be equivalent to the free energy
minimum. This approach has proved remarkably successful and has resulted in the
design of a protein with a novel fold (165), accurate high-resolution structure
predictions for small globular proteins (151), and the design of novel enzymes (171).
However, ignoring conformational entropy will have increasingly deleterious effects
on the landscape as one moves away from the native state and this may ultimately
prevent accurate structure prediction for more complex systems.
In contrast, researchers studying folding mechanisms have placed less
emphasis on predicting native states and focused on understanding how proteins fold.
This work is also based on potentials, or force fields. However, these potentials have
been designed to reproduce our physical reality rather than to simply discriminate
native and non-native protein structures. Furthermore, much emphasis has been placed
on understanding the entire free energy landscape and the kinetics of traversing this
landscape (53). To accomplish these objectives numerous advanced sampling
algorithms have been developed (21), as well as methods to visualize free energy
landscapes (52) and determine whether or not they represent the true equilibrium
distribution of the system under the given potential (177).
Here we continue our efforts to use methods developed in the folding
mechanism community to both better understand and improve structure prediction.
Our previous work demonstrated that Rosetta’s coarse-grained potentials may actually
impede accurate structure prediction at full-atom resolution (49) and this result has
been confirmed by other researchers (200). Based on this work we postulated that it
may be time to work completely at full-atom resolution but that doing so may require
more careful attention to the kinetics of convergence. To explore this possibility, we
have used Generalized Ensemble (GE) algorithms (21) to generate projections of the
landscape defined by Rosetta’s full-atom scoring function. We find that these scoring
functions are capable of recognizing the native states of both protein G and engrailed
homeodomain, an α/β and all α-helix protein, respectively. Furthermore, the score has
the desired correlation with Cα RMSD to the native state. However, there is a huge
entropic barrier to folding and the hydrogen bonding potential does not provide any
significant bias towards the native state, slowing the kinetics of convergence. Based
on these insights, we believe that further advances in structure prediction may be made
by taking advantage of methods and ideas developed in the folding mechanism
community.
RESULTS & DISCUSSION
GENERAL APPROACH
In order to gain a deeper understanding of Rosetta’s full-atom resolution scoring
function we have implemented a variant of the Simulated Tempering (ST) algorithm
(24, 25) in Rosetta. ST was originally intended to induce the system of interest to
perform a random walk in temperature space so that broad sampling at high
temperatures would improve mixing at lower temperatures. However, ST may be
generalized to other spaces (24). Here we define an RMSD space consisting of a
number of umbrellas constraining the system to a given Cα RMSD from the native
state. ST is then used to induce the system to perform a random walk in RMSD space
without making any alterations to the temperature (201). Furthermore, we only use
MC moves rather than the combination of MC and minimization moves used in the
standard Rosetta protocol. Thus, the system can move back and forth between the
folded and unfolded states while remaining at equilibrium. Exchanging between
umbrellas also allows the system to access all the possible conformations in a given
RMSD range (202). By performing many simulations in parallel we hope to explore
all the relevant folding pathways. Figure 36 shows that this procedure results in
reversible folding (i.e. multiple folding and unfolding events), confirming that our
simulations have reached convergence (203). The Multistate Bennett Acceptance
Ratio (MBAR) method (190), a statistically optimal variant of the Weighted
Histogram Analysis Method (WHAM) (191), is used to determine the unbiased
average values of thermodynamic properties such as energies and conformational
entropies as a function of the RMSD. All the thermodynamic measurements in this
work are dimensionless. That is, energies and free energies are given in units of the
thermal energy kT and entropies are given in units of the Boltzmann constant k.
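The random walk between umbrellas can be sketched in the same way as a walk between temperatures. The example below assumes harmonic restraints on the Cα RMSD with a single spring constant and one weight per umbrella; the functional form of the restraint, the parameter values, and all names are assumptions made for illustration and are not taken from the implementation described here.

```python
import math

def umbrella_bias(rmsd, center, spring=1.0):
    """Assumed harmonic restraint (in units of kT) holding the RMSD near `center`."""
    return 0.5 * spring * (rmsd - center) ** 2

def umbrella_move_prob(rmsd, center_i, center_j, g_i, g_j, spring=1.0):
    """Metropolis probability of switching from umbrella i to umbrella j.

    The conformation (and hence its RMSD and score) is unchanged by the move,
    so only the change in bias energy and the weight difference contribute.
    """
    delta_bias = umbrella_bias(rmsd, center_j, spring) - umbrella_bias(rmsd, center_i, spring)
    log_ratio = -delta_bias + (g_j - g_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)
```

With equal weights, a conformation at 3.0 Å always accepts a move from an umbrella centered at 4.0 Å to one centered at 3.0 Å, while the reverse move is accepted with probability exp(−0.5) under these assumed parameters.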
Figure 36. Time evolution of the Cα RMSD of the current umbrella center for five representative
simulations demonstrating the presence of reversible folding.
We have applied this method to two systems: protein G (PDB code 1igd) (193)
and engrailed homeodomain (PDB code 1enh) (204). Protein G has an α/β fold while
engrailed homeodomain (EH) is a 3-helix bundle. Because these systems contain both
major protein secondary structure motifs our conclusions should be applicable to most
protein systems.
A THERMODYNAMIC PERSPECTIVE
The average energy (or score), conformational entropy, and free energy as a function
of the RMSD for both protein G and EH are shown in Figure 37. The average score
has a clear correlation with the RMSD and the native state is at the scoring function’s
global minimum for both systems. Thus, Rosetta’s full-atom scoring function is
indeed able to recognize diverse protein native states. However, the conformational
entropy of the native state is extremely low for both proteins. In fact, at the
temperature used during full-atom Rosetta structure prediction during the CASP
competitions (0.8 in arbitrary units, internal to the Rosetta code) the entropy
dominates the free energy. As a result, the native state is the free energy maximum
instead of the desired minimum.
Figure 37. Average energy (<∆E>), conformational entropy (<∆S>), and free energy (<∆F>) as a
function of Cα RMSD for protein G and engrailed homeodomain (EH).
This observation gives some insight into the limitations currently observed
with Rosetta structure prediction. Rosetta uses a hierarchical approach in which
coarse-grained structure predictions are made and then used as starting points for full-
atom refinement (49). A number of recent works have noted that for full-atom
refinement to be successful, i.e. reach RMSD values less than 2 Å, the initial
configuration must be within a “radius of convergence” of about 3 Å from the native
state (150, 199). Our results show that the free energy difference between 3 Å and 2 Å
is about 5 kT and, therefore, sampling a 2 Å structure when starting from a 3 Å
structure is extremely unlikely. The improbability of moving to lower RMSD
structures is consistent with the fact that one to ten thousand independent runs must be
performed in order to find a few accurate full-atom structures with Rosetta’s ab initio
structure prediction protocol (151).
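As a rough illustration of what a 5 kT difference implies, the equilibrium population of the higher free energy region relative to the lower one is given by the Boltzmann factor exp(−ΔF/kT). The snippet below is a back-of-the-envelope sketch that treats the 2 Å and 3 Å regions as two states at equilibrium, which is a simplifying assumption.

```python
import math

def relative_population(delta_f_kt):
    """Equilibrium population ratio for a free energy difference given in units of kT."""
    return math.exp(-delta_f_kt)

ratio = relative_population(5.0)
print(f"{ratio:.4f}")  # prints 0.0067
```

Under this two-state picture, the 2 Å region is populated roughly 150 times less often than the 3 Å region.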
TEMPERATURE DEPENDENCE OF THE FREE ENERGY
The relative importance of the energetic and entropic contributions to the free energy
may be tuned by adjusting the temperature (ΔF = ΔE − TΔS). Namely, the energetic
term will dominate at sufficiently low temperatures while the entropic term will
dominate at higher temperatures. By assuming that the average energy and
conformational entropy are independent of temperature we are able to predict the
temperature dependence of the free energy. We can then predict what temperature one
would have to use in Rosetta structure prediction in order for the free energy
landscape to have the desired correlation with the RMSD.
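This extrapolation can be sketched directly from the relation ΔF(T) = ΔE − TΔS, holding ΔE and ΔS fixed at their measured values. The function names are hypothetical, and the numbers in the usage example are invented for illustration; they are not read from Figure 37.

```python
def free_energy_profile(energies, entropies, temperature):
    """Return dF = dE - T*dS for each RMSD bin, assuming dE and dS do not depend on T."""
    if len(energies) != len(entropies):
        raise ValueError("energies and entropies must have the same length")
    return [e - temperature * s for e, s in zip(energies, entropies)]

def crossover_temperature(delta_e, delta_s):
    """Temperature at which dF = dE - T*dS changes sign (requires delta_s != 0)."""
    return delta_e / delta_s

# Hypothetical two-bin example: native bin as the reference (0, 0) and an
# unfolded bin that is higher in energy but also higher in entropy.
energies = [0.0, 10.0]    # made-up values
entropies = [0.0, 20.0]   # made-up values
high_t = free_energy_profile(energies, entropies, 0.8)  # unfolded bin below native
low_t = free_energy_profile(energies, entropies, 0.1)   # unfolded bin above native
```

For these invented values the sign of the unfolded-state free energy flips at `crossover_temperature(10.0, 20.0)`, so the native state is favored only below that temperature.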
We find that the free energy landscape has the desired shape (i.e. stable native
state, unstable unfolded state) at temperatures below 0.5, as shown in Figure 38. At
temperatures above 0.5 the free energy landscape still has a maximum at the native
state. At a temperature of about 0.5 there are still non-trivial barriers between the
native and unfolded state but the free energy landscape is essentially flat relative to
other temperatures.
Figure 38. Average free energies (<∆F>) as a function of Cα RMSD for temperatures of 0.5 and 0.1 for
protein G and engrailed homeodomain (EH). The black lines are the hypothesized free energy at
the given temperature and the dash-dot lines are the free energy at temperature 0.8 shown for
reference.
EXPLOITING THE TEMPERATURE DEPENDENCE
While the projections of the thermodynamic landscapes shown in Figure 37 and
Figure 38 appear to be smooth, the true landscapes are actually quite rugged due to
energetic terms like hydrogen bonding and van der Waals interactions. In order to
explore this space the standard Rosetta full-atom refinement protocol uses a
combination of MC and minimization moves (49). The minimization moves are
intended to guide the protein towards the native state at the energy minimum while the
MC moves are intended to help the protein overcome small barriers. For the MC
moves to perform this function they must use a sufficiently high temperature to
overcome small barriers but a low enough temperature to avoid mitigating the
effectiveness of the minimization moves. Simply running the standard protocol at a
lower temperature is likely to destroy this balance and prevent the system from
overcoming even trivially small barriers, thus drastically slowing the dynamics.
However, using our insights into the temperature dependence of the free energy
landscape it may be possible to devise a temperature ST protocol that could overcome
this roughness and reach the native state.
To test this hypothesis we have implemented a temperature ST version of the
full-atom Rosetta refinement protocol, as well as a variant of the standard protocol that
runs at a temperature of 0.1. For the ST variant we used a temperature range of 0.1 to
0.5 and a purely MC move set in order to obey detailed balance. Broad sampling
should be possible at a temperature of 0.5 because of the relative flatness of the
landscape, while at lower temperatures the native state should be favored.
Temperatures above 0.5 are not used because they would favor unfolding. The low
temperature variant allows us to ensure that any improvements seen with the ST
variant over the standard protocol are not simply the result of running at lower
temperatures. Both the standard and low temperature variants use the full set of MC
and minimization moves available in Rosetta.
Our ST variant is found to outperform both standard Rosetta and the low
temperature variant. For each of these three protocols we performed 100 runs starting
from a 5.7 Å structure, well beyond the radius of convergence, drawn from our
umbrella sampling simulations. Figure 39 shows our 5.7 Å starting structure alongside
protein G’s native state as a reference. Figure 40 shows histograms of the lowest
RMSD found in each run. One ST run reached an RMSD value of 4.8 Å and 37% of
the ST runs found structures with RMSD values lower than the initial configuration.
However, neither the standard protocol nor the low temperature variant was able to
find any structures with RMSD values less than that of the initial configuration. The
increased ability of our ST protocol to move towards the native state demonstrates that
utilizing explicit knowledge of the entropic contribution to the free energy may
improve structure prediction, even when the physical conformational entropy is not of
interest.
Figure 39. (A) The native structure of protein G and (B) the 5.7 Å starting structure used for comparing
the ST and Standard Rosetta variants.
Figure 40. Distribution of the minimum Cα RMSD values reached by 100 Simulated Tempering (ST)
and 100 standard Rosetta runs started from a 5.7 Å structure. Results for both the low temperature
and standard Rosetta variants were identical so only a single plot is shown.
PHYSICAL PERSPECTIVE ON ENERGETIC TERMS
A physical perspective may also be taken in order to evaluate and improve individual
energetic terms. For example, Rosetta’s hydrogen bonding term (187) is seen as a
critical component of the full-atom scoring function (199). While this term agrees with
quantum calculations (188), it has been found empirically that the hydrogen bonding
potential only helps discriminate between models within about 3 Å of the native state
(187).
We find that the hydrogen bonding term actually impedes the kinetics of
convergence while providing only a minor energetic advantage to near-native states
and, therefore, ultimately impedes rapid and accurate structure prediction. Figure 41
shows that the average hydrogen bonding energy is somewhat lower within about 3 Å
of the native state for protein G but not for EH. For both systems, however, the
average hydrogen bonding energy is basically flat relative to the total energy. Because
the average hydrogen bonding energy is flat, it does not necessarily provide any
guiding force to bias the system towards the native state.
Figure 41. Relative magnitude of the average hydrogen bonding energy (solid line) versus the total
average energy (dash-dot line) as a function of Cα RMSD for protein G and engrailed
homeodomain (EH).
Shmygelska and Levitt have reported that Rosetta’s hydrogen bonding
potential is better able to discriminate native from non-native states than the low-
resolution potentials (200). The most likely explanation for this apparent discrepancy
is that they weighted the hydrogen bonding term more heavily. During our simulations
the long-range hydrogen bonding term was weighted by a factor of one while the
short-range term was weighted by a factor of 0.5, following the protocol used by the
Baker group in CASP 7. If these terms were weighted more heavily relative to the rest
of the potential a stronger bias towards the native state could arise. For example, the
small dip we observe in the hydrogen bonding term for protein G could become quite
substantial. Comparing our results with those of Shmygelska and Levitt is also
complicated by the fact that they sampled the hydrogen bonding term in the context of
Rosetta’s less accurate low-resolution potentials while we have sampled it in the
context of the more accurate full-atom potential. A more extensive comparison of our
methods in the context of the full-atom potential is an interesting future direction.
We suggest that structure prediction potentials could possibly be improved by
avoiding such flat terms or reweighting them such that they provide a substantial
biasing force towards the native state. We note that proteins can have surprisingly fast
kinetics, with some small proteins folding on the microsecond time scale (57). One
outstanding question is whether it is even feasible to design a knowledge based
potential that can accurately identify protein native states and have kinetics that are
faster than physical kinetics. If not, physics based methods may actually be the fastest
algorithms for complex systems, as they may be able to take advantage of the
evolutionary optimization or the physical processes that shape the natural
kinetics of protein folding. Even if this is not the case, our results show that structure
prediction may benefit by taking advantage of ideas developed to better understand
folding mechanisms. Informatics approaches that incorporate more physical insights
into protein folding mechanisms are thus an interesting direction (205-207).
CONCLUSIONS
Our results demonstrate that explicitly accounting for conformational entropy and
considering the kinetics of convergence may improve structure prediction even if
physical conformational entropies and kinetics are not of interest. For example, by
understanding the interplay between energy and conformational entropy one can
choose an optimal temperature or set of temperatures to use for exploring
conformational space. By considering the kinetics of convergence one can ensure that
this space can be explored rapidly, resulting in computationally efficient structure
prediction protocols. An outstanding question is whether it is possible to design
knowledge-based potentials with better entropic and kinetic properties than our
physical reality. If not, physics based structure prediction may ultimately be necessary
for more complex systems. Whether or not this is the case, our results show that
structure prediction may benefit by taking advantage of ideas developed to better
understand folding mechanisms.
MATERIALS & METHODS
All structural representations were generated using VMD (67).
TEMPERATURE ST
Temperature ST (24, 25) simulations perform a random walk within a pre-determined
temperature set (T1, …, Tn). This is accomplished using an expanded Hamiltonian
$$H(X) = \beta_i E(X) - g_i$$
where $\beta_i = 1/(kT_i)$, E(X) is the energy (or score) of the current configuration (X), and
gi is the weight corresponding to Ti. At regular intervals the simulation attempts to
move either up or down in temperature space with equal probability. The probability
of accepting a given move is
$$P(i \to j) = \min\left(1,\; e^{-(\beta_j - \beta_i)E(X) + (g_j - g_i)}\right)$$
where P(i→j) is the probability of moving from Ti to Tj.
Our temperature ST simulations used a temperature list of 0.1, 0.15, 0.2, 0.3,
0.4, and 0.5 in arbitrary units internal to the Rosetta code and temperature exchanges
were attempted every 50 steps. All weights were determined using the Simulated
Tempering Equal Acceptance Ratio (STEAR) method (49). This method obtains an
initial estimate of the weights from short constant temperature simulations at each
temperature and then refines these weights in subsequent ST simulations before
holding them constant in the final data collection phase. Two iterations of weight
refinement consisting of 100 runs of 600,000 steps were performed for temperature ST
simulations, followed by 100 runs of 600,000 steps for data collection. In order to
maintain detailed balance the ST simulations only used MC moves in torsion space.
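The temperature move described above can be sketched in a few lines of Python. This is a minimal illustration, not the Rosetta implementation: the function names are hypothetical, k is set to 1 to match the arbitrary temperature units, and the weights passed in are placeholders rather than STEAR-derived values. It assumes the acceptance rule min(1, exp[-(βj - βi)E(X) + (gj - gi)]) and treats proposals off either end of the ladder as rejected.

```python
import math
import random

# Temperature ladder quoted in the text (arbitrary Rosetta-internal units).
TEMPERATURES = [0.1, 0.15, 0.2, 0.3, 0.4, 0.5]


def acceptance_probability(energy, i, j, weights, temperatures=TEMPERATURES, k=1.0):
    """min(1, exp(-(beta_j - beta_i) * E + (g_j - g_i))) for a move T_i -> T_j."""
    beta_i = 1.0 / (k * temperatures[i])
    beta_j = 1.0 / (k * temperatures[j])
    exponent = -(beta_j - beta_i) * energy + (weights[j] - weights[i])
    # A non-negative exponent means certain acceptance; this also avoids overflow.
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def attempt_temperature_move(energy, i, weights, rng=random, temperatures=TEMPERATURES):
    """Propose the neighbor above or below with equal probability; return the new index."""
    j = i + rng.choice((-1, 1))
    if j < 0 or j >= len(temperatures):
        return i  # assumption: proposals off the end of the ladder are rejected
    if rng.random() < acceptance_probability(energy, i, j, weights, temperatures):
        return j
    return i
```

In a full simulation this function would be called once every 50 MC steps, with the current score as `energy` and the refined weights as `weights`.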
RMSD ST
RMSD ST simulations perform a random walk amongst a predetermined set of
umbrellas constraining the system to a given RMSD from the native state without
changing the system’s temperature. In this case the expanded Hamiltonian and
probability of accepting a move are
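$$H(X) = \beta\left[E(X) + a\,(\mathrm{RMSD}_{current} - \mathrm{RMSD}_i)^2\right] - g_i$$
$$P(i \to j) = \min\left(1,\; e^{-\beta a\left[(\mathrm{RMSD}_{current} - \mathrm{RMSD}_j)^2 - (\mathrm{RMSD}_{current} - \mathrm{RMSD}_i)^2\right] + (g_j - g_i)}\right)$$
where $\beta = 1/(kT)$, E(X) is the energy of the current configuration (X), RMSDcurrent is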
the current RMSD from the native state, RMSDi is the center of umbrella i, and “a”
determines the strength of the spring constraining the system to a given umbrella.
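As a short worked step, the acceptance probability follows from the difference between the expanded Hamiltonians of umbrellas j and i evaluated at the same configuration. The derivation below assumes the expanded Hamiltonian has the form β[E(X) + a(RMSDcurrent - RMSDi)^2] - gi; the subscripted symbol H_i is introduced here only to distinguish the two umbrellas.

```latex
\begin{aligned}
H_j(X) - H_i(X)
  &= \beta\left[E(X) + a\,(\mathrm{RMSD}_{current} - \mathrm{RMSD}_j)^2\right] - g_j
   - \beta\left[E(X) + a\,(\mathrm{RMSD}_{current} - \mathrm{RMSD}_i)^2\right] + g_i \\
  &= \beta a\left[(\mathrm{RMSD}_{current} - \mathrm{RMSD}_j)^2
                - (\mathrm{RMSD}_{current} - \mathrm{RMSD}_i)^2\right] - (g_j - g_i), \\
P(i \to j) &= \min\left(1,\; e^{-\left[H_j(X) - H_i(X)\right]}\right).
\end{aligned}
```

Because the temperature is fixed, the βE(X) terms cancel, so a jump between umbrellas depends only on the change in the restraint term and on the weights.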
Our RMSD ST simulations used umbrellas centered at RMSD values from 0.5
to 10 Å at 0.5 Å intervals and jumps between neighboring umbrellas were attempted
every 50 steps. The “a” parameter was set to three. All weights were determined using
the Simulated Tempering Equal Acceptance Ratio (STEAR) method (49). This
method obtains an initial estimate of the weights from short umbrella simulations at
each umbrella (without any jumps between umbrellas) and then refines these weights
in subsequent RMSD ST simulations before holding them constant in the final data
collection phase. Three iterations of weight refinement consisting of 100 runs of
1,700,000 steps were performed for RMSD ST simulations, followed by 100 runs of
900,000,000 steps for data collection. In order to maintain detailed balance the RMSD
ST simulations only used MC moves in torsion space.
ROSETTA
For an overview of the Rosetta structure prediction algorithm and the command-line
options used in this study see reference (49). The full Rosetta move set was used for
standard Rosetta runs. The same number of moves was used when comparing standard
Rosetta runs with ST.
CHAPTER 9: STRUCTURAL INSIGHT INTO RNA HAIRPIN FOLDING
INTERMEDIATES
This chapter was taken from: Bowman GR, et al. (2008) Structural insight into RNA
hairpin folding intermediates. J Am Chem Soc 130:9676-9678.
ABSTRACT
Hairpins are a ubiquitous secondary structure motif in structured RNA molecules.
Despite their simple structure, there is some debate over whether they fold in a two-
state or multi-state manner. We have studied the folding of a small tetraloop hairpin
using a serial version of the replica exchange method on a distributed computing
environment. Based on these simulations we have identified a number of intermediates
that are consistent with experimental results. We also find that folding is not simply
the reverse of unfolding and suggest that this may be a general feature of biomolecular
folding.
INTRODUCTION
RNA hairpins are one of the most common secondary structure motifs, appearing in
almost every large RNA structure (208-210). In addition to serving as nucleation sites
for RNA folding (211), they may also guide RNA folding by forming tertiary contacts
(212, 213) and serve as recognition sites for RNA binding proteins (214). They are
potential drug targets (215), terminate transcription (211), and influence translation
through their role as aptamer domains in riboswitches (216). Despite the great variety
of functions they may serve, hairpins are one of the simplest RNA motifs, requiring
only monovalent ions to fold. Thus, understanding the folding of small RNA hairpins
is both a critical first step in understanding the folding of larger RNA molecules (215)
and amenable to computer simulation (217-219).
RNA hairpins consist of a primarily Watson-Crick base-paired stem capped
with a loop of unpaired or non-Watson-Crick base-paired nucleotides. Tetraloops,
such as the GCAA tetraloop (5’-GGGCGCAAGCCU-3’) examined in this work and
shown in Figure 42, have four such bases in their loop. This particular structure was
chosen due to its predominance in the ribosome (210).
Figure 42. (A) NMR structure of the GCAA tetraloop. (B) Contact map for the native state. Bases are
numbered from 5’ to 3’ and native base-pair contacts (dotted lines) are numbered 1-4.
Despite their simple structure there is some controversy over whether these
hairpins fold in a two-state or multi-state manner. The two-state hypothesis for nucleic
acid hairpins is primarily based on thermodynamic measurements. For example,
Ansari et al. found similar sigmoidal melting curves when they monitored all the base-
pairing interactions or a subset of fluorescently labeled nucleotides (220). The multi-
state hypothesis is based on kinetic measurements, such as FCS and T-jump
experiments. For example, Jung et al. found discrepancies between equilibrium
distributions from FCS and melting experiments (221). More recently, Ma et al. found
evidence of melting in T-jump experiments starting at temperatures above the melting
temperature (TM), indicating that the supposed unfolded state in melting experiments
is not completely unstructured (222, 223). These authors went on to propose an
intermediate state in which the ends of the hairpin are in contact but the base-pairing
and base-stacking interactions in the stem are not yet formed.
To determine if there is in fact an intermediate and, if so, what its structure is,
we have run Serial Replica Exchange Molecular Dynamics (SREMD) (177, 224)
simulations of the GCAA tetraloop depicted in Figure 42. Due to the heterogeneity of
the loop (225, 226) we have defined the native state as any conformation with all four
stem base-pair contacts formed, numbered as shown in Figure 42B. We refer to these
base-pair contacts as native contacts. Two nucleotides are considered to be contacting
if any two atoms, one from each nucleotide, fall within 3 Å of each other. Thus, a
structure can be well described by a contact map—a bit string specifying which
residues are in contact.
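The contact definition above can be illustrated with a small Python sketch. It is not the analysis code used in this work: the function names are hypothetical, each nucleotide is assumed to be given as a list of (x, y, z) atom coordinates in Å, "within 3 Å" is read as a distance of at most 3 Å, and every nucleotide pair is included (whether sequence-adjacent pairs were excluded is not stated in the text).

```python
import math
from itertools import combinations


def in_contact(atoms_a, atoms_b, cutoff=3.0):
    """True if any atom of one nucleotide lies within `cutoff` of any atom of the other."""
    return any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b)


def contact_map(nucleotides, cutoff=3.0):
    """Return a bit string with one bit per nucleotide pair (i < j), in a fixed order."""
    pairs = combinations(range(len(nucleotides)), 2)
    return "".join(
        "1" if in_contact(nucleotides[i], nucleotides[j], cutoff) else "0"
        for i, j in pairs
    )
```

Representing each conformation as a fixed-length bit string makes conformations directly comparable position by position.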
RESULTS & DISCUSSION
Previously, Sorin et al. studied the folding of this system using constant temperature
Molecular Dynamics (MD) and explicit solvent (217). While these studies provided
valuable insight into the folding of RNA hairpins, only 19 folding events were
observed within the thousands of simulations run. We have applied SREMD on the
Folding@home infrastructure to obtain better sampling and, therefore, greater insight
into RNA folding.
SREMD is a serial version of Replica Exchange Molecular Dynamics
(REMD) (22, 23), which induces the system to perform a random walk in temperature
space such that broad sampling is achieved at high temperature and detailed
exploration of free energy minima is achieved at low temperature. In REMD, multiple
simulations are run, each at a different temperature. A random walk in temperature
space is achieved by periodically attempting to swap the conformations at two
neighboring temperatures. The probability of accepting a swap is
$$P(i \to j) = \min\left(1,\; e^{(\beta_i - \beta_j)(U_i - U_j)}\right) \quad (1)$$
where P (i→j) is the probability of transitioning from temperature i (Ti) to temperature
j (Tj), βi is 1/ (kTi), and Ui is the potential energy of the conformation at Ti. Thus, the
detailed balance condition is satisfied. SREMD allows any number of asynchronous
simulations to be run, making it more suitable for distributed computing than standard
REMD (177). This is accomplished by providing each simulation with the Potential
Energy Distribution Function (PEDF) for each temperature. SREMD uses the same
criteria for swapping temperatures as REMD except that the energy of the current
conformation is compared to an energy randomly drawn from the neighboring
temperature’s PEDF rather than the energy from a parallel simulation. The simulation
parameters are described in detail in Appendix G.
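The difference between a standard REMD swap and the serial variant can be sketched as follows. This is an illustrative sketch rather than the Folding@home implementation: the function names are hypothetical, and the PEDF at each temperature is represented simply as a list of previously sampled potential energies, which is an assumption made for the example. The acceptance rule used is min(1, exp[(βi - βj)(Ui - Uj)]).

```python
import math
import random


def swap_probability(beta_i, beta_j, u_i, u_j):
    """REMD criterion: min(1, exp((beta_i - beta_j) * (U_i - U_j)))."""
    exponent = (beta_i - beta_j) * (u_i - u_j)
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def sremd_attempt(i, j, u_current, betas, pedf_samples, rng=random):
    """Attempt a move from temperature index i to neighbor j; return the new index.

    pedf_samples[j] is a list of potential energies standing in for the PEDF at T_j.
    """
    # The drawn energy replaces the energy of a parallel replica in standard REMD.
    u_j = rng.choice(pedf_samples[j])
    if rng.random() < swap_probability(betas[i], betas[j], u_current, u_j):
        return j
    return i
```

Because each simulation only needs the stored distributions, no synchronization between simulations is required at exchange time.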
We ran 2,800 SREMD simulations with an aggregate simulation time of 54.6
µs starting from the NMR structure (PDB code 1ZIH) (209). Even with this amount of
simulation, reversible folding was not achieved and we cannot claim to be at
equilibrium (203). However, we did observe 760 trajectories with a complete
unfolding event and 550 trajectories with a complete refolding event. Thus, we have
sufficient data to define the dominant states in the folding and unfolding pathways,
though we cannot give their relative probabilities. While SREMD will not give any
kinetic information directly, an analysis of the relevant thermodynamic states can
yield information about the states along the folding and unfolding pathways.
An unfolding event is defined as the set of conformations between the first
point with no contacts between any two residues on opposite sides of the stem and the
first preceding point with four native contacts. A refolding event is defined as the set
of conformations between the first point with no contacts between any two residues on
opposite sides of the stem and the first subsequent point where the number of native
contacts is four.
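One literal reading of these definitions is sketched below in Python. The function name and input format are hypothetical, and "first preceding point" is interpreted as the nearest earlier frame with four native contacts; other readings of the definitions are possible.

```python
def find_events(native_counts, cross_stem_counts, n_native=4):
    """Return (unfolding, refolding) as (start, end) frame-index pairs, or None.

    native_counts[t]     -- number of native base-pair contacts in frame t
    cross_stem_counts[t] -- number of contacts between residues on opposite
                            sides of the stem in frame t (native or not)
    """
    # First frame with no cross-stem contacts at all.
    unfolded = next((t for t, c in enumerate(cross_stem_counts) if c == 0), None)
    if unfolded is None:
        return None, None

    # Unfolding: walk backwards to the nearest fully native frame.
    start = next(
        (t for t in range(unfolded, -1, -1) if native_counts[t] == n_native), None
    )
    unfolding = (start, unfolded) if start is not None else None

    # Refolding: walk forwards to the first fully native frame.
    end = next(
        (t for t in range(unfolded, len(native_counts)) if native_counts[t] == n_native),
        None,
    )
    refolding = (unfolded, end) if end is not None else None
    return unfolding, refolding
```

The frames between each returned pair of indices are the conformations that would be passed on to the clustering step.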
We used Mapper (227, 228), a topological data analysis algorithm, to identify
the dominant states in the folding and unfolding pathways. For example, to understand
unfolding we applied the Mapper technique to conformations from unfolding events,
where the conformations were represented by contact maps. The Mapper clustering
technique works as follows. First, the similarity between each pair of conformations
was determined using the Hamming distance metric. The data set of interest was then
divided into overlapping subsets based on the density of configurations around each
conformation, allowing efficient identification of intermediate states with low
populations as well as folded/unfolded states with high populations. Single-linkage
clustering was carried out in each subset, facilitating the identification of non-convex
clusters. Finally, a graph was generated that represents the connectivity between
clusters in different density levels based on their degree of overlap. More details are
provided in the SI.
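Two of the ingredients named above, the Hamming distance between contact maps and single-linkage clustering, can be sketched in plain Python. This is not the Mapper algorithm itself: the density-based division into overlapping subsets and the construction of the connectivity graph are omitted, and the function names and the distance threshold are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length contact maps differ."""
    if len(a) != len(b):
        raise ValueError("contact maps must have the same length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage(maps, threshold):
    """Group contact maps so that any two maps within `threshold` share a cluster.

    Cutting a single-linkage dendrogram at `threshold` is equivalent to finding
    connected components of the graph linking maps at distance <= threshold.
    """
    parent = list(range(len(maps)))  # union-find parent array

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(maps)):
        for j in range(i + 1, len(maps)):
            if hamming(maps[i], maps[j]) <= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(maps)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Single linkage joins maps through chains of near neighbors, so a cluster may contain two maps that are farther apart than the threshold; this chaining is what allows non-convex clusters to be recovered.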
In SREMD, replicas visiting high temperatures lead to rapid unfolding. To better
understand this unfolding process, we first calculated the probability of having one,
two, or three native contacts during unfolding as shown in Figure 43A. This data
indicates that there is substantial breathing, with one or two base-pairs being broken
and reformed, but that complete unfolding quickly follows the breakage of three native
contacts. Further insight is provided by Figure 43C, where we show the probability of
each native contact given that a certain number of native contacts are present.
Apparently, unfolding has a single dominant pathway characterized by unzipping from
the end. This result is confirmed by Mapper, as shown in Figure 44. There is no cluster
corresponding to a single native contact due to the low probability of such structures.
Structures with three native contacts also appear to be absorbed into either the native
cluster or the cluster with only two base-pairs formed, probably due to the use of the
simple Hamming distance metric.
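The two kinds of probabilities discussed here (the probability of a given number of native contacts, and the probability of each native contact given that number) can be computed from a set of frames as in the sketch below. The function name and input format are hypothetical, and the normalization over all frames is an assumption; the figures may have been normalized differently.

```python
from collections import Counter


def contact_statistics(frames, n_native=4):
    """frames: iterable of tuples of 0/1 flags, one flag per native contact.

    Returns (p_count, p_contact_given_count):
      p_count[n]                  -- fraction of frames with exactly n native contacts
      p_contact_given_count[n][k] -- fraction of those frames in which contact k+1 is formed
    """
    frames = list(frames)
    by_count = Counter(sum(f) for f in frames)
    p_count = {n: by_count[n] / len(frames) for n in range(n_native + 1)}

    p_contact_given_count = {}
    for n in range(1, n_native):  # the cases n = 0 and n = n_native are trivial
        subset = [f for f in frames if sum(f) == n]
        if subset:
            p_contact_given_count[n] = [
                sum(f[k] for f in subset) / len(subset) for k in range(n_native)
            ]
    return p_count, p_contact_given_count
```

In an unzipping-from-the-end pathway, the conditional probabilities for two contacts would concentrate on the contacts nearest the loop, as in the toy input used to check the function.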
Figure 43. The probability of a given number of native contacts during (A) unfolding and (B) refolding.
(C) The probability of each contact when a given number of contacts are present during unfolding
and refolding with the arrows representing the direction of movement between the unfolded state
(U) and the folded state (F).
Figure 44. Contact maps representing the cluster centers from independent clustering of the unfolding
(A) and refolding data (B). The grey lines represent the connectivity of the states. The blue lines
represent native contacts with a probability of 0.6 or greater within the cluster. Intermediate
structures are labeled A-D.
Figure 43B shows that there is often a single contact present during refolding
but adding subsequent base-pairs becomes progressively less likely. Thus, there are
many nucleation events consisting of the formation of a single native contact but few
proceed to the folded state. Figure 43C again shows the probability of each contact
given that a certain number of contacts are present. When a single native contact is
present, it is most likely between the closing base-pair or the two ends, native contacts
1 and 4 respectively. The higher probability of native contact 1 is probably due to the
close spatial proximity of the two participating residues imposed by their close
proximity in the sequence. The higher probability of native contact 4 may be
explained by the lack of steric hindrance relative to the other native contacts. Once
two or three native contacts are formed each is more or less equally probable, which is
consistent with numerous models.
The results from Mapper shown in Figure 44 give more insight. The first step
is either the formation of the closing base-pair or the end base-pair. This is followed
by the formation of native contacts 1 and 2 and subsequent folding is dominated by
zipping. Presumably, the formation of the end base-pair facilitates the formation of
native contacts 1 and 2 by reducing the conformational space that needs to be
searched, as predicted by Ma et al. (222). The fact that the end base-pair does not
appear in the center of the cluster with two native contacts doesn’t mean it breaks as
folding proceeds, just that it does not occur frequently within the cluster. This is
consistent with the fact that about four times as many refolding events occur through
the pathway starting with the formation of native contact 1 as go through the pathway
starting with the formation of native contact 4. Once again, we note these relative
probabilities are not necessarily expected to be found in experimental studies due to
the random walk in temperature space our simulations undergo. However, these are
expected to be the two dominant pathways.
The two folding pathways observed here are consistent with the zipping and
compaction mechanisms observed by Sorin et al. (217) as well as experimental work
pointing to the presence of multiple folding pathways (215, 229). Furthermore, these
results support the hypothesis that the folding pathway of RNA hairpins has at least
three states. In particular, the collapsed structure with a single native contact between
the end base-pair is consistent with the intermediate structure proposed by Ma et al.
(222). However, the other clusters along the folding pathway with one, two, or three
native contacts formed may also contribute to the experimental signal. Full-atom
structures for each of these intermediates are shown in Figure 45. Reptation (defined
as the sliding of the two strands of the stem relative to one another) is not one of the
dominant folding pathways, in agreement with results for small β-hairpins (230).
Thus, it appears that misfolded states must unfold before refolding properly, although
we cannot discount the possibility that they may contribute to folding on longer
timescales than our simulations reach. Results from the unfolding analysis using
Mapper lend further support to this hypothesis. They include small clusters of reptated
structures between the folded and intermediate states (data not shown), consistent with
the idea that misfolding serves as an off-pathway trap that slows the overall folding
process (215, 220, 223, 231).
Figure 45. Representative full-atom structures for the intermediate states with labels (A)-(D)
corresponding to the labels A-D in Figure 44.
Another result of this work is that folding and high temperature unfolding
follow different pathways. We propose that this may be a general feature of hairpin
folding, due to the intrinsic similarities in the thermodynamic forces which stabilize
their structure. Furthermore, the amount of sampling we have achieved and the fact
that we have still not reached convergence call into question the results of shorter
REMD studies. Such simulations will be dominated by non-equilibrium unfolding,
which as we show here does not necessarily provide any insight into folding. Applying
measures of convergence, such as reversible folding or agreement between simulations
with different starting states, is critical for validating such studies.
CONCLUSIONS
The results presented here support recent work indicating that the folding of even the
smallest of RNA motifs is more complicated than previously suspected. We have
identified a number of folding intermediates consistent with experimental
observations. We also found multiple highly populated folding pathways but only a
single dominant unfolding pathway. Significant sampling was necessary to gain any
statistics on folding, indicating that shorter simulations are dominated by unfolding,
which differs from the folding pathway in this system. In future work we intend to
determine the sequence dependence of intermediate states and folding kinetics. Such
work will also provide more insight into whether or not folding and unfolding differ
for biomolecules in general.
CHAPTER 10: RAPID EQUILIBRIUM SAMPLING INITIATED FROM NON-
EQUILIBRIUM DATA
This chapter was taken from: Huang X, Bowman GR, Bacallado S, & Pande VS
(2009) Rapid equilibrium sampling initiated from nonequilibrium data. PNAS
106:19765-19769.
ABSTRACT
Simulating the conformational dynamics of biomolecules is extremely difficult due to
the rugged nature of their free energy landscapes and multiple long-lived, or
metastable, states. Generalized Ensemble (GE) algorithms, which have become quite
popular in recent years, attempt to facilitate crossing between states at low
temperatures by inducing a random walk in temperature space. Enthalpic barriers may
be crossed more easily at high temperatures; however, entropic barriers will become
more significant. This poses a problem because the dominant barriers to
conformational change are entropic for many biological systems, such as the short
RNA hairpin studied here. We present a new efficient algorithm for conformational
sampling, called the Adaptive Seeding Method (ASM), that uses non-equilibrium GE
simulations to identify the metastable states and seeds short simulations at constant
temperature from each of them to quantitatively determine their equilibrium
populations. Thus, the ASM takes advantage of the broad sampling possible with GE
algorithms but generally crosses entropic barriers more efficiently during the seeding
simulations at low temperature. We show that only local equilibrium is necessary for
ASM so very short seeding simulations may be used. Moreover, the ASM may be
used to recover equilibrium properties from existing datasets that failed to converge
and is well-suited to running on modern computer clusters.
INTRODUCTION
The functions of biological macromolecules are in large part determined by their
structure and dynamics. As such, many experimental techniques have been developed
and applied to probe these properties, each of which has its strengths and weaknesses.
Computational methods such as Molecular Dynamics (MD) and Monte Carlo (MC)
simulations have the potential to complement such experiments by modeling the
evolution of entire systems with atomic resolution. However, it is extremely difficult
to obtain equilibrium sampling of even moderately sized systems in atomic
simulations because of the rugged nature of the free energy landscapes that must be
explored. Without adequate sampling, it is impossible to validate the parameters, or
force fields, that determine the interactions between atoms or to address phenomena
that occur on biologically relevant timescales.
Many methods have been developed in an attempt to address the sampling
problem. Generalized Ensemble (GE) algorithms like Replica Exchange Method
(REM) (or Parallel Tempering) (22, 23) and Simulated Tempering (ST) (24, 25) are
popular approaches for studying biomolecular folding (26-28, 177, 232-238). They
attempt to overcome the sampling problem by inducing a random walk in temperature
space while maintaining canonical sampling at each temperature. At high temperatures
energetic barriers may be crossed easily while at low temperatures the system is
generally constrained to local minima. However, recent studies have shown that GE
simulations do not yield converged equilibrium sampling much faster than standard
constant temperature MD if the phenomena of interest are non-Arrhenius (27, 177,
238-243).
For example, Zuckerman et al. (240) used the Arrhenius equation to argue that
the maximum efficiency gain of GE simulations is no more than an order of
magnitude at physiological temperatures and Zheng et al. (241, 242) used a kinetic
network model to show that there is an optimal temperature for non-Arrhenius folding
kinetics and any time spent above this temperature will decrease the efficiency of GE
simulations. This lack of improvement is the result of the interplay between energy
and entropy. While high temperatures may facilitate the crossing of energetic barriers,
entropic barriers will be more difficult to cross (27).
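The interplay between energy and entropy can be made concrete with a simple rate expression. The form below is an illustrative assumption (a transition-state-like rate r with a temperature-independent prefactor A, activation enthalpy ΔH‡, activation entropy ΔS‡, and Boltzmann constant k_B), not the analysis used in the cited studies.

```latex
\begin{aligned}
r(T) &= A \exp\!\left(-\frac{\Delta G^{\ddagger}}{k_B T}\right),
\qquad \Delta G^{\ddagger} = \Delta H^{\ddagger} - T\,\Delta S^{\ddagger} \\
     &= A \exp\!\left(\frac{\Delta S^{\ddagger}}{k_B}\right)
          \exp\!\left(-\frac{\Delta H^{\ddagger}}{k_B T}\right)
\end{aligned}
```

Under these assumptions, raising T accelerates only the enthalpic factor. For a barrier with ΔS‡ < 0 the free energy barrier ΔG‡ grows linearly with T, and when ΔH‡ is close to zero the rate is essentially independent of temperature, so time spent at high temperature yields no speedup in crossing it.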
Thus, GE simulations will provide little improvement when the dominant
barriers are entropic. Hansmann and coworkers have made some effort to improve the
effectiveness of GE algorithms by optimizing the temperature spacing (244, 245).
However, these methods assume that diffusion in temperature space is the rate limiting
process so crossing entropic barriers in the conformational space, which is the true rate
limiting process, is still a problem. A number of other methods also exist, such as
umbrella sampling and milestoning (156). However, these methods require that the
dominant reaction coordinate is known a priori and this information is often
unavailable.
The sampling problem is exacerbated by the practice of viewing global
equilibration of individual trajectories as a requirement for considering a simulation to
have reached equilibrium. Global equilibration is most naturally obtained by running a
simulation much longer than the longest relaxation time of the system, so that all
degrees of freedom are equilibrated and many uncorrelated samples are generated
from each metastable state. For example, the reversible folding metric holds that a
simulation has reached equilibrium if there are multiple folding and unfolding events
(203). While this criterion is sufficient, it may not be necessary. Instead of requiring
global equilibration of individual trajectories, we suggest that local equilibration may
be sufficient. Local equilibration may be achieved by using multiple simulations, each
of which visits only a subset of the metastable states with their correct Boltzmann
probabilities but that together cover the entire accessible space. Local equilibration
may require significantly less wall-clock time because shorter simulations (all of
which may be run in parallel) are required. The main difficulty is to analyze multiple
simulations appropriately.
Markov State Models (MSMs) are a powerful tool which can be used to extract
equilibrium properties from a dataset that satisfies the local equilibration criterion.
MSMs partition phase space into metastable states such that intra-state transitions are
fast but inter-state transitions are slow (3, 6, 11, 33, 34). Such separation of timescales
ensures that the model is Markovian, that is, the probability of being in a given state at
time t+∆t, where ∆t is called the lag time, depends only on the state at time t. The key
point is to build a model with a lag time that is shorter than the timescale of the
process of interest with few enough states that it may be understood easily. Usually
MSMs are used to study kinetics, but here we only derive thermodynamic information
from them. In an MSM, the time evolution of a vector representing the population of
each state may be calculated by repeatedly left-multiplying by the transition
probability matrix.
$$P(n\Delta t) = [T(\Delta t)]^n P(0) \quad (1)$$
where P(n∆t) is a vector of state populations at time n∆t and T is the column-stochastic
transition probability matrix. The leading eigenvector of the transition matrix T (the right
eigenvector with eigenvalue one under this column-stochastic convention)
corresponds to the equilibrium distribution (6). This can be an advantage and a useful
opportunity, since obtaining kinetics from MSMs is challenging, but obtaining only
the equilibrium thermodynamic properties might be a less demanding goal as less
information is required. Indeed, we find that the populations of the dominant states are
invariant with respect to the lag time so very short simulations can be used.
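The propagation of state populations and the recovery of the equilibrium distribution can be illustrated with a small pure-Python sketch. The function names are hypothetical, the matrix is stored so that T[i][j] is the probability of moving to state i from state j (each column sums to one), and the two-state matrix in the usage note is a made-up example rather than data from this work.

```python
def propagate(T, p, n_steps):
    """Apply p <- T p n_steps times, with T[i][j] = P(state i at t + dt | state j at t)."""
    for _ in range(n_steps):
        p = [sum(T[i][j] * p[j] for j in range(len(p))) for i in range(len(T))]
    return p


def equilibrium(T, tol=1e-12, max_steps=100000):
    """Power iteration from a uniform start.

    Converges to the stationary populations (T pi = pi) when T is column-stochastic
    and has a single eigenvalue of modulus one.
    """
    n = len(T)
    p = [1.0 / n] * n
    for _ in range(max_steps):
        q = propagate(T, p, 1)
        if max(abs(a - b) for a, b in zip(p, q)) < tol:
            return q
        p = q
    raise RuntimeError("power iteration did not converge")
```

For example, with T = [[0.9, 0.2], [0.1, 0.8]] the populations relax towards two thirds in the first state and one third in the second, regardless of the starting vector.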
Here, we introduce the Adaptive Seeding Method (ASM) and show that it
rapidly yields converged thermodynamics even when faced with entropic barriers by
exploiting many simulations at local equilibrium. This is achieved by 1) using non-
equilibrium GE simulations to obtain broad sampling, 2) building a Markov State
Model (MSM) to identify all the metastable states as shown in Figure 46, 3) starting
new constant temperature simulations at the temperature of interest from each
metastable state in a process called seeding, and 4) using MSMs to extract the correct
equilibrium populations from the seeding simulations. Seeding short simulations from
the known equilibrium distribution of alanine dipeptide has been shown to yield good
models for its kinetics (246). A key advance in our new method is that it does not
require that the initial sampling has reached equilibrium. We note that many non-
equilibrium GE datasets have been generated due to the difficulty in reaching
equilibrium and that there is growing interest in recovering equilibrium properties
from such datasets(247). Thus, one strength of the ASM is that steps 2-4 may be used
to recover the correct equilibrium thermodynamic properties from a non-equilibrium
dataset. Furthermore, this procedure may be iterated and combined with adaptive
sampling (161) to most efficiently use one’s computational resources, i.e. using the
fewest and shortest trajectories necessary to achieve a good model, since minimizing
wall clock time is an important consideration for computer simulations.
Figure 46. A schematic free energy landscape with three representative seeding trajectories started from
each basin and a projection of this free energy landscape onto a 2D plane showing the division into
metastable states.
To test the ASM we apply it to a small biomolecular system with long time
scale dynamics: an eight nucleotide RNA hairpin (5’-GCUUUUGC-3’) known as the
UUUU tetraloop. Hairpins are a fundamental RNA secondary structure motif(208) and
perform many biologically relevant functions but our understanding of their folding is
still incomplete.(52, 217, 220, 222, 223) The folding of this hairpin is diffusion
controlled(215, 220, 248), so despite its small size the folding time is on the µs
timescale, as measured by laser temperature jump experiments(223). Thus, capturing a
single folding event with a single MD simulation with explicit solvent would likely
take more than a year on a typical CPU. ASM, however, is able to reach converged
equilibrium sampling within a week using many short parallel simulations, as judged
by agreement on the populations of metastable states between distinct sets of
simulations started from very different initial configurations. ASM is also found to
yield converged thermodynamic properties with at least six times less sampling than
GE simulations for this system. Finally, the fact that the most highly populated
metastable state has a well-formed two base pair stem, as in the NMR structure,
provides some validation of the force field. Since there is no analytical solution for the
equilibrium distribution of our RNA hairpin system, we also studied a 2D potential
where the equilibrium populations can be computed analytically. Using this model, we
confirm that ASM is much more efficient than ST, and also provide some guidelines
for choosing the optimal number and length of the seeding simulations.
RESULTS & DISCUSSION
COMPARISON OF ASM TO ST
Here we compare the results of our long ASM procedure with an equivalent amount of
ST sampling, as depicted schematically in Figure 47. We ran two distinct sets of
simulations starting from a near-native and random-coil configuration respectively, as
shown in Figure 81. Thus, we are able to judge the convergence of our results by
comparing these two datasets.
Figure 47. Schematic of the adaptive seeding scheme. The top arrow represents our ST trajectories,
which are split into equilibration (green) and production (light blue) phases. The light red and light
yellow boxes encompass our long and short adaptive seeding schemes respectively. For each
adaptive seeding scheme, the dotted lines demark the portion of the ST data used to identify the
dominant thermodynamic, or metastable, states by building an MSM (S). Constant temperature (or
canonical, NVT) simulations are then started from each state and used to build a new MSM (E) that
captures the equilibrium distribution. Both the light yellow and red boxes also encompass a portion
of the original ST data that is equivalent to the amount of sampling used in the adaptive seeding
scheme. An MSM is also built for this data and used as a baseline for judging the efficiency of the
adaptive seeding scheme.
The first step was to run an independent set of one thousand 18 ns ST
simulations starting from each initial configuration to obtain broad sampling. During
an initial equilibration phase (first 9 ns) the weights were updated using the Simulated
Tempering Equal Acceptance Ratio (STEAR) method(49, 177) described in Appendix
H. This procedure was found to give nearly equal sampling of each temperature and
converged weights for each dataset (Appendix H). During the subsequent 9 ns
production phase the weights were held constant. These two sets of ST simulations do
not reach converged sampling because of their short length (data not shown), but they
should be able to reach all the metastable states.
To identify the metastable states we built an independent MSM from the
production phase of each dataset. First, all the conformations from every temperature
were divided into a large number of small sets of very structurally similar, and
therefore likely kinetically similar, conformations called microstates using a
hierarchical K-medoids clustering algorithm as described in Appendix H. We then
used spectral clustering (249, 250) (PCCA (6, 44, 45)) refined with simulated
annealing to lump microstates that can interconvert rapidly into larger states called
metastable states while conformations separated by large free energy barriers are
grouped into different states, as depicted in Figure 46. This algorithm was developed
by Chodera et al. (6) and is also described in the SI. This procedure yielded six states
for each dataset.
To obtain equilibrium sampling we then seeded simulations from each
metastable state. Specifically, 100 random conformations were chosen from each
metastable state and used as starting points for 10 ns constant temperature MD
simulations at 300K. The equilibrium distribution was extracted by building a new
MSM. A common state definition is necessary in order to compare different datasets
so this MSM was built using all the seeding data. Populations with error bars for each
independent dataset were then determined under the same state definition using a
Bayesian method developed by Noé.(251) Figure 48A shows that the populations from
each seeding dataset, as well as the combined data, are in strong agreement and are
therefore converged to the equilibrium distribution. Populations for an equivalent
amount of folded and coil ST data (19 ns) were also calculated by considering only
those conformations at 300 K. These two ST datasets have not converged yet. In
particular, there is a relatively obvious difference in the populations of states 2 and 4
(about 10% and 7% respectively).
Figure 48. Population of each state (bar graphs correspond to the mean values, and error bars stand for
standard deviations) for (A) the long adaptive seeding scheme (lag time t=4.5 ns) and (B) the short
adaptive seeding scheme (lag time t=4.5 ns).
MSMs are usually used to study kinetics(3). In order to get a reasonable
number of states and ensure that the model is Markovian, a relatively long lag time
must be used, though it generally ought to be shorter than the timescale of the process
of interest. Furthermore, to get accurate kinetics each simulation must be at least a few
times longer than the lag time so that multiple crossings of each barrier may be
observed. For example, Chodera et al. show that a twenty state MSM for the folding
of the helical Fs-peptide (which occurs on a timescale of tens of nanoseconds) requires
a lag time of five nanoseconds.(6) Thus, obtaining accurate kinetics for the UUUU
tetraloop, which folds on a microsecond timescale, should likely require orders of
magnitude longer simulations than for the Fs-peptide. However, obtaining accurate
thermodynamics may require significantly less sampling. In particular, short lag times
where the system is not Markovian may still be sufficient to estimate thermodynamic
properties. In fact, Figure 49 shows that the equilibrium populations of each state are
identical within statistical error regardless of the lag time. Similar observations have
been made by Hummer and coworkers who found that the free energy profile for a
water dewetting transition can be predicted using a very short lag time at which the
kinetics are not reproduced well(11). In addition to the error due to non-Markovian
effects, the statistical error due to insufficient sampling of transition events will also
be smaller for thermodynamic properties. In a model with N states there are only N
thermodynamic parameters to determine whereas getting accurate kinetics requires
determining all N² pairwise transition probabilities. Sampling all possible transitions
over-determines the free energy differences between states. Thus, obtaining accurate
thermodynamics may require significantly less sampling.
Figure 49. Population of each state for the long adaptive seeding scheme as the lag time is varied.
MINIMIZING THE SIMULATION LENGTH
To push the limits of the ASM we repeated the above procedure using drastically less
data (See short ASM in Figure 47). Ten times less data was used for equilibration, six
times less ST data was used to identify the states, and the seeding simulations were
half as long. To maximize our use of this minimal data we combined the folded and
coil ST data to identify the metastable states used for seeding. Figure 48B shows the
populations obtained from this procedure compared to a reference distribution from
our long ASM runs and an equivalent amount of ST data started from both folded and
coil states. All these populations were determined using the previous state definition.
The populations from these short ASM runs were found to be in agreement with the
previously determined equilibrium distribution whereas the ST data deviated
significantly from equilibrium. To determine the limits of ASM we also performed the
same analysis using both fewer and shorter trajectories. First we held the seeding
trajectory length constant at 5 ns and varied the number of trajectories initiated from
each state, finding that as few as 70 trajectories from each state gave reasonable
agreement with the reference distribution. We also held the number of seeding
trajectories started from each state constant at 100 and varied the trajectory length,
finding that as little as 2 ns long seeding simulations gave reasonable agreement with
the reference. Thus, the ASM reaches equilibrium at least six times faster than
ST. These results demonstrate that the ASM is significantly more efficient than GE
simulations for sampling conformational changes that are diffusion controlled, as in
hairpin folding.
To address any concerns about the validity of our reference distribution, we
also studied a simple model where the equilibrium populations can be computed
analytically. The model is based on a discrete-state system introduced by
Zwanzig(252) as a simple model for protein folding (see Appendix H for details).
There are four metastable states in the system (folded, unfolded and two intermediate
states), among which the folded state is favored energetically, while the unfolded state
is favored entropically (see Figure 85). This is an attractive system for testing ASM
because it has non-Arrhenius folding kinetics, i.e. the folding rate decreases with
temperature (see Figure 87).
We compared the efficiency with which ST and ASM reach the equilibrium
state populations as a function of the length and number of trajectories. As shown in
Figure 92, ASM converges to the correct distribution with 4-7 times shorter
simulations than ST. We suggest that seeding simulations longer than the slowest
intra-macrostate equilibration time should always be sufficient for convergence. In
practice, however, much shorter simulations may be used as discussed before. When
using shorter simulations one should test that independent sets of simulations started
from different configurations converge to the same distribution and that the
equilibrium distribution is invariant with respect to the lag time. We also found that
using more than 200 trajectories does not increase the efficiency of ST whereas ASM
continues to scale favorably with the number of trajectories up to 600 trajectories in
this example. The optimal number of simulations to run depends on one’s tolerance
for statistical error. Currently an equal number of simulations are seeded from each
state. In the future, however, adaptive sampling [31] could be used to start an optimal
number of simulations from each metastable state to further optimize the efficiency of
this method.
There are a number of factors contributing to the improved efficiency of ASM.
By using short GE simulations to identify the metastable states, the ASM is able to
exploit the ability of GE simulations to rapidly cross energetic barriers while avoiding
the penalty incurred at high temperatures for entropic barriers by using seeding
simulations at low temperatures. Furthermore, only short seeding simulations are
necessary because only local, not global, equilibration is required, due to the use of
MSMs. Global equilibration metrics like reversible folding require that each
simulation is long enough to cross every barrier multiple times. Local equilibration,
however, may be obtained with many short simulations run in parallel because each
run only has to be long enough to cross a single barrier. By using MSMs to identify
the metastable states we can initiate seeding simulations from uncorrelated
conformations within every metastable state and thereby ensure every barrier is
crossed.
The ASM also has limitations. For example, the initial sampling has to be
broad enough to identify all the metastable states. Failure to do so will quickly become
apparent as some states will be populated in one dataset but not in another. This
situation may be remedied by iterating the ASM: that is, seeding ST simulations from
each state to obtain broader sampling, building an MSM to identify the metastable
states, and performing new constant temperature seeding simulations. In addition,
seeding simulations at physiological temperatures are only able to cross barriers on the
order of a few kT. However, this should be sufficient for most biological systems.
Finally, we note that the random selection of initial configurations from each state
may lead to some error if the seeding simulations are not long enough. In the future,
this method might be improved by choosing initial configurations from an equilibrium
distribution prepared within each metastable state(54).
EXAMINING THE STATES
Figure 48B shows that the short folded ST data spent a disproportionate amount of
time in state 2 while the coil ST data spent a disproportionate amount of time in state
4. Based on this result, we hypothesized that state two is the native state and that state
four is a random coil state. To test this hypothesis we extracted representative
structures for each state. The representative structure for each state is the configuration
with the greatest density of nearby conformations (mathematically this is the
conformation with the minimal RMSD to every other conformation in the state).
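The selection of a representative structure can be sketched as follows. We assume here that "minimal RMSD to every other conformation" means the smallest summed pairwise RMSD within the state; the function name `representative_index` and the small distance matrix are hypothetical.

```python
def representative_index(rmsd):
    """Return the index of the conformation whose summed RMSD to all
    other conformations in the state is smallest.

    rmsd is a symmetric matrix of pairwise RMSDs with zeros on the diagonal.
    """
    totals = [sum(row) for row in rmsd]
    return min(range(len(rmsd)), key=totals.__getitem__)


# Hypothetical pairwise RMSDs (in Å) for four conformations in one state.
rmsd = [
    [0.0, 1.0, 2.0, 5.0],
    [1.0, 0.0, 1.5, 4.0],
    [2.0, 1.5, 0.0, 4.5],
    [5.0, 4.0, 4.5, 0.0],
]

center = representative_index(rmsd)
```

In this toy example conformation 1 is chosen because it lies closest, on aggregate, to the other members of the state.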
Figure 50 shows the representative structures for each state. In fact, state two is
the folded state, having a well-formed two base pair stem. Our ability to identify this
state without including any knowledge of the native state and the fact that it is the
most populated state lends credibility to the force field used, AMBER99(60).
Furthermore, state four is a random coil. The other states represent various collapsed
non-native states. For example, state 1 has native-like base stacking interactions but no
clear base pairing between the two sides of the stem. State 3 has interactions between
bases 1 and 8 as well as 2 and 7, but they are stacking interactions instead of base
pairing interactions. These results are consistent with both experimental and
computational work showing that small RNA hairpins have folding intermediates with
contacting end residues but without well-formed stems.(52, 222) Fully validating the
force field will require longer simulations to get accurate kinetic predictions and more
extensive comparisons with experimental observables.
Figure 50. Representative structure for each of the six metastable states. The numbering is the same as
in Figures 48 and 49.
CONCLUSIONS
We have introduced the Adaptive Seeding Method (ASM) and shown that it samples
significantly more efficiently than GE simulations, which have found widespread use
in studying biological systems(27, 28, 219, 237), for a simple 2D potential and an RNA
hairpin. The ASM takes advantage of the broad sampling possible with GE methods
but can more effectively cross entropic barriers using constant temperature
simulations. Moreover, by requiring local equilibration rather than global equilibration
only relatively short simulations are necessary and these simulations may be run in
parallel, rendering the calculation particularly well suited to modern computing
clusters. MSMs are then used to extract global equilibrium populations from these
short simulations. Besides serving as an efficient sampling algorithm, the ASM also
may be used to recover equilibrium properties from non-equilibrium datasets. Thus,
the ASM holds great promise for validating force fields and bridging the gap between
experimental and computational timescales.
In the future, we plan to apply the adaptive seeding method to larger systems.
We also hope to explore alternative sampling methods for identifying initial states. For
example, coarse-grained simulations could be used to identify the dominant states of a
system and seed all-atom MD simulations that would elucidate the atomic details of
the free energy surface. Alternatively, implicit solvent simulations run at low viscosity
could be used to rapidly identify the dominant states and seed explicit solvent
simulations to provide more accuracy. Finally, adaptive sampling(161) with longer
simulations may be used to obtain accurate kinetics from MSMs.
MATERIALS & METHODS
Two distinct sets of ST simulations were run: one started from a folded state and the
other from a random coil. An independent MSM was then built for each dataset to
identify the dominant metastable states. We used MSMBuilder(10) to build each MSM.
First, conformations were split into a large number of microstates using a
hierarchical K-medoids clustering algorithm with the all heavy atom RMSD as the
distance metric (e.g. we generated 1,597 microstates for long ASM seeding runs).
Kinetically related microstates were then lumped together using PCCA (6, 44, 45).
One hundred random conformations were then chosen from each state and used as
starting points for constant temperature 300K MD simulations, still maintaining two
distinct sets of simulations. New MSMs were built from these constant temperature
datasets. A Bayesian method (251) was used to calculate the populations of each state
with error bars and the models were compared based on these values. The original ST
simulations were also extended to match the sampling of the constant temperature
simulations. State populations with error bars for these long ST runs were computed
using bootstrapping and compared to the populations from the constant temperature
simulations. More details are available in the SI.
APPENDIX A: ESTIMATING TRANSITION MATRICES AND EQUILIBRIUM
DISTRIBUTIONS
Given our simulation data and assignments thereof to states, it is necessary to estimate
the transition probability matrix and the corresponding equilibrium distribution. We
have experimented with a number of such methods, all of which give results that are
similar to within error for this data set. However, this property should not be assumed
of other data sets a priori.
First, we show the standard method for estimating the transition probability
matrix T(τ) (or just T for simplicity). The entries of T are the probabilities of
transitions from state i to state j in time τ, that is, Tij = P(i→j). To estimate this, let Cij
= C(i→j) be the number of observed transitions from i to j. Then a reasonable estimate
(a maximum likelihood estimate) is Tij=Cij / Ci, where
$$C_i = \sum_j C_{ij} \qquad (A1)$$
is the number of observed transitions starting in state i.
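The standard estimate can be sketched in a few lines. The helper names `count_transitions` and `row_normalize`, and the two short state-assignment trajectories, are hypothetical; the sketch assumes that states are labeled by integers and that transitions are counted with a sliding window at the chosen lag.

```python
def count_transitions(trajs, n_states, lag=1):
    """Build the count matrix C, where C[i][j] is the number of observed
    transitions from state i to state j separated by `lag` frames."""
    C = [[0] * n_states for _ in range(n_states)]
    for traj in trajs:
        for a, b in zip(traj[:-lag], traj[lag:]):
            C[a][b] += 1
    return C


def row_normalize(C):
    """Maximum likelihood estimate T[i][j] = C[i][j] / C_i, with C_i the row sum."""
    T = []
    for row in C:
        total = sum(row)
        T.append([c / total if total else 0.0 for c in row])
    return T


# Two hypothetical trajectories of state assignments.
trajs = [[0, 0, 1, 1, 0], [1, 1, 1, 0, 0]]
C = count_transitions(trajs, n_states=2)
T = row_normalize(C)
```

Rows with no observed counts are left as zeros in this sketch; how to treat such states is a modeling decision that the text does not specify.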
To estimate the equilibrium distribution of T, one merely has to find the
stationary eigenvector of T. Under ideal conditions (if the model is ergodic and
irreducible) (253), the stationary eigenvector e is unique and can easily be computed
by repeated multiplication of some initial probability density by T, as in Equation A1.
Similarly, one could use standard eigenvalue routines to find the eigenvector
corresponding to an eigenvalue of 1.
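A minimal sketch of the repeated-multiplication approach is given below, assuming a row-stochastic T (rows sum to one, as in the estimate Tij = Cij / Ci) and a row vector of populations. The function name `stationary_distribution`, the convergence tolerance, and the two-state matrix are illustrative assumptions.

```python
def stationary_distribution(T, tol=1e-12, max_iter=100000):
    """Iterate e <- e T from a uniform start until e stops changing.

    T is row-stochastic: T[i][j] is the probability of i -> j in one lag time.
    """
    n = len(T)
    e = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(e[i] * T[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, e)) < tol:
            return new
        e = new
    raise RuntimeError("repeated multiplication did not converge")


# Hypothetical two-state transition matrix.
T = [[0.9, 0.1],
     [0.2, 0.8]]
e = stationary_distribution(T)
```

For large or nearly reducible models the iteration can converge slowly, in which case a standard eigenvalue routine is likely the better choice.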
A possible problem with the standard estimate for T is that the resulting model
might not satisfy detailed balance
$$e_i T_{ij} = e_j T_{ji} \qquad (A2)$$
where ei is the equilibrium probability of state i. The naïve solution to this is to
symmetrize the count matrix by adding its transpose, which amounts to including the
counts that would have arisen from viewing the simulations in reverse. Clearly this
procedure is inappropriate for situations not at equilibrium; nonetheless, we sometimes
find this procedure useful for equilibrium data due to its ease. Furthermore, if the
underlying count matrix is symmetric, one can show that the equilibrium distribution
can be obtained simply by dividing the number of observations in each state by the
total number of observations.
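The symmetrization shortcut can be illustrated as follows. The count matrix and the helper names `symmetrize` and `populations_from_symmetric_counts` are hypothetical; the sketch simply adds the transpose and reads the populations off the row sums.

```python
def symmetrize(C):
    """Add the transpose, i.e. include the counts from time-reversed trajectories."""
    n = len(C)
    return [[C[i][j] + C[j][i] for j in range(n)] for i in range(n)]


def populations_from_symmetric_counts(S):
    """For a symmetric count matrix, the equilibrium population of state i
    is its row sum divided by the total number of counts."""
    total = sum(sum(row) for row in S)
    return [sum(row) / total for row in S]


# Hypothetical observed counts for a two-state model.
C = [[90, 10],
     [6, 44]]
S = symmetrize(C)
pi = populations_from_symmetric_counts(S)

# The row-normalized symmetric counts give a transition matrix for which
# pi is stationary and detailed balance holds.
T = [[s / sum(row) for s in row] for row in S]
```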
A somewhat more complicated procedure to ensure reversibility is using a
maximum likelihood estimate constrained to the set of models satisfying detailed
balance. To achieve this, assume that we are given the observed count matrix C. By
exploiting the equivalence between this count matrix and a random walk on an edge-
weighted undirected graph (160), we then estimate an additional count matrix, X,
which we require to be symmetric. We compute X by maximizing the likelihood of X
given C; this assumption gives a set of equations that allow the self-consistent
calculation of X. More formally, if C is the observed counts, and X is a symmetric
matrix that approximates C, then the likelihood is
$$L(X \mid C) = \prod_{i,j} \left( \frac{X_{ij}}{X_i} \right)^{C_{ij}} \qquad (A3)$$
Maximizing the likelihood yields the following equation, which we solve by self-
consistent iteration,
$$X_{ij} = \frac{C_{ij} + C_{ji}}{\dfrac{C_i}{X_i} + \dfrac{C_j}{X_j}} \qquad (A4)$$
where Ci and Xi are defined as the row sums of C and X, respectively, as in Equation
A1. In our experience, this method works but it can be slow for the large matrices we
consider. Furthermore, statistical noise in the count data can dominate the resulting
equilibrium distribution and even cause the self-consistent iterations to diverge.
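A minimal sketch of the self-consistent iteration follows, assuming the update X_ij = (C_ij + C_ji) / (C_i / X_i + C_j / X_j) with C_i and X_i the row sums of C and X. The function name `reversible_counts`, the starting guess (the symmetrized counts), the stopping rule, and the example counts are our own illustrative choices; the sketch also assumes every state has at least one observed count.

```python
def reversible_counts(C, n_iter=10000, tol=1e-12):
    """Iterate the self-consistent update for the symmetric matrix X."""
    n = len(C)
    c_row = [sum(row) for row in C]
    # Starting guess: the symmetrized count matrix.
    X = [[float(C[i][j] + C[j][i]) for j in range(n)] for i in range(n)]
    for _ in range(n_iter):
        x_row = [sum(row) for row in X]
        new = [[(C[i][j] + C[j][i]) / (c_row[i] / x_row[i] + c_row[j] / x_row[j])
                for j in range(n)] for i in range(n)]
        change = max(abs(new[i][j] - X[i][j]) for i in range(n) for j in range(n))
        X = new
        if change < tol:
            break
    return X


# Hypothetical observed counts for a two-state model.
C = [[90, 10],
     [6, 44]]
X = reversible_counts(C)
T_rev = [[x / sum(row) for x in row] for row in X]
pi = [sum(row) / sum(map(sum, X)) for row in X]
```

Any two-state chain is reversible, so in this toy case the constrained estimate coincides with the unconstrained estimate Tij = Cij / Ci, which makes it a convenient sanity check. No convergence guarantee is implied for noisy, many-state count matrices.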
A final method is that of Bacallado et al. (254), which uses Bayesian inference
with a prior on the space of matrices satisfying detailed balance. This method is
formally the most sound, as it uses Bayesian inference and includes a powerful prior.
However, it is much more computationally demanding than the other methods. Thus,
this method was also applied to the data in order to assess the validity of the simpler
methods.
We find that the four methods mentioned above give similar results for the
underlying equilibrium distribution of this dataset, indicating that we have achieved
equilibrium sampling. As such, we have used the naïve method of symmetrizing the
matrix due to its computational efficiency (and the fact that we have so much data
that our data set is very close to having reached equilibrium). However, in general, we
stress that either the maximum likelihood or Bayesian methods should be used.
APPENDIX B: THE POSSIBILITY OF LONGER TIMESCALES THAN THE
IMPLIED TIMESCALES
Here we show a simple model demonstrating that the rates for transitioning between
some states in an MSM under a two-state assumption (as used in the maximum
likelihood approach of Ensign et al. (58)) may be slower than the implied timescales.
First we define a four state system that satisfies detailed balance
$$T(\tau) = \begin{pmatrix} 0.998 & 0.001 & 0.001 & 0.000 \\ 0.001 & 0.998 & 0.000 & 0.001 \\ 0.050 & 0.000 & 0.949 & 0.001 \\ 0.000 & 0.001 & 0.050 & 0.949 \end{pmatrix}$$
This system is depicted in Figure 51A.
The eigenvalues of this system are 1, 0.997, 0.95559, and 0.94141 and
we will assume a lag time of 1 in arbitrary units. Thus, disregarding the eigenvalue of
one corresponding to the equilibrium distribution, there are three implied timescales:
332.785, 22.0139, and 16.5627.
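The conversion from eigenvalues to implied timescales, t = -τ / ln(λ), can be sketched as below. The eigenvalues are the rounded values quoted in the text rather than a fresh diagonalization, so the slowest timescale is reproduced only to about three significant figures. The names `implied_timescales`, `eigenvalues`, and `lag_time` are hypothetical.

```python
import math

# Four-state transition matrix from this appendix (rows sum to one).
T = [
    [0.998, 0.001, 0.001, 0.000],
    [0.001, 0.998, 0.000, 0.001],
    [0.050, 0.000, 0.949, 0.001],
    [0.000, 0.001, 0.050, 0.949],
]

eigenvalues = [1.0, 0.997, 0.95559, 0.94141]  # rounded values quoted in the text
lag_time = 1.0                                # arbitrary units


def implied_timescales(eigvals, lag):
    """Implied timescale -lag / ln(lambda) for every eigenvalue below one."""
    return [-lag / math.log(lam) for lam in eigvals if lam < 1.0]


timescales = implied_timescales(eigenvalues, lag_time)

# Consistency check: the eigenvalues of a matrix sum to its trace.
trace = sum(T[i][i] for i in range(len(T)))
```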
We can write the probability of transitioning between two states as
$$p = 1 - e^{-1/\omega} \qquad (B1)$$
where ω is the average timescale for the transition (this notation deviates from the
standard notation of τ but avoids confusion with the lag time). Rearranging, we find
$$\omega = -\frac{1}{\ln(1 - p)} \qquad (B2)$$
Plugging our transition probabilities into this equation we arrive at the average
timescales for transitioning between each pair of states shown in Figure 51B. Many of
these timescales are as high as 1,000 units, much greater than the largest implied
timescale of ~332 units. In principle, one could monitor these average timescales,
resulting in apparent timescales longer than the implied timescales of the system.
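The two-state relation between a transition probability and its average timescale can be evaluated directly. This sketch assumes a lag time of 1 in arbitrary units, as in the text; the function name `average_timescale` is hypothetical.

```python
import math


def average_timescale(p, lag=1.0):
    """Average timescale omega such that p = 1 - exp(-lag / omega)."""
    return -lag / math.log(1.0 - p)


# The two distinct nonzero off-diagonal probabilities in the model system.
omega_slow = average_timescale(0.001)
omega_fast = average_timescale(0.050)
```

The edges with probability 0.001 correspond to average timescales of roughly 1,000 units, about three times the largest implied timescale.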
Figure 51. Graph depiction of the model system defined in Appendix B with edges labeled by A) their
probability and B) their average timescale under a two-state assumption.
APPENDIX C: SUPPORTING INFORMATION FOR CHAPTER 3
MOLECULAR DYNAMICS SIMULATION
Distributed molecular dynamics simulations on GPUs were performed using an
accelerated version of GROMACS (255) written specifically for GPUs (80) using the
Folding@Home platform (79). The AMBER ff96 (60) forcefield was used with the
generalized Born/surface area (GBSA) implicit solvent model of Onufriev, Bashford
and Case (81). AMBER ff96 has been reported to have more accurate secondary
structure propensities when used with GBSA (82). Up to 10,000 parallel simulations
(each with randomized Boltzmann-distributed initial velocities) were simulated at
300K, 330K, 370K and 450K, from several different initial starting states. Due to the
nature of distributed computing, in which uncoupled simulations are used to produce
successive trajectory segments, a broad distribution of trajectory lengths is obtained
(see Figure 11b in the main text). Stochastic integration was performed using a time
step of 2 fs and Berendsen temperature coupling. A water-like solvent (shear)
viscosity of 91 ps⁻¹ was used, with full O(N²) electrostatic and vdW interactions.
Hydrogen bond lengths were constrained using the SHAKE algorithm. Trajectory
snapshots were recorded every 1 ns.
Starting conformations for the native state of NTL9 were taken from the
crystal structure 1DIV, and steepest-descent minimized for 5000 steps. (Minimization
was done using the GBSA model of Still et al. (256).) Five starting conformations for
the random coil ensemble were taken from snapshots of Monte Carlo trajectories in
which dihedral angles were randomized under a potential rewarding compact Rg. The
dihedral probabilities came from the TOP500 database (257). Starting conformations
for extended structures were constructed by setting dihedral angles to their canonical
values.
MARKOV STATE MODEL (MSM) CONSTRUCTION
We used the MSMBuilder package (10), modified to use sparse matrices (4), to
construct an MSM for NTL9(1-39). First, 100,000 microstates were generated by
clustering conformations separated by 10 ns. The remaining 90% of the data was then
assigned to these clusters. The resulting microstates had an average radius of ~4.5 Å,
where the radius of a cluster is defined as the largest distance between any
conformation in that cluster and the cluster center. The implied timescales were then
calculated for lag times from 1 to 32 ns at 4 ns intervals and found to level off at ~12
ns (Figure 52a), implying a 12 ns Markov time. Finally, we generated a macrostate
model (Figure 53) by lumping microstates into 2,000 macrostates using the PCCA+
algorithm (44) and verified that the implied timescales still leveled off on a similar
timescale (Figure 52b).
We have confirmed the statistical accuracy of our equilibrium populations for
the 2,000 state model using a Bayesian method (251). This analysis reveals that the
statistical uncertainty in the population of any state ranges from 0.2% to 2% of that
state's population (0.7% on average). Unfortunately, it is not possible to rigorously
address any systematic error in our model without an independent data set to compare
to.
TRANSITION PATH THEORY (TPT) ANALYSIS
Many of our calculations are modeled on those in (3, 87, 258). However, we have
chosen a slightly different algorithm for decomposing the reactive flux into individual
pathways. Given a folded state B, an unfolded state A, and the matrix of net reactive
flux F, our greedy backtracking decomposition works as follows:
1. Start at the folded state B. Label this state x1.
2. Choose the state whose net flux into x1 is maximal.
3. Next, choose the state x2 such that the net flux from x1 to x2 is maximal.
We repeat this process for each state xn-1, choosing the next state xn such that the flux
from xn-1 to xn is maximal, until we reach state xn = A.
Upon completion, we have produced a series of states (x1, … , xn) defining a
pathway. We define the flux along this pathway as the minimum of the fluxes,
min(F(xi → xi+1)). We then subtract this flux from each of the pathway's edges in the
original flux matrix. Finally, we repeat the same algorithm on the new flux matrix to
produce additional pathways.
The result of this algorithm is a set of pathways and their associated fluxes.
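One possible reading of this decomposition is sketched below. We assume that F[i][j] holds the net reactive flux from state i to state j, that the net flux graph has no cycles, and that the walk proceeds backwards from the folded state B, at each step moving to the state with the largest remaining net flux into the current state, until the unfolded state A is reached. The function name `greedy_pathways`, the tolerance, and the four-state flux matrix are hypothetical.

```python
def greedy_pathways(flux, source, sink, tol=1e-12):
    """Decompose a net flux matrix into (pathway, flux) pairs.

    flux[i][j] is the net flux from i to j; source is the unfolded state A
    and sink is the folded state B. Assumes an acyclic net flux graph.
    """
    F = [row[:] for row in flux]  # work on a copy
    n = len(F)
    pathways = []
    while True:
        path = [sink]
        while path[-1] != source:
            cur = path[-1]
            # Predecessor with the largest remaining flux into the current state.
            prev = max(range(n), key=lambda i: F[i][cur])
            if F[prev][cur] <= tol:
                break  # no flux left to backtrack along
            path.append(prev)
        if path[-1] != source:
            return pathways
        path.reverse()
        edges = list(zip(path, path[1:]))
        bottleneck = min(F[i][j] for i, j in edges)
        for i, j in edges:
            F[i][j] -= bottleneck  # remove this pathway's flux
        pathways.append((path, bottleneck))


# Hypothetical net fluxes: state 0 is A (unfolded), state 3 is B (folded).
F = [[0.0, 0.30, 0.10, 0.00],
     [0.0, 0.00, 0.05, 0.25],
     [0.0, 0.00, 0.00, 0.15],
     [0.0, 0.00, 0.00, 0.00]]
paths = greedy_pathways(F, source=0, sink=3)
```

Each pass removes at least one edge entirely, so the number of pathways is bounded by the number of edges carrying flux. In this toy network the first pathway returned is 0 → 1 → 3, which carries the largest flux.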
STRUCTURAL ANALYSIS OF MACROSTATE ENSEMBLES
Because macrostate conformational ensembles can be somewhat heterogeneous and
diffuse, we used a metric that quantifies the extent of native-like structure without
using predetermined reaction coordinates or requiring artificial thresholds for native
contacts, which we call the Q-value.
For each macrostate, we define a vector c(x) indexed by x = (i, j), denoting a
contact between residues i and j. The entries of c(x) are continuous (non-integer)
values between 0 and 1, representing the fraction of the ensemble for which the alpha-
carbons of residues i and j are closer than 8Å. We will call c(x) a contact profile. We
define the Q-value of a given c as its projection onto the contact profile of the “native”
macrostate (state n), cnat.
The Q-value for the “native” macrostate (state n) is unity, and less native-like contact
profiles will have lower Q-values. Because a contact profile can only contain entries
between 0 and 1, Q is always positive.
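A contact profile and a Q-value can be sketched as follows. The normalization used here, Q = (c · cnat) / (cnat · cnat), is an assumption on our part: it is one projection that gives unity for the native macrostate, and other normalizations would also satisfy that property. The function names, the inclusion of all residue pairs i < j, and the toy coordinates are likewise hypothetical.

```python
import math


def contact_profile(ensemble, cutoff=8.0):
    """Fraction of conformations in which each residue pair (i, j), i < j,
    has alpha-carbons closer than `cutoff` (in Å).

    ensemble is a list of conformations, each a list of (x, y, z) tuples.
    """
    n_res = len(ensemble[0])
    profile = {}
    for i in range(n_res):
        for j in range(i + 1, n_res):
            close = sum(1 for conf in ensemble
                        if math.dist(conf[i], conf[j]) < cutoff)
            profile[(i, j)] = close / len(ensemble)
    return profile


def q_value(c, c_nat):
    """Projection of c onto c_nat, normalized so that q_value(c_nat, c_nat) is 1.
    This normalization is an assumption, not taken from the text."""
    norm = sum(v * v for v in c_nat.values())
    return sum(c[x] * c_nat[x] for x in c_nat) / norm


# Hypothetical three-residue ensembles (alpha-carbon coordinates in Å).
native = [[(0, 0, 0), (5, 0, 0), (10, 0, 0)],
          [(0, 0, 0), (5, 0, 0), (7, 0, 0)]]
partial = [[(0, 0, 0), (5, 0, 0), (20, 0, 0)]]

c_nat = contact_profile(native)
q_native = q_value(c_nat, c_nat)
q_partial = q_value(contact_profile(partial), c_nat)
```

Restricting the sums to a subset of contacts, for example those between two strands, would give the element-specific Q-values under the same assumed normalization.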
Moreover, we also define Q-values for particular structural elements by
restricting a contact profile to a particular subspace of contacts. For example, Qβ12 is
the Q-value when c is restricted to a subspace where x ∈ β12, a set of contacts
corresponding to pairings between beta-strands β1 and β2. We examined Q-values for
three native structural elements: Qα, Qβ12, and Qβ13, based on the subsets of the
“native” (state n) contact profile (Figure 54). For clarity, we call the Q-value for the
entire set of contacts Qtotal.
ANALYSIS OF STATES ALONG FOLDING PATHWAYS: COMPARISON
BETWEEN SECONDARY STRUCTURE FORMATION AND REACTION
PROGRESS (PFOLD)
How heterogeneous are the possible pathways for folding? One way to examine this
question is to compare the secondary structure formed in a given state versus its
position along the reaction pathway. In Figure 55 and Figure 56, we use a simple
metric to plot the secondary structure bias, namely the difference between alpha
helical and beta sheet contacts Qα – (Qβ12 + Qβ13)/2 of a given state and compare this
to the position of the state along the reaction pathway as determined by its committor
or pfold value. From these figures, it can be seen that 1) the “unfolded” state (a)
contains residual native-like helical propensity, and 2) pathways involving various
ordering of native-like helix and sheet formation are possible.
The contact profiles (see Figure 55) for these states demonstrate the existence
of non-native contacts in some states as well as the fact that certain contacts are
present more commonly in a given state. For example, states h, i, j, and k all have a
mixture of some contacts which are very prevalent (dark black) and some which are
only partially formed (light gray to gray), whereas state g has fewer contacts, all of
which are prevalent. This heterogeneity, even within a single state, highlights the
ensemble nature of this form of analysis and shows both how structured a given state
is and which parts of it are structured.
Finally, we find that the natural independent folding units (i.e. the alpha helix, β12,
and β13) are structured to varying degrees in any given state. This is shown in
Figure 57, where the states and pathways differ significantly in the secondary
structure formed.
HOW DOES NTL9 FOLD IN OUR SIMULATIONS?
In order to understand how NTL9 folds, a natural approach is to analyze the pathways
found in terms of existing theories for protein folding. The highest-flux pathways in
our mesoscopic model are a→m→n and a→l→n. Both pathways are direct routes
from disordered to highly-structured macrostates, reminiscent of a nucleation-
condensation mechanism (259). This picture is consistent with the cooperative two-
state kinetics observed in stopped-flow re-folding experiments (78). While these
pathways show concomitant formation of helix and hairpin structures, the intervening
states l and m differ (mostly) in the β12 hairpin registration (see Figure 57). The large
pfold values of states l and m, and their obligate presence in the two highest-flux
pathways from a to n, suggest that, to some extent, states l, m and n can be
considered a very native-like “molten globule”, in which the details of tertiary
arrangement are sorted out after overcoming the main barrier to folding. Kinetics
between such metastable states would be difficult to detect experimentally using a
single fluorescent reporter, and in the nucleation-condensation view such events might
be described using the encompassing term “condensation”.
At the same time, the structural diversity along the folding pathways we
analyze corresponds well with many models of hierarchical folding. In general, the
macrostates with low pfold values have a baseline of native helicity, with the full
extent of native beta-sheet structure occurring later in the folding reaction (Figure 56).
This is consistent with the idea that local structures such as helices form early, with
non-local structures such as beta-hairpins and beta-sheets forming later in the reaction.
Macrostates b through f (which have low pfold values and are involved early in the
folding reaction) contain a variety of distinct non-native structural elements,
particularly non-native hairpin and sheet arrangements (see Figure 13, and Figure 55).
This is reminiscent of hierarchical mechanisms such as diffusion-collision (260) where
competing ‘foldons’ (86) form as kinetically metastable units, and are cooperatively
stabilized when in a native-like arrangement. The heterogeneous sequences of
secondary structure formation in pathways a→h→k→m→n (in which the central helix
forms first) versus pathway a→g→l→n (in which the hairpin structure forms first)
suggest that independent folding units can form and coalesce in any order.
We stress that there need not be a single pathway or single, dominant
mechanism for folding. Moreover, the various theories proposed for how proteins fold,
such as a diffusion-collision or nucleation-condensation mechanism, are based on
physical principles broadly relevant for proteins. Therefore, it is natural to imagine
that multiple mechanisms could be simultaneously present, but that the sequence of
the protein, coupled with the chemical environment (solvent conditions, temperature,
pH, etc), would control the balance of the degree to which each mechanistic pathway
is seen.
Figure 52. (a) Implied timescales for a series of 100,000-microstate Markov State Models (MSMs)
built at lag times between 1 and 32 ns. As the longest timescale levels off beyond a lag time of 10
ns, a lag time of 12 ns was chosen to build subsequent MSMs. The spectral gap present at all lag
times indicates apparent two-state folding kinetics. (b) The implied timescales for a 2000-
macrostate model built by lumping states from the microstate MSM show a similar spectral gap
and leveling off of time scales. The faster implied timescales of the macrostate model at short lag
times are due to lumping effects. (c) The 10 slowest implied timescales for the 2000 state models,
with error analysis from a bootstrapping procedure. Error bars represent the standard deviation
from the bootstrap analysis.
Figure 53. A scatter plot of the 2000 macrostates obtained by lumping the 100,000-state MSM
calculated from the simulation data at 370K. The RMSD-to-native is calculated using the peptide
backbone residues, with respect to the native starting state. The free energy of each microstate i is
computed as –kT ln (pi /p0), where pi is the equilibrium probability of the microstate, and p0 is an
arbitrary reference (in this case, max(pi)). Shown in red are the 14 macrostates transited by the top
ten pathway fluxes, labeled with the same letters as in Figure 13. In this mesoscopic view, we find
that 1) the macrostates are diffuse collections of conformational states, 2) there are multiple folding
pathways along these metastable states, and 3) we can identify highly populated “native” (state n)
and “unfolded” (state a) macrostates that dominate the observed relaxation rates. The red arrow is
meant to guide the eye in illustrating a “mesoscopic” view of the transition state barrier: the
“unfolded” state (a) and “native” state (n) are at free energy minima, while intermediate RMSD
values have macrostates with higher free energies.
Figure 54. Contact profile subspaces used to calculate Qα, Qβ12, and Qβ13, which quantify the extent of
native-like structuring for helix formation, beta-strand 1 and 2 pairing, and beta-strand 1 and 3
pairing, respectively.
Figure 55. Contact profiles (see definition above) for the 14 macrostates involved in the top ten
folding pathways, plotted in a similar fashion to Figure 13. For clarity, the pathway arrows have
been removed. Each contact profile is a 39 x 39 matrix of inter-residue contacts, showing the
contact fraction on a linear grayscale from 0 (white) to 1 (black).
Figure 56. Values of Qα (yellow), Qβ12 (red), and Qβ13 (blue) plotted in a bar graph for each of
the 14 macrostates involved in the top ten folding pathways. The layout is similar to that of
Figure 13.
Figure 57. Macrostates l, m and n (the “native” state) have very similar structural ensembles and similar
pfold values (pfold > ~0.93). To examine the subtle differences in their macrostate contact profiles,
we computed difference contact profiles for (l-m), (n-l) and (n-m) transitions. These difference
maps reveal that these states differ mostly in their hairpin registrations and packing of the hairpin
loop.
APPENDIX D: SUPPORTING INFORMATION FOR CHAPTER 4
VILLIN MSM
LUMPING INTO MACROSTATES
To identify metastable states in villin, we lumped kinetically related microstates into
500 macrostates (all having self-transition probabilities >0.5) using the PCCA+
algorithm (20, 261). Figure 58 shows the implied timescales for this macrostate MSM.
While they are somewhat shorter than at the microstate level (4), their leveling off at
lag times of 10-15 ns indicates that the model is Markovian at these timescales (6, 34).
Mean first passage times (MFPTs) between pairs of states can then be
calculated as in Ref (161). The equilibrium probability of each state can be obtained
by normalizing the first eigenvector of the transition probability matrix. Finally, the
relative entropy between two MSMs is calculated as in Ref (18)
\[
D(P \,\|\, Q) = \sum_{i,j}^{N} P_i \, P_{ij} \log \frac{P_{ij}}{Q_{ij}}
\]
where Pi is the equilibrium probability of state i, Pij is the probability of transitioning
from state i to state j during one lag time, N is the number of states, P is the reference
model, and Q is a test model (in this case generated from a subset of the data).
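The relative entropy between a reference model and a test model can be sketched in a few lines of Python. The sketch below assumes natural logarithms, treats terms with Pij = 0 as contributing nothing, and returns infinity when the test model assigns zero probability to a transition present in the reference; these conventions, like the function name `relative_entropy` and the two-state matrices, are assumptions for illustration only.

```python
import math


def relative_entropy(pop, P, Q):
    """D(P||Q) = sum_ij pop[i] * P[i][j] * log(P[i][j] / Q[i][j]).

    pop: equilibrium probabilities of the reference model
    P:   reference transition probability matrix (list of rows)
    Q:   test transition probability matrix (list of rows)
    """
    total = 0.0
    for i, p_i in enumerate(pop):
        for p_ij, q_ij in zip(P[i], Q[i]):
            if p_ij == 0.0:
                continue  # 0 * log(0 / q) is taken to be 0
            if q_ij == 0.0:
                return math.inf  # test model forbids a transition seen in the reference
            total += p_i * p_ij * math.log(p_ij / q_ij)
    return total


# Hypothetical two-state example.
P_ref = [[0.9, 0.1], [0.2, 0.8]]
Q_test = [[0.8, 0.2], [0.3, 0.7]]
pop_ref = [2.0 / 3.0, 1.0 / 3.0]  # stationary distribution of P_ref

d = relative_entropy(pop_ref, P_ref, Q_test)
```

The value is zero when the two models are identical and grows as the test model departs from the reference.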
Figure 64 shows the relative entropy for varying numbers of simulations (up to
40,000) of a given length (up to 400 nanoseconds). This figure highlights the fact that
simulations that are too few and too short are less valuable than a single long
simulation with an equivalent aggregate amount of data, but that the simulation length
at which this breakdown occurs decreases as the number of simulations increases.
BARRIER HEIGHTS
As further confirmation of the relevance of our macrostates, we have developed a
simple Bayesian approach for estimating the free energy barrier heights between states
under the simplifying assumption that we can examine pairs of connected states
independently. To begin with, we make the following definitions:
C       transition count matrix with Cij giving the number of observed i→j transitions
Ci      the number of counts originating in state i
P       transition probability matrix obtained by normalizing each row of C (Pij = Cij / Ci)
ΔG‡ij   barrier height for i→j transitions
νij     attempt frequency for i→j transitions
nij     number of attempted i→j transitions
kij     rate of i→j transitions
kB      Boltzmann’s constant
T       temperature (300 K for this work)
τ       lag time of the MSM
The quantity we wish to obtain is the posterior distribution over barrier heights
given our data (the count matrix, C). However, to account for the attempt frequency
we must begin with
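\[
P(\Delta G^{\ddagger}_{ij}, \nu_{ij} \mid C) = \frac{P(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij}) \, P(\Delta G^{\ddagger}_{ij}, \nu_{ij})}{P(C)}
\]
where the equality comes from applying Bayes’ rule. We can then integrate out the
attempt frequency to obtain
\[
P(\Delta G^{\ddagger}_{ij} \mid C) = \sum_{\nu_{ij}} \frac{P(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij}) \, P(\Delta G^{\ddagger}_{ij}, \nu_{ij})}{P(C)}
\]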
We now assume that the barrier height and attempt frequency are independent
and assign them each a uniform prior.
\[
P(\Delta G^{\ddagger}_{ij}, \nu_{ij}) = P(\Delta G^{\ddagger}_{ij}) \, P(\nu_{ij}), \qquad
P(\Delta G^{\ddagger}_{ij}) = c, \qquad
P(\nu_{ij}) = c
\]
where c is a constant.
We can also put bounds on the attempt frequency from our observed data by
recognizing that the number of attempts is at least as many crossing events as were
observed and no greater than the number of observed crossings plus the number of
self-transitions.
\[
\frac{C_{ij}}{C_i \tau} \le \nu_{ij} \le \frac{C_{ij} + C_{ii}}{C_i \tau}
\]
where the denominator gives the total time spent in state i and the numerator gives the
number of attempts at transitioning from state i to state j. Using nij to denote the
number of attempts at transitioning from state i to state j, we can also write
\[
\nu_{ij} = \frac{n_{ij}}{C_i \tau}
\]
Using these priors and bounds, we obtain
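\[
\begin{aligned}
P(\Delta G^{\ddagger}_{ij} \mid C)
&= \sum_{\nu_{ij}} \frac{P(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij}) \, P(\Delta G^{\ddagger}_{ij}) \, P(\nu_{ij})}{P(C)} \\
&= \frac{c^{2}}{P(C)} \sum_{n_{ij} = C_{ij}}^{C_{ij} + C_{ii}} P(C \mid \Delta G^{\ddagger}_{ij}, n_{ij})
\end{aligned}
\]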
Given a particular value of the attempt frequency (or equivalently, the number of
attempts) we can write
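\[
P(C \mid \Delta G^{\ddagger}_{ij}, n_{ij}) = \binom{n_{ij}}{C_{ij}} P_{ij}^{\,C_{ij}} \left(1 - P_{ij}\right)^{n_{ij} - C_{ij}}
\]
where \(\binom{n_{ij}}{C_{ij}}\) denotes “n choose c”. Using the simple rate equation
\[
P_{ij} = 1 - \exp(-k_{ij} \tau)
\]
and, from transition state theory,
\[
k_{ij} = \nu_{ij} \exp\left(-\frac{\Delta G^{\ddagger}_{ij}}{k_B T}\right)
\]
we finally obtain
\[
\begin{aligned}
P(C \mid \Delta G^{\ddagger}_{ij}, n_{ij})
&= \binom{n_{ij}}{C_{ij}}
\left[1 - \exp\left(-\frac{n_{ij}}{C_i} e^{-\Delta G^{\ddagger}_{ij}/k_B T}\right)\right]^{C_{ij}}
\left[\exp\left(-\frac{n_{ij}}{C_i} e^{-\Delta G^{\ddagger}_{ij}/k_B T}\right)\right]^{n_{ij} - C_{ij}} \\
&= \binom{n_{ij}}{C_{ij}}
\left[1 - \exp\left(-\frac{n_{ij}}{C_i} e^{-\Delta G^{\ddagger}_{ij}/k_B T}\right)\right]^{C_{ij}}
\exp\left(-\frac{n_{ij}\,(n_{ij} - C_{ij})}{C_i} e^{-\Delta G^{\ddagger}_{ij}/k_B T}\right)
\end{aligned}
\]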
The denominator, P(C), is then obtained by normalization.
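A minimal numerical sketch of this posterior is given below. It assumes a binomial likelihood with a per-lag-time crossing probability of 1 − exp(−(n/Ci) exp(−ΔG‡/kBT)), sums over the allowed numbers of attempts n from Cij to Cij + Cii, and normalizes on a finite grid of barrier heights. The grid bounds (0 to 15 kT), the function name `barrier_posterior`, and the example counts are assumptions made for illustration; the result depends on the grid chosen because the prior is uniform.

```python
import math


def barrier_posterior(c_ij, c_ii, c_i, grid):
    """Posterior over barrier heights (in units of kT) evaluated on a grid.

    c_ij: observed i->j transitions, c_ii: observed self-transitions,
    c_i:  total counts originating in state i.
    """
    weights = []
    for dg in grid:
        likelihood = 0.0
        # Sum over the allowed number of attempted crossings.
        for n in range(c_ij, c_ij + c_ii + 1):
            # Crossing probability per lag time, with k * tau = (n / C_i) * exp(-dG / kT).
            p = 1.0 - math.exp(-(n / c_i) * math.exp(-dg))
            likelihood += math.comb(n, c_ij) * p ** c_ij * (1.0 - p) ** (n - c_ij)
        weights.append(likelihood)
    norm = sum(weights)  # plays the role of P(C) on this grid
    return [w / norm for w in weights]


# Hypothetical counts for one pair of connected states.
grid = [0.1 * k for k in range(151)]  # assumed range: 0 to 15 kT
post = barrier_posterior(c_ij=3, c_ii=1000, c_i=1006, grid=grid)
mean_barrier = sum(dg * w for dg, w in zip(grid, post))
```

The expected barrier height for a transition is then the posterior-weighted mean over the grid, as computed in the last line.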
Using these equations gives a posterior distribution of barrier heights for every
possible transition. Since there are thousands of possible transitions, it is impractical to
examine them all. Instead, we have calculated the expected barrier height for each
transition. The mean expected barrier height is 5.9 (+/- 2.5) kT, indicating that most of
the states are potentially detectable (separated by reasonable barriers) and that the
distribution of barrier heights is quite broad.
As a consistency check, it is also possible to solve for the barrier height in
terms of the observed counts and attempt frequency.
\[
\Delta G^{\ddagger}_{ij} = k_B T \ln\left[\frac{n_{ij} / C_i}{-\ln\left(1 - C_{ij} / C_i\right)}\right]
\]
One can then plug in the lower and upper bounds for the attempt frequency and ensure
that these encompass the Bayesian result.
SIMPLE MODELS
TRANSITION COUNT MATRICES
The transition count matrices for simple models S, P, and H (CS, CP, and CH
respectively) are
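\[
C_S = \begin{pmatrix}
6{,}000 & 3 & 0 & 0 & 0 & 0 \\
3 & 1{,}000 & 3 & 0 & 0 & 0 \\
0 & 3 & 1{,}000 & 3 & 0 & 0 \\
0 & 0 & 3 & 1{,}000 & 3 & 0 \\
0 & 0 & 0 & 3 & 1{,}000 & 3 \\
0 & 0 & 0 & 0 & 3 & 90{,}000
\end{pmatrix}
\]
\[
C_P = \begin{pmatrix}
6{,}000 & 2 & 2 & 0 & 0 & 0 \\
2 & 1{,}000 & 0 & 2 & 2 & 0 \\
2 & 0 & 1{,}000 & 2 & 2 & 0 \\
0 & 2 & 2 & 1{,}000 & 0 & 2 \\
0 & 2 & 2 & 0 & 1{,}000 & 2 \\
0 & 0 & 0 & 2 & 2 & 90{,}000
\end{pmatrix}
\]
\[
C_H = \begin{pmatrix}
3{,}500 & 2 & 0 & 0 & 0 & 0 \\
2 & 1{,}000 & 2 & 0 & 0 & 2 \\
0 & 2 & 1{,}000 & 2 & 0 & 2 \\
0 & 0 & 2 & 1{,}000 & 2 & 2 \\
0 & 0 & 0 & 2 & 3{,}500 & 0 \\
0 & 2 & 2 & 2 & 0 & 90{,}000
\end{pmatrix}
\]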
where the entry in row i and column j gives the number of transitions observed from
state i to state j. State 0 is unfolded, 1-4 are intermediates, and 5 is the native state.
To generate synthetic simulations from a transition count matrix we first
normalize each row to obtain a transition probability matrix. At each time step (or
each lag time), the next state is chosen according to the distribution of transition
probabilities for the current state.
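The procedure for generating synthetic simulations can be sketched as follows. The function names (`row_normalize`, `simulate`, `first_folding_time`), the random seed, and the small three-state count matrix are hypothetical and chosen only to keep the example short; the matrix is not one of the models S, P, or H.

```python
import random


def row_normalize(counts):
    """Turn a transition count matrix into a transition probability matrix."""
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total for c in row])
    return probs


def simulate(P, start, n_steps, rng):
    """Generate one synthetic trajectory, taking one step per lag time."""
    states = list(range(len(P)))
    traj = [start]
    for _ in range(n_steps):
        # Choose the next state from the transition probabilities of the current state.
        traj.append(rng.choices(states, weights=P[traj[-1]])[0])
    return traj


def first_folding_time(traj, native):
    """Step at which the native state is first visited, or None if it never is."""
    return traj.index(native) if native in traj else None


# Hypothetical three-state count matrix; state 2 plays the role of the native state.
C_toy = [[60, 3, 0],
         [3, 10, 3],
         [0, 3, 900]]
P_toy = row_normalize(C_toy)
traj = simulate(P_toy, start=0, n_steps=2000, rng=random.Random(0))
t_fold = first_folding_time(traj, native=2)
```

Repeating the simulation many times and collecting `t_fold` gives a distribution of first folding times that can be binned into a histogram.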
FOLDING SIMULATIONS
MFPTs from the unfolded state(s) to the native state, given in Table 1, were calculated
following Ref (161). The distribution of first folding times was determined by first
running 10,000 simulations of 50,000 steps each started from state 0 (for model H half
the simulations were started from state 4). The first folding time of each simulation
was then calculated and these values were plotted as a histogram with 100 bins. The
lag phase was determined by finding the first folding time with the maximum
probability. Exponential fits were calculated by fitting to the first 50 bins after the lag
phase (to avoid noise in less populated bins at longer first folding times). The
exponential fits and lag phases are also given in Table 1. Similar results were obtained
by randomizing matrix elements while maintaining the network topology, subject to
the constraints of detailed balance and metastability. Example matrices include
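\[
C_{S,\mathrm{rand}} = \begin{pmatrix}
6{,}000 & 3 & 0 & 0 & 0 & 0 \\
3 & 3{,}000 & 3 & 0 & 0 & 0 \\
0 & 3 & 1{,}000 & 3 & 0 & 0 \\
0 & 0 & 3 & 500 & 3 & 0 \\
0 & 0 & 0 & 3 & 1{,}000 & 3 \\
0 & 0 & 0 & 0 & 3 & 90{,}000
\end{pmatrix}
\]
and
\[
C_{H,\mathrm{rand}} = \begin{pmatrix}
3{,}500 & 51 & 0 & 0 & 0 & 0 \\
51 & 1{,}000 & 68 & 0 & 0 & 70 \\
0 & 68 & 1{,}000 & 17 & 0 & 92 \\
0 & 0 & 17 & 1{,}000 & 97 & 42 \\
0 & 0 & 0 & 97 & 3{,}500 & 0 \\
0 & 70 & 92 & 42 & 0 & 90{,}000
\end{pmatrix}
\]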
Figure 58. Implied timescales for the villin macrostate MSM.
Figure 59. Distribution of MFPTs between all pairs of non-native states for villin (A) on a linear scale
to demonstrate the peak does not shift significantly relative to the distribution shown in Figure 18B
and (B) on a log scale to highlight that the tail of the distribution does extend to about 60 ns.
Figure 60. Distributions of the MFPTs (A) from each non-native state to the native state and (B)
between every pair of non-native states for our 2,000 state NTL9(1-39) model. As discussed in Ref
(93), further refinement of this model is likely necessary. However, we do not expect the
qualitative trend of long timescales (relative to folding) for transitioning between unfolded states to
change.
Figure 61. Two conformations from different unfolded basins demonstrating the structural
heterogeneity of non-native states (especially in their non-native contacts) that, in combination with
the vastness of conformational space, result in slow transitions between unfolded states. The
structures are colored red to blue from the N-terminus to the C-terminus. Atoms for residues Arg
14, Trp 23, and Lys 32 are shown to highlight that 23 and 32 are in contact on the left while the
chain has rearranged such that 14 and 32 are in contact on the right. These images were made with
VMD (67).
Figure 62. Relaxation of the fraction folded starting from equally populated unfolded states (black is
data and blue is single exponential fit with τ≈810 ns). The beginning of the curve is dominated by
single exponential relaxation but deviations from this apparent two-state behavior become apparent
later.
Figure 63. Relaxation of the fraction unfolded for a villin model at the microstate level (thick black
line) and a biexponential fit (thin blue line) with time constants of ~60 and ~415 ns, at least
qualitatively consistent with time constants of ~70 and ~720 ns from experiment (56). We hope to
explain this behavior in a future work on villin. As in Ref. (4), the native state was defined as all
microstates with an average Cα RMSD to the crystal structure less than 3 Å.
Figure 64. The distance to the gold-standard model, measured via the relative entropy, for 40,000
trajectories up to 400 nanoseconds in length. The black lines are contours of equal amounts of
data. Again, there was insufficient data to resolve the upper right-hand corner of the plot.
Model   Exponential Fit   MFPT     Lag Phase
S       12,800            13,400   2,500
P       4,500             5,000    1,000
H       3,300             3,600    800
Table 1. Exponential fits, MFPTs, and lag phases (all in units of steps) for transitioning from the
unfolded state(s) to the native state in the three simple models.
APPENDIX E: SUPPORTING INFORMATION FOR CHAPTER 5
SIMULATION DETAILS
Six initial starting conformations covering a range of 0 to 13 Å Cα RMSD to the
crystal structure were drawn from replica exchange simulations in implicit solvent
from Bill Swope and Jed Pitera at the IBM Almaden Research Center (136). These
conformations were energy minimized using a steepest-descents algorithm in the
Gromacs simulation package (43) with the AMBER03 force field (60). They were
then solvated in TIP3P water and the solvent was equilibrated at 300 K with the protein
coordinates held fixed. Finally, simulations were run on the Folding@home
distributed computing platform using an MPI-enabled version of Gromacs (58) at both
300 and 370 K. The details of this procedure are identical to those used in Ref (58)
and a full description can be found there. Most of the results described in this work are
from the 370 K data. This temperature was chosen to approximate the experimental
melting temperature, correcting for the fact that simulations tend to over-estimate the
melting temperature for this system (136).
Structures were rendered with PyMOL.
MSM CONSTRUCTION AND ANALYSIS
We used the MSMBuilder package (4, 10) to construct a microstate model with 30,000
states and a coarse-grained macrostate model with 5,000 states. The microstate model
was generated by clustering conformations stored at 5 ns intervals based on their Cα
RMSDs using the k-centers algorithm in MSMBuilder. The remaining data (50 ps
spacing) was then assigned to these clusters and used to construct a transition count
matrix (Cij = the number of observed transitions from state i at time t to state j at time
t+τ, where τ is the lag time of the model) and corresponding transition probability
matrix (Pij = probability of transitioning from state i at time t to state j at time t+τ,
where τ is the lag time of the model). The PCCA+ algorithm (20, 44, 261) was then
used to lump kinetically related microstates into 5,000 macrostates and these state
definitions were used to construct macrostate level transition count/probability
matrices.
The lag time for each model was selected by computing the implied timescales
of the model
\[
k = -\frac{\ln \mu}{\tau}
\]
where μ is an eigenvalue, τ is the lag time, and k is a rate. This equation comes from
the equivalence between discrete time MSMs and continuous time master equations
(see Refs (6) and (3) for details). By plotting the implied timescales as a function of
the lag time one can identify the lag time at which they begin to level-off (satisfy the
Chapman-Kolmogorov test), indicating that the model is Markovian (34). Based on
this analysis, we chose a lag time of 5 ns for our microstate model (Figure 65), where
all the kinetic analyses in this work were performed.
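The conversion from eigenvalues to implied timescales can be illustrated with a short Python sketch. It assumes that each implied timescale is the inverse of the rate k, that is −τ / ln μ, and it uses the fact that a two-state transition matrix [[1−a, a], [b, 1−b]] has eigenvalues 1 and 1−a−b. The function name `implied_timescales` and the numerical values of a and b are hypothetical.

```python
import math


def implied_timescales(eigenvalues, lag_time):
    """Convert transition-matrix eigenvalues to implied timescales, -tau / ln(mu)."""
    timescales = []
    for mu in sorted(eigenvalues, reverse=True):
        # Skip the stationary eigenvalue (mu = 1) and any non-positive eigenvalues,
        # for which the logarithm does not give a finite positive timescale.
        if 0.0 < mu < 1.0:
            timescales.append(-lag_time / math.log(mu))
    return timescales


# Hypothetical two-state model at a 5 ns lag time.
a, b = 0.02, 0.03
lag_ns = 5.0
timescales_ns = implied_timescales([1.0, 1.0 - a - b], lag_ns)
```

Repeating this calculation for models built at several lag times and plotting the results against the lag time gives the implied timescales plot used to choose the lag time.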
To calculate the relaxation of the fraction folded as measured by some
observable we used the procedure from Ref (58) to distinguish folded and non-native
states and the procedure from Ref (4) to propagate the fraction folded. For example,
with the experimental surrogate (Trp22-Tyr33 quenching) we calculated the average
and standard deviation of the distance between these residues (Nativeave and Nativestd
respectively) in native-state simulations started from a model of D14A based on the
1LMB crystal structure. Five random conformations were drawn from each state and
used to calculate the average distance between these residues for that state (Stateave).
A state was considered to be native if Stateave < Nativeave - Nativestd and non-native
otherwise. The fraction folded can then be calculated as the dot product of a
vector with 1’s for folded states and 0’s for non-native ones and the vector of state populations.
To mimic an ensemble T-jump we used two starting populations: 1) all states equally
populated and 2) all microstates in non-native macrostates (i.e. outside the most
populated macrostate) equally populated. The relaxation of these starting ensembles
was modeled by propagating the populations forward in time with the transition
probability matrix and calculating the fraction folded at each time step. The same
procedure was used for the fraction folded determined by the RMSD to the crystal
structure, which was examined to determine whether or not the Trp22-Tyr33 distance
could be measuring a more local rearrangement than full folding, as proposed for
villin (58). Figure 73 shows that these two observables gave similar timescales for the
full MSM and, while differences are apparent when the simulations started from β–
sheet structures are ignored, the timescales do not appear to be substantially slower for
the RMSD relaxation (Figure 74). The molecular and activated timescales (τm and τa
respectively) were obtained by fitting to the biexponential
\[
A e^{-t/\tau_m} + B e^{-t/\tau_a} + C
\]
where t is the time and A, B, and C are constants.
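The propagation of a starting ensemble and the calculation of the fraction folded at each step can be sketched as below. The three-state transition matrix, the choice of state 2 as the only folded state, and the function names `propagate` and `fraction_folded_curve` are assumptions made for illustration.

```python
def propagate(pop, P):
    """Advance the populations by one lag time: new_pop[j] = sum_i pop[i] * P[i][j]."""
    n = len(P)
    return [sum(pop[i] * P[i][j] for i in range(n)) for j in range(n)]


def fraction_folded_curve(pop0, P, folded, n_steps):
    """Fraction folded after 0, 1, ..., n_steps lag times."""
    pop = list(pop0)
    curve = []
    for _ in range(n_steps + 1):
        # Dot product of the folded-state indicator with the populations.
        curve.append(sum(p for p, is_folded in zip(pop, folded) if is_folded))
        pop = propagate(pop, P)
    return curve


# Hypothetical three-state model; state 2 plays the role of the native state.
P_toy = [[0.90, 0.10, 0.00],
         [0.05, 0.90, 0.05],
         [0.00, 0.01, 0.99]]
folded = [False, False, True]
start = [0.5, 0.5, 0.0]  # non-native states equally populated
curve = fraction_folded_curve(start, P_toy, folded, n_steps=500)
```

The resulting curve is the quantity to which the biexponential would be fit; the fit itself is not shown here.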
The states participating most strongly in a given transition mode are specified
by the corresponding left eigenvector (states with negative components are
interconverting with those with positive components, and the magnitude of the
eigenvector component gives the degree of participation) (1). The highest flux
pathways between sets of states were calculated as in Refs (258) and (5). Mean First
Passage Times (MFPTs) between states and Pfolds were calculated as in Ref (37).
Given our finite sampling, one can estimate the kinetic connectivity of a state
by counting the number of edges connecting it to other states (effectively a way of
counting the number of edges with probabilities above some threshold since all
connections would be made with infinite sampling).
Two residues were considered to be in contact if any pair of atoms was within
7 Å. Native contacts are those formed in the energy minimized model based on the
crystal structure 1LMB (130, 131). Solvent accessible surface areas were measured
using the g_sas program from Gromacs (43) with a 1.4 Å probe radius. The distance
between two residues is the distance between the centroids of their side chains.
Figure 65. Implied timescales for the full 370 K dataset.
Figure 66. Implied timescales for the 300 K dataset.
Figure 67. Implied timescales for ¾ of the 370 K dataset selected at random.
Figure 68. A coarse-grained view of the slowest transition with state sizes proportional to the free
energy and arrow widths proportional to the flux (see key in figure).
Figure 69. Another coarse-grained view of the slowest transition with state sizes proportional to the free
energy and arrow widths proportional to the flux (see key in figure). Here the states are laid out in
terms of the average number of β-sheet residues (calculated from 100 random conformations from
each state) and the pfold (probability of reaching the crystallographic state in L before the compact
β-sheet state in A).
Figure 70. Free energy projections of the microstate MSM onto typical order parameters like the radius
of gyration (Rg), the Cα RMSD to the crystal structure, and the distance between the Trp22 and
Tyr33 residues. Differences between the two panels highlight the difficulty in interpreting such
projections.
Figure 71. Free energy projection of the microstate MSM onto Pfold and the distance between the
Trp22 and Tyr33 residues. Obtaining projections onto kinetic order parameters like Pfold is greatly
simplified with MSMs. In this case Pfold refers to the probability of reaching the crystallographic
state before reaching the compact β-sheet state (i.e. the slow transition from Figure 21). Unlike the
projections in, this one hints that D14A may not be well described by a simple two- or three-state
model or that the Trp22-Tyr33 distance is not a good reaction coordinate, since there are a broad
range of Pfold values possible for a given Trp-Tyr distance. Indeed, analysis of the MSM reveals
that D14A is best described by a native hub.
Figure 72. The ten most populated macrostates with their equilibrium probabilities.
Figure 73. Relaxation of the fraction unfolded with different observables and observation times. The
thick black curves come from the MSM and the thin blue curves from biexponential fits to the
MSM relaxation. The top row shows relaxation of the fraction unfolded measured by the Trp22-
Tyr33 distance (A) starting from all states being equally populated and (B) starting from all non-
native states being equally populated. The bottom row shows relaxation of the fraction unfolded
measured by the Cα RMSD to the crystal structure (C) starting from all states being equally
populated and (D) starting from all non-native states being equally populated. Fitting parameters
are given in the figure (in units of microseconds). In this case, the fitting parameters are relatively
independent of the observable and starting distribution.
Figure 74. Relaxation of the fraction unfolded with different observables and observation times from an
MSM built without the trajectories started from β-sheet structures. The thick black curves come
from the MSM and the thin blue curves from biexponential fits to the MSM relaxation. The top row
shows relaxation of the fraction unfolded measured by the Trp22-Tyr33 distance (A) starting from
all states being equally populated and (B) starting from all non-native states being equally
populated. The bottom row shows relaxation of the fraction unfolded measured by the Cα RMSD to
the crystal structure (C) starting from all states being equally populated and (D) starting from all
non-native states being equally populated. Fitting parameters are given in the figure (in units of
microseconds). In this case the fitting parameters are more dependent on the observable, consistent
with the experimental observation of probe dependent kinetics.
Figure 75. Projection of the free energy onto pfold (A) from the compact β-sheet state in Figure 22A to
the native state in Figure 22H, (B) from the extended state in Figure 22E to the native state in
Figure 22H, and (C) from the extended state in Figure 22E to the native state in Figure 22G. None
are purely downhill, though some may be consistent with incipient downhill folding (i.e. have
sufficiently low barriers that there is a reasonable population at the barrier top that can fold in a
downhill manner in addition to activated folding across the barrier).
Figure 76. The helicity of each residue predicted from Agadir (143). The purple, numbered bars show
where the five helices are (the extra purple block between helices 4 and 5 is a turn).
APPENDIX F: SUPPORTING INFORMATION FOR CHAPTER 6
Figure 77. Uncertainty in the log base 10 of the relative entropies averaged over 10 independent
samples of (A) reference simulations of M1 and (B) adaptive sampling of M1. Black lines are
contours of equal amounts of data.
Figure 78. Uncertainty in the log base 10 of the relative entropies averaged over 10 independent
samples of (A) reference simulations of M2 and (B) adaptive sampling of M2. Black lines are
contours of equal amounts of data.
APPENDIX G: SUPPORTING INFORMATION FOR CHAPTER 9
SERIAL REPLICA EXCHANGE (SREMD)
Molecular Dynamics (MD) is a powerful technique for exploring the conformational
space of biomolecules. However, MD simulations often spend a significant portion of
time trapped in local free energy minima. Replica Exchange Molecular Dynamics
(REMD) (22, 23) was developed to overcome this problem by inducing a random
walk in temperature space. In REMD, independent MD simulations are performed in
parallel at different temperatures. At regular intervals attempts are made to exchange
configurations between temperatures. These exchanges are accepted according to a
well defined transition probability. The REMD scheme requires synchronization of
different processors, which makes it unsuitable for a heterogeneous distributed
computing environment.
Serial Replica Exchange Molecular Dynamics (SREMD) (177, 224) is a serial
version of REMD that is suitable for distributed computing. In SREMD, a single
simulation performs a random walk in temperature space by making regular attempts
to swap temperatures. The transition probability for this move is determined by one
potential energy from the simulation and a second one from a pre-stored potential
energy distribution function (PEDF) at the new temperature. SREMD has been shown
to be an efficient sampling method when applied in a distributed computing
environment (177). However, we note that SREMD is only approximately correct
unless the exact PEDFs are adopted.
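A single temperature move of this kind might be sketched as follows. The sketch assumes the standard replica exchange Metropolis criterion, with the energy of the exchange partner replaced by an energy drawn from the stored PEDF at the new temperature; representing the PEDF as a list of energy samples, the function name `attempt_temperature_move`, and the units (kJ/mol) are assumptions for illustration rather than details of the actual implementation.

```python
import math
import random

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def attempt_temperature_move(energy, t_old, t_new, pedf_samples_new, rng):
    """Decide whether a single simulation moves from t_old to t_new.

    energy:           current potential energy of the simulation (kJ/mol)
    pedf_samples_new: energies representing the stored PEDF at t_new
    """
    beta_old = 1.0 / (K_B * t_old)
    beta_new = 1.0 / (K_B * t_new)
    # The second energy comes from the stored distribution, not from a live replica.
    partner_energy = rng.choice(pedf_samples_new)
    delta = (beta_new - beta_old) * (energy - partner_energy)
    # Metropolis rule: always accept if delta <= 0, otherwise accept with exp(-delta).
    return delta <= 0.0 or rng.random() < math.exp(-delta)


rng = random.Random(1)
accepted = attempt_temperature_move(-100.0, 300.0, 310.0, [-200.0], rng)
```

Because no second simulation is involved, each simulation can attempt such moves without synchronizing with any other processor.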
SIMULATION DETAILS
Our simulations used the AMBER 94 potential (262). The SREMD algorithm was
implemented in a version of the GROMACS (43) molecular dynamics simulation
package modified for the Folding@Home (79) infrastructure (http://folding.stanford.edu).
The RNA molecule was solvated in a water box with 3943 TIP3P (263)
waters and 11 Na+. The simulation system was minimized using a steepest descent
algorithm, followed by a 100ps MD simulation applying a position restraint potential
to the RNA heavy atoms. All simulations were run with constant NVT by coupling to
a Nose-Hoover thermostat with a coupling constant of 0.02 ps⁻¹ (63). A cutoff of 10 Å
was used for non-bonded interactions. Long-range electrostatic interactions were
treated with the Particle-Mesh Ewald (PME) method (264). Nonbonded pair-lists were
updated every 10 steps with an integration step size of 2 fs in all simulations. All
bonds were constrained using the LINCS algorithm (265) .
2,800 SREMD simulations with an aggregate simulation time of 54.6 µs
starting from the NMR structure (PDB code 1ZIH) (209) were performed. The
temperature list was roughly exponentially distributed, with 56 temperatures covering
a range from 285 to 592K. To obtain initial estimates of the PEDFs, we performed 56
3ns SREMD simulations where every move was accepted. For the Folding@home
(FAH) runs, the initial temperatures were uniformly selected from the temperature list.
Thus, there are 50 simulations starting from each temperature, each with different
initial velocities. The PEDFs were updated every 40ns for 40 iterations, then every
400ns for 20 iterations, and finally every 1000ns.
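For orientation, a temperature list of this kind can be approximated by a geometric ladder, in which neighboring temperatures differ by a constant ratio. The sketch below is an assumption about what "roughly exponentially distributed" might look like, not the actual temperature list used; the function name `geometric_ladder` is hypothetical.

```python
def geometric_ladder(t_min, t_max, n_temps):
    """Temperatures spaced by a constant ratio between t_min and t_max."""
    ratio = (t_max / t_min) ** (1.0 / (n_temps - 1))
    return [t_min * ratio ** k for k in range(n_temps)]


temps = geometric_ladder(285.0, 592.0, 56)

# With starting temperatures assigned uniformly, 2,800 simulations over
# 56 temperatures gives 50 simulations per temperature.
n_sims = 2800
sims_per_temp = n_sims // len(temps)
```

A constant ratio between neighboring temperatures is a common way to keep exchange acceptance roughly uniform across the ladder.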
TOPOLOGICAL METHOD (MAPPER) FOR PATHWAY ANALYSIS
Our SREMD simulations generate a massive number of configurations. Therefore, it is
difficult to discern the structure of the data. Such data is normally dominated by the
folded and unfolded structures. However, we are interested in understanding structures
in transition states or intermediate states. Direct application of clustering algorithms to
all the configurations will be biased toward the densest regions (i.e. folded/unfolded
states in this study), making it difficult to identify the sparsely populated intermediate
states of interest. Furthermore, such clustering methods will not provide any
information on the connectivity between different clusters.
To address such issues, Yao et al. (228) proposed a topological data analysis
method to explore pathways in biomolecular folding, based on Mapper [14], a general
topological data analysis tool for high dimensional data sets. This method efficiently
identifies intermediate states along a pathway. Roughly speaking, we use Mapper with
filters based on some conditional density function estimated from the data. Then the
data is divided into overlapping level sets based on the filter. Single-linkage clustering
is then used within each density level. Finally a graph is generated with a node
corresponding to each cluster and edges between pairs of nodes in neighboring level
sets that have non-zero overlap.
We note that clusters may be intrinsically non-convex in biomolecular folding
problems. K-means type clustering algorithms will fail for such clusters. The use of
single-linkage clustering in density levels in Mapper allows the efficient discovery of
non-convex clusters and separates sparsely populated intermediate states from the
dominant unfolded/folded states. For details on how such a scheme works, readers are
referred to [13].
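A minimal sketch of the Mapper-style construction outlined above is given here. It is not the implementation used in this work: the filter values are simply supplied by the caller (the density-based filter is not reproduced), the level sets are equal-width intervals widened by a fractional `overlap`, and the names `single_linkage` and `mapper_graph` are hypothetical.

```python
import math


def single_linkage(points, idx, eps):
    """Cluster the indices in idx: two points share a cluster if a chain of
    neighbors within distance eps connects them (single linkage, union-find)."""
    parent = {i: i for i in idx}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for pos, a in enumerate(idx):
        for b in idx[pos + 1:]:
            if math.dist(points[a], points[b]) <= eps:
                parent[find(a)] = find(b)
    clusters = {}
    for i in idx:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())


def mapper_graph(points, filter_values, n_levels, overlap, eps):
    """Return (nodes, edges); each node is (level, set of point indices)."""
    lo, hi = min(filter_values), max(filter_values)
    width = (hi - lo) / n_levels
    nodes = []
    for level in range(n_levels):
        # Overlapping level set of the filter.
        a = lo + (level - overlap) * width
        b = lo + (level + 1 + overlap) * width
        idx = [i for i, f in enumerate(filter_values) if a <= f <= b]
        for cluster in single_linkage(points, idx, eps):
            nodes.append((level, cluster))
    # Connect clusters in neighboring level sets that share at least one point.
    edges = [(m, n)
             for m in range(len(nodes)) for n in range(m + 1, len(nodes))
             if abs(nodes[m][0] - nodes[n][0]) == 1 and nodes[m][1] & nodes[n][1]]
    return nodes, edges


# Toy data: ten points on a line, using the coordinate itself as the filter.
points = [(float(x),) for x in range(10)]
filter_values = [float(x) for x in range(10)]
nodes, edges = mapper_graph(points, filter_values, n_levels=3, overlap=0.25, eps=1.5)
```

Because clustering is done separately within each level set, a sparse intermediate region receives its own node instead of being absorbed by a dense neighbor.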
PEDFS
Figure 79 (a) shows SREMD PEDFs from our massive distributed computing
simulations. The convergence of the PEDFs can be verified by the χ² convergence
measure, which is defined as an integrated error (224):
$$\chi^2(t) = \sum_{i=1}^{N} \left( P_i(t) - P_i^{\mathrm{ref}} \right)^2$$
where N is the number of bins in the potential energy histogram, Pi(t) is the value of
the ith bin of the potential energy histogram generated by potential energies collected
over time t at a particular temperature, and Piref is the reference PEDF.
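As an illustration of this measure, the sketch below computes the summed squared difference between two binned distributions. The histogram helper and its normalization to unit sum are assumptions made for the example, and the function names are hypothetical.

```python
def normalized_histogram(energies, lo, hi, n_bins):
    """Bin energies in [lo, hi) and normalize the counts to sum to one."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for e in energies:
        if lo <= e < hi:
            # Guard against floating-point round-up at the upper edge.
            counts[min(int((e - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


def chi_squared(p_t, p_ref):
    """Sum over bins of the squared difference between two histograms."""
    if len(p_t) != len(p_ref):
        raise ValueError("histograms must have the same number of bins")
    return sum((a - b) ** 2 for a, b in zip(p_t, p_ref))


p_ref = normalized_histogram([0.1, 0.2, 0.6, 0.7], 0.0, 1.0, 2)
p_now = normalized_histogram([0.1, 0.6, 0.7, 0.8], 0.0, 1.0, 2)
error = chi_squared(p_now, p_ref)
```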
Figure 79 (b) displays the χ² convergence measure averaged over all
temperatures. When the final PEDFs are used as the reference distributions, χ²(t)
decays to zero. On the other hand, when the PEDFs from the initial 3 ns constant
temperature simulations (Pinitial) are used as the reference, χ²(t) grows to a plateau
value. The χ²(t) values for single temperatures show the same trends as these
averaged values. Therefore, the PEDFs have converged.
MELTING CURVES
Figure 80 shows the native contacts melting curve. The data demonstrates that folded
conformations dominate at low temperatures while extended structures dominate at
high temperatures.
Figure 79. (a) Potential Energy Distribution Functions (PEDFs) generated from Folding@home data at
each of the 56 temperatures used. (b) The χ² convergence measure averaged over all temperatures
as a function of time. Triangles correspond to using Pfinal as the reference distribution and circles
correspond to using Pinitial as the reference.
Figure 80. Native contacts melting curve. Only every third temperature is displayed for clarity.
APPENDIX H: SUPPORTING INFORMATION FOR CHAPTER 10
INITIAL CONFIGURATIONS
We started our ST simulations from two different initial configurations as shown in
Figure 81: a near-native state and a random coil. The near-native state was created by
analogy to the NMR structure of the GCAA tetraloop (first structure of PDB code 1zih
(209)). The random coil conformation was created with the Nucleic Acid Builder
(266).
Figure 81. The two initial structures used in this study: A) A near-native conformation and B) a random
coil conformation.
THE CONVERGENCE OF WEIGHTS IN SIMULATED TEMPERING (ST)
SIMULATED TEMPERING
In Simulated Tempering (ST) (24, 25), configurations are sampled from a mixed
canonical ensemble in which the canonical ensembles with different temperatures are
weighted differently as defined by a generalized Hamiltonian:
$$\Xi_i(X, p) = \beta_i H(X, p) - g_i \qquad \text{(H1)}$$
where βi = 1/(kBTi), H(X, p) is the Hamiltonian for the canonical ensemble at
temperature Ti, X denotes the conformation, and p is the momentum. The constant gi,
determined a priori, is the weight for the temperature Ti.
ST works as follows: a single simulation starts from a particular temperature
(Ti) and an attempt is made periodically to change the configuration (Xn) to another
temperature (Tj) according to a well-defined transition probability that satisfies the
detailed balance condition:
$$P_i(X_n, p_n)\, P(i \to j) = P_j(X_n, p_n')\, P(j \to i) \qquad \text{(H2)}$$
The probability of configuration Xn at temperature Ti for the expanded canonical
ensemble is,
$$P_i(X_n, p_n) = \frac{1}{Z} \exp\left(-\Xi_i(X_n, p_n)\right) = \frac{1}{Z} \exp\left(-\beta_i H(X_n, p_n) + g_i\right) \qquad \text{(H3)}$$
where pn is the momentum and Z is the partition function for the expanded canonical
ensemble. H(Xn, pn) is the sum of the kinetic energy (K) and the potential energy (U),
$H(X_n, p_n) = K(p_n) + U(X_n)$.
A re-scaling of the momentum ($p_n' = \sqrt{T_j / T_i}\, p_n$) following the exchange
causes the kinetic energy to cancel out in the detailed balance equation, and the
transition probability after applying the Metropolis criterion is shown below,
$$P_{i \to j} = \min\left\{1,\; e^{-(\beta_j - \beta_i) U(X_n) + (g_j - g_i)}\right\} \qquad \text{(H4)}$$
where U(Xn) is the potential energy for configuration Xn, which is sampled from the
canonical ensemble at Ti. A set of weights needs to be pre-determined to calculate
these transition probabilities. Without proper weighting, ST simulations will be
constrained to a subset of the temperature space and become inefficient (25, 177). It
was shown that weights leading the system to perform a random walk in temperature
space equal the unit-less free energies at different temperatures (24, 25).
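A temperature move of this kind can be sketched as follows. The sketch assumes energies in kJ/mol and a Metropolis acceptance of the form min{1, exp[-(βj - βi)U + (gj - gi)]}, as in Equation (H4); the function names and the value used for Boltzmann's constant are choices made for the example, not taken from the simulation code.

```python
import math
import random

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K); assumed energy unit


def st_acceptance(u, t_i, t_j, g_i, g_j, k_b=K_B):
    """Metropolis acceptance probability for a move from T_i to T_j."""
    beta_i = 1.0 / (k_b * t_i)
    beta_j = 1.0 / (k_b * t_j)
    exponent = -(beta_j - beta_i) * u + (g_j - g_i)
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def rescale_momenta(momenta, t_i, t_j):
    """Scale momenta by sqrt(T_j / T_i) after an accepted move."""
    scale = math.sqrt(t_j / t_i)
    return [scale * p for p in momenta]


def attempt_move(u, i, j, temps, weights, rng=random):
    """Return the new temperature index after one attempted move i -> j."""
    p = st_acceptance(u, temps[i], temps[j], weights[i], weights[j])
    return j if rng.random() < p else i
```

If all weights are set to zero, moves toward higher temperatures are penalized for the negative potential energies typical of solvated systems, which illustrates why poorly chosen weights confine the simulation to part of the temperature list.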
SIMULATED TEMPERING EQUAL ACCEPTANCE RATIO (STEAR) METHOD
It is not an easy task to determine the free energy weights that enable the system to
perform a random walk in temperature space. The Simulated Tempering Equal Acceptance
Ratio (STEAR) method for determining the free energy weights is adopted in this
study (49, 177). This method is based on the property that the free energy weights
leading to uniform sampling must yield the same acceptance ratios for both forward
and backward transitions from Ti to Tj as shown below.
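$$\langle P_{i \to j}(g_j - g_i, U_i) \rangle = \langle P_{j \to i}(g_i - g_j, U_j) \rangle \qquad \text{(H5)}$$
with the average acceptance ratios given by
$$\langle P_{i \to j} \rangle = \int P_{i \to j}(g_j - g_i, U_i)\, P(U_i)\, dU_i$$
$$\langle P_{j \to i} \rangle = \int P_{j \to i}(g_i - g_j, U_j)\, P(U_j)\, dU_j \qquad \text{(H6)}$$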
where Ui is the potential energy for a configuration sampled from the canonical
ensemble at temperature Ti and P(Ui) is the potential energy distribution function
(PEDF) at Ti. PEDFs for each temperature are initially estimated from short trial MD
simulations and then updated during an equilibration phase preceding the production
phase, which uses a static set of weights. By solving Equation (H5), we can obtain a
set of weights close to the free energy weights.
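The equal acceptance ratio condition can be solved numerically for one pair of neighboring temperatures as sketched below. The sketch replaces the integrals over the PEDFs with averages over sampled potential energies and uses bisection rather than Newton's method, purely to keep the example short and robust; the function names and the bracketing interval are assumptions.

```python
import math


def mean_acceptance(energies, d_beta, d_g):
    """Average of min(1, exp(-d_beta * U + d_g)) over sampled energies U."""
    total = 0.0
    for u in energies:
        x = -d_beta * u + d_g
        total += 1.0 if x >= 0.0 else math.exp(x)
    return total / len(energies)


def equal_acceptance_weight(u_i, u_j, beta_i, beta_j, lo=-1e3, hi=1e3, tol=1e-10):
    """Find d_g = g_j - g_i that equalizes forward and backward acceptance.

    u_i and u_j are potential energies sampled at T_i and T_j. The default
    bracket assumes |(beta_j - beta_i) * U| stays well below 1e3.
    """
    d_beta = beta_j - beta_i

    def imbalance(d_g):
        forward = mean_acceptance(u_i, d_beta, d_g)      # i -> j
        backward = mean_acceptance(u_j, -d_beta, -d_g)   # j -> i
        return forward - backward

    # The imbalance increases monotonically with d_g, so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy energy samples (hypothetical values) at two neighboring temperatures.
u_low = [-12.0, -11.0, -10.5, -10.0]
u_high = [-10.0, -9.5, -9.0, -8.0]
d_g = equal_acceptance_weight(u_low, u_high, beta_i=1.0, beta_j=0.9)
```

Setting g1 = 0 and accumulating the pairwise differences along the temperature list then yields the full set of weights.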
DETAILED PROCEDURE TO UPDATE THE WEIGHTS
The ST algorithm was implemented in version 3.1.4 of the GROMACS (43) molecular
dynamics simulation package modified for the Folding@Home (79) infrastructure
(http://folding.stanford.edu). In our ST simulations, the temperature list (T1 … Tn)
containing 56 temperatures is roughly exponentially distributed between 270 and 592
K. The detailed procedure to determine the weights using STEAR is described
below.
Obtaining the initial weights: For each of the two initial configurations (see
Figure 81), one 2 ns NVT simulation was carried out at each of 56 temperatures on a
computer cluster. Potential energies collected every 0.1 ps from the last nanosecond of
these simulations were used to get a rough approximation of the energy distribution at
each temperature. The weight (gi) that gives an equal acceptance ratio for transitions
from Ti to Ti+1 and vice versa is found using Newton’s method (See Equation (H5))
and g1 is set to zero.
Updating the weights: Once an initial set of weights has been chosen, we start
1120 ST simulations from each initial configuration on the Folding@Home distributed
computing environment. In these simulations, a temperature swap is attempted every
0.2 ps. At regular intervals (about every 300ns of simulation in total) all the new data
is collected and only new data is used to refine the approximation of the energy
distribution at each temperature. Newton’s method is then used to update the weights
to satisfy the equal acceptance ratio criterion given the new energy distributions, as
expressed in Equation (H5).
CONVERGENCE OF THE WEIGHTS
The weights obtained from two independent sets of ST simulations starting from
different initial configurations converge well, as shown in Table 2. The weights
converge at about 9 ns for each initial configuration. As described before, a set of
converged weights, i.e. free energy weights, should induce uniform sampling of the
temperature space. As shown in Figure 82, both sets of simulations achieve uniform
sampling at about 9ns. Thus, after about 9 ns, the weights are held static and the
simulations are continued in what is called the production phase.
Figure 82. Amount of sampling at different temperatures for ST simulations started from the native (top
row) and coil (bottom row) configurations, computed from different segments of simulation time:
0-0.3 ns, 1.2-1.5 ns, 2.7-3.0 ns, and 8.7-9.0 ns. Uniform sampling is reached for both sets
of ST simulations, indicating that the weights have converged.
MOLECULAR DYNAMICS (MD) SIMULATION DETAILS
Our MD simulations used the nucleic acid parameters from the AMBER99 force field
(60, 267). The RNA molecule was solvated in a water box with 2543 TIP3P (263)
waters and 7 Na+ ions. The simulation system was minimized using a steepest descent
algorithm, followed by a 100ps MD simulation applying a position restraint potential
to the RNA heavy atoms. All NVT simulations were coupled to a Nose-Hoover
thermostat with a coupling constant of 0.02ps-1 (63). A cutoff of 10 Å was used for
both vdW and short range electrostatic interactions. Long-range electrostatic
interactions were treated with the Particle-Mesh Ewald (PME) method (264).
Nonbonded pair-lists were updated every 10 steps with an integration step size of 2 fs
in all simulations. All bonds were constrained using the LINCS algorithm (265).
HIERARCHICAL K-MEDOIDS CLUSTERING ALGORITHM
A hierarchical K-medoids clustering algorithm developed by Boxer, G. is used in this
study. In K-medoids clustering one starts by choosing some number of random
conformations to be generators. All remaining conformations are then assigned to the
generator that they are most similar to, thus forming a state corresponding to each
generator. Each generator is then updated by choosing a number of random
conformations from its corresponding state and selecting the one that is closest to
every other conformation in the state (i.e. the one that is closest to the center of the
state) as the new generator. This updating procedure may be continued for some
predetermined number of iterations or until the answer converges. The basic idea of
hierarchical clustering is to perform K-medoids clustering on the entire dataset and
then to recursively perform K-medoids clustering on each state until every state has
fewer conformations than some threshold. This threshold is set as an input parameter
for the K-medoids clustering algorithm.
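The procedure described above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the code used in this study: the distance function is supplied by the caller (heavy atom RMSD in the text, absolute difference of numbers in the toy example), and the defaults for the number of iterations and candidate medoids are arbitrary.

```python
import random


def k_medoids(points, k, dist, n_iter=5, n_candidates=10, rng=random):
    """Partition points into at most k states around medoid generators."""
    generators = rng.sample(points, k)
    states = []
    for _ in range(n_iter):
        # Assign every point to the generator it is most similar to.
        states = [[] for _ in generators]
        for p in points:
            nearest = min(range(len(generators)),
                          key=lambda g: dist(p, generators[g]))
            states[nearest].append(p)
        states = [s for s in states if s]  # drop empty states
        # Update each generator: among a few random members of its state,
        # pick the one with the smallest total distance to the state.
        generators = []
        for s in states:
            candidates = rng.sample(s, min(n_candidates, len(s)))
            generators.append(min(candidates,
                                  key=lambda c: sum(dist(c, q) for q in s)))
    return states


def hierarchical_k_medoids(points, k, threshold, dist, rng=random):
    """Recursively split states until each has fewer than threshold points."""
    if len(points) < threshold:
        return [points]
    states = k_medoids(points, min(k, len(points)), dist, rng=rng)
    if len(states) < 2:  # cannot be split further (e.g. identical points)
        return [points]
    leaves = []
    for s in states:
        leaves.extend(hierarchical_k_medoids(s, k, threshold, dist, rng))
    return leaves


# Toy example: 200 "conformations" on a line, split until states hold < 25.
data = [float(i) for i in range(200)]
leaves = hierarchical_k_medoids(data, k=2, threshold=25,
                                dist=lambda a, b: abs(a - b),
                                rng=random.Random(0))
```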
Table 2. Convergence of the weights at representative temperature pairs. Weight differences
Δg = gj − gi obtained from distributed computing simulations starting from a helical structure (third
column) and a coil structure (fourth column) are shown for different temperature pairs. Differences
between the free energy differences Δfji = gj/βj − gi/βi obtained from simulations starting from a
helical structure and a coil structure are displayed in the fifth column. kT at temperature i is shown
in the sixth column. Δfji(helical) − Δfji(coil) (kJ/mol) is smaller than kT (kJ/mol) at all temperature pairs.
MARKOV STATE MODELS
A Markov model is basically a graph representing the structure and temporal
connectivity of some dataset that consists of temporally ordered observations (3, 6). In
this case, each node corresponds to a set of kinetically similar conformations. These
nodes are connected by directed edges with corresponding values equal to the
probability of transitioning between them. For the model to be Markovian, the
probability of transitioning to state j must depend solely on the previous state.
A Markov State Model (MSM) may also be represented by a transition
probability matrix as (see also Equation 1 in the main text)
$$P(\Delta t) = T(\Delta t)\, P(0) \qquad \text{(H7)}$$
where P(∆t) is a vector of state populations at time ∆t, T is the column-stochastic
transition probability matrix, and ∆t is the lag time (or time step). Using this
representation, the time evolution of a vector representing the population of each state
may be calculated by repeatedly left-multiplying the column vector by the transition
probability matrix. The model also has a corresponding lag time, which is effectively
the time resolution of the model. Each step, or multiplication by the transition
probability matrix, is equivalent to one lag time. For the model to be Markovian there
must be a separation of timescales. That is, equilibration within states must occur on
timescales faster than the lag time while transitions between states must occur on
timescales longer than the lag time. The key is finding an appropriate balance between
the number of states in the model and the lag time. A desirable Markov model has few
enough states that it may be understood by a person and a lag time shorter than the
timescale of the process of interest.
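The propagation of state populations can be illustrated with a two-state example. The matrix entries below are hypothetical, and the column-stochastic convention (entry T[i][j] is the probability of moving from state j to state i in one lag time) follows the description above.

```python
def propagate(T, p0, n_steps):
    """Apply p <- T p repeatedly; each step corresponds to one lag time."""
    p = list(p0)
    n = len(p)
    for _ in range(n_steps):
        p = [sum(T[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p


# Hypothetical column-stochastic matrix: each column sums to one.
T = [[0.9, 0.2],
     [0.1, 0.8]]

p_one_step = propagate(T, [1.0, 0.0], 1)
p_long_time = propagate(T, [1.0, 0.0], 200)
```

After many steps the populations no longer change, which gives the equilibrium distribution of the model (here two thirds and one third).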
The eigenvalues (μk) of the transition matrix each imply a timescale (τk):
$$\tau_k = -\frac{\tau}{\ln \mu_k(\tau)} \qquad \text{(H8)}$$
where μk is an eigenvalue of the transition matrix with the lag time τ.
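As a small worked example of this relation, the sketch below converts an eigenvalue into an implied timescale, assuming the standard form in which the timescale equals minus the lag time divided by the logarithm of the eigenvalue. For a two-state matrix the non-unit eigenvalue is the trace minus one, which avoids the need for a numerical eigensolver; the matrix entries are hypothetical.

```python
import math


def implied_timescale(eigenvalue, lag_time):
    """Timescale implied by one eigenvalue, in the units of lag_time."""
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must lie strictly between 0 and 1")
    return -lag_time / math.log(eigenvalue)


def two_state_slow_eigenvalue(T):
    """Non-unit eigenvalue of a 2x2 stochastic matrix (trace minus one)."""
    return T[0][0] + T[1][1] - 1.0


T = [[0.9, 0.2],
     [0.1, 0.8]]
mu = two_state_slow_eigenvalue(T)
tau = implied_timescale(mu, lag_time=1.0)
```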
The focus of the current study is thermodynamics instead of kinetics. The first
left eigenvector of the transition matrix Tij corresponds to the equilibrium distribution
(6).
SPLITTING INTO MICROSTATES
The first step in our procedure to build an MSM is to divide all the conformations
sampled into small sets of structurally similar configurations called microstates (3, 6).
This is accomplished using the hierarchical K-medoids clustering algorithm described
above. For example, by setting the threshold at which the hierarchical K-medoids
clustering stops splitting a state to 2500 conformations, we divided 1.3
million conformations generated from long ASM seeding simulations into 1,597
microstates. Heavy atom RMSD is used as the distance metric, since it accounts for
both local and global similarities between pairs of conformations. This
distance metric has also been shown to be able to distinguish between kinetically
distinct conformations. If the state population threshold is chosen to be small enough
then the conformations in one microstate may be considered to be kinetically as well
as structurally similar as it would require very few MD steps to get from one to
another. As shown in Figure 83, overlaid structures from the same microstate have
great structural similarity. Based on this assumption, one may build a microstate
Markov model by using the original data to calculate the probability of transitioning
between each pair of microstates (stored as a transition probability matrix). Because of
the small size of each microstate, this Markov model will have too many states to
provide any insight into the nature of the free energy landscape. To gain a clearer
understanding of the free energy landscape one may lump together kinetically similar
microstates to form macrostates. These macrostates comprise a new MSM that
hopefully has an appropriate separation of timescales.
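Estimating the microstate transition probability matrix from the data can be sketched as follows. The sketch assumes each trajectory has already been converted to a sequence of microstate indices, counts only non-overlapping transitions separated by the lag time, and stores the result in column-stochastic form; the function names are hypothetical.

```python
def count_matrix(trajectories, n_states, lag):
    """Count transitions at the given lag; C[i][j] counts moves from j to i."""
    C = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        # Non-overlapping windows: t -> t + lag, t + lag -> t + 2 lag, ...
        for t in range(0, len(traj) - lag, lag):
            C[traj[t + lag]][traj[t]] += 1
    return C


def column_stochastic(C):
    """Normalize each column of the count matrix to sum to one."""
    n = len(C)
    T = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(C[i][j] for i in range(n))
        for i in range(n):
            if total:
                T[i][j] = C[i][j] / total
            else:
                T[i][j] = 1.0 if i == j else 0.0  # unvisited state stays put
    return T


# Toy discrete trajectory over two microstates.
C = count_matrix([[0, 0, 1, 1, 0]], n_states=2, lag=1)
T = column_stochastic(C)
```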
Figure 83. Three example structures from a single microstate.
LUMPING INTO METASTABLE STATES
Lumping is done by first calculating the eigenvalues and eigenvectors of the
microstate transition probability matrix (44). The eigenvalues are related to the
timescale for interconverting between two sets of microstates while the corresponding
eigenvectors indicate which microstates constitute these two sets if the model is
Markovian at this timescale. We estimate the number of macrostates based on the gap
in the implied timescales (see Equation (H8)) of the microstate transition probability
matrix as a function of the lag time. As shown in Figure 84, there are six macrostates
for the seeding simulations.
Figure 84. The largest one hundred implied timescales as a function of the lag time for (a) ST
simulations starting from the coil initial configuration. (b) The long adaptive seeding microstate
MSM.
Sets of kinetically related microstates are grouped together into macrostates
using a spectral clustering algorithm: Perron Cluster Cluster Analysis (PCCA) (45).
While generating the transition count matrix, all the recorded transitions are
independent (i.e. transitions from time t to 2t, 2t to 3t, etc). The initial lumping
calculated from this data is refined by using a Simulated Annealing (SA) scheme to
maximize the metastability (Q) of the model (6). Twenty SA runs of 20,000 steps each
are used. In each simulated annealing step, a microstate is randomly reassigned to a
new macrostate and the move is accepted using the Metropolis criterion. The
metastability is defined as the sum of the self-transition probabilities of each
macrostate ($Q = \sum_{i=1}^{N} P_{ii}$). Maximizing the metastability is assumed to be
a good way of maximizing the separation of timescales necessary for a valid MSM. The
metastability is shown in Table 3.
               N. Metastable States    Q       <Pii>
ST (Native)    6                       5.09    0.848
ST (Coil)      6                       5.01    0.835
Seeding        6                       5.61    0.935
Table 3. Metastability (Q) and average self-transition probability <Pii> between metastable states for
the MSMs built from ST simulations and seeding simulations.
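The metastability of a given lumping can be computed from the microstate transition counts as sketched below. The count matrix is hypothetical, the convention C[i][j] for transitions from j to i is an assumption of the example, and the simulated annealing search over assignments is not shown.

```python
def lump_counts(C, assignment, n_macro):
    """Sum microstate counts C[i][j] (j -> i) into macrostate counts."""
    M = [[0.0] * n_macro for _ in range(n_macro)]
    for i, row in enumerate(C):
        for j, c in enumerate(row):
            M[assignment[i]][assignment[j]] += c
    return M


def metastability(C, assignment, n_macro):
    """Sum of macrostate self-transition probabilities."""
    M = lump_counts(C, assignment, n_macro)
    q = 0.0
    for a in range(n_macro):
        total = sum(M[b][a] for b in range(n_macro))
        if total:
            q += M[a][a] / total
    return q


# Hypothetical counts for four microstates forming two kinetic groups.
C = [[8, 2, 0, 0],
     [2, 8, 1, 0],
     [0, 0, 8, 2],
     [0, 0, 1, 8]]
q_good = metastability(C, [0, 0, 1, 1], 2)  # lump kinetically close states
q_bad = metastability(C, [0, 1, 0, 1], 2)   # lump kinetically distant states
```

Dividing Q by the number of macrostates gives the average self-transition probability reported in Table 3 (for example, 5.09 / 6 ≈ 0.848).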
DETERMINING STATE POPULATIONS AND UNCERTAINTIES
Simulation trajectories are used to estimate transitions between different metastable
states in order to build a MSM. Such estimation induces uncertainties in any property
computed from the model including the metastable state equilibrium population we
pursued in this study. Therefore, obtaining the uncertainties is important to test the
reliability of our results. To estimate these uncertainties, we employ a
Bayesian method introduced by Noe (251). Assuming that the system is Markovian at
the given lag time, the method defines the following stochastic model for its
parameters. The likelihood of any trajectory is simply the product of independent
transition probabilities, as a consequence of the Markov property, and the transition
probability matrix T is assigned an independent, symmetric Dirichlet prior in each
row. This is the conjugate prior for the Markov likelihood, which means that the
posterior distribution of T after observing a number of transitions has the same
functional form as the prior. This method makes the further assumption that the
system obeys detailed balance, so the distributions of T are restrained to the space of
reversible stochastic matrices. This distribution is difficult to normalize analytically,
but it may be sampled using a Markov Chain Monte Carlo (MCMC) algorithm. It was
shown (251) that the restriction to reversible matrices greatly reduces the uncertainty
of many thermodynamic properties, which is why it was deemed necessary in our
study. Using this method, we were able to sample from the posterior distribution of T,
given our simulation data, to obtain stable Monte Carlo estimates of the deviations of
equilibrium populations.
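The idea of sampling transition matrices from a posterior distribution can be illustrated with a deliberately simplified sketch. Unlike the method described above, the sketch does not enforce detailed balance and does not use MCMC: each row is drawn independently from a Dirichlet distribution, which is the posterior for a Dirichlet prior without the reversibility constraint. The counts are hypothetical, a row-stochastic convention is used here (counts[i][j] is the number of transitions from i to j), and all function names are illustrative.

```python
import random


def sample_transition_matrix(counts, prior=1.0, rng=random):
    """Draw one row-stochastic matrix with row i ~ Dirichlet(counts[i] + prior)."""
    T = []
    for row in counts:
        # Normalized gamma variates are Dirichlet distributed.
        draws = [rng.gammavariate(c + prior, 1.0) for c in row]
        total = sum(draws)
        T.append([d / total for d in draws])
    return T


def stationary_distribution(T, n_iter=500):
    """Approximate pi = pi T by repeated multiplication (row-stochastic T)."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi


def population_uncertainty(counts, n_samples=200, rng=random):
    """Posterior mean and standard deviation of each equilibrium population."""
    n = len(counts)
    samples = [stationary_distribution(sample_transition_matrix(counts, rng=rng))
               for _ in range(n_samples)]
    means = [sum(s[k] for s in samples) / n_samples for k in range(n)]
    stds = [(sum((s[k] - means[k]) ** 2 for s in samples)
             / (n_samples - 1)) ** 0.5 for k in range(n)]
    return means, stds


counts = [[90, 10],
          [20, 80]]
means, stds = population_uncertainty(counts, rng=random.Random(1))
```

Restricting the samples to reversible matrices, as done in this study, would be expected to narrow these distributions further.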
A SIMPLE MODEL OF NON-ARRHENIUS, METASTABLE DYNAMICS
SIMPLE POTENTIAL
GE algorithms attempt to overcome the sampling problem by inducing a random walk
in temperature space, where high temperatures help systems cross energetic barriers.
However, it has been shown that GE simulations will provide little improvement when
the folding kinetics are non-Arrhenius, and the dominant barriers are entropic at high
temperatures. In order to demonstrate the efficiency of the ASM in comparison with
the GE algorithms, we introduce a model 2D potential to fully contrast the
convergence of equilibrium statistics from the different algorithms. The model is
based on a discrete-state system introduced by Zwanzig (252) as a simple model for
protein folding, which is similar in spirit to continuous-space models used to study
anti-Arrhenius dynamics by the Levy group (241). These models define an energy
surface reminiscent of a golf-course, which is almost everywhere flat with some bias
toward the folded state and has a sharp decline near the folded state. On the other
hand, the degeneracy of the microstates increases sharply as we move away from the
folded conformation, providing an entropic advantage that stabilizes the unfolded
macrostate at higher temperatures.
The system of Zwanzig (252) was modified by introducing an additional,
uncoupled degree of freedom, which has the effect of creating intermediate states
between the folded and unfolded states. The energy as a function of the two
independent parameters S and R is
$$E_{S,R} = SU + RU - \varepsilon\,\delta_{S0} - \varepsilon\,\delta_{R0} + (2\varepsilon - \varepsilon_0)\,\delta_{R0}\,\delta_{S0} \qquad \text{(H9)}$$
where S = 0, ..., NS and R = 0, ..., NR. The constant U determines the slope of the
energy function as we move away from the folded state along each coordinate; ε
represents the drop in energy when one of the coordinates becomes 0, while ε0 is the
depth of the energy well of the completely folded state, where both S and R equal 0.
The degeneracy of each microstate is given by:
$$g_{S,R} = \binom{N_S}{S} \binom{N_R}{R}\, \nu^{S}\, \nu^{R} \qquad \text{(H10)}$$
With all this information, it is straightforward to analytically derive the partition
function
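$$Q = \sum_{S=0}^{N_S} \sum_{R=0}^{N_R} e^{-\beta E(S,R)}\, g_{S,R} = e^{\beta \varepsilon_0} - e^{2 \beta \varepsilon} + \left( e^{\beta \varepsilon} + (1 + \nu e^{-\beta U})^{N_S} - 1 \right) \left( e^{\beta \varepsilon} + (1 + \nu e^{-\beta U})^{N_R} - 1 \right) \qquad \text{(H11)}$$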
The equilibrium probability of each of the (NR+1)(NS+1) microstates is now easy to
compute by
$$P(S, R) = \frac{e^{-\beta E(S,R)}}{Q} \qquad \text{(H12)}$$
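A closed-form partition function of this kind can be checked numerically against a direct sum over states. The sketch below assumes an energy that is linear in S and R with a drop `eps` when either coordinate is zero and a well depth `eps0` when both are zero, together with a binomial degeneracy carrying a factor `nu` per unit of S or R. The symbol names and the parameter values are hypothetical choices made for the check and are not the values used in this study.

```python
import math


def energy(S, R, U, eps, eps0):
    """Energy of state (S, R) for the assumed golf-course-like surface."""
    e = S * U + R * U
    if S == 0:
        e -= eps
    if R == 0:
        e -= eps
    if S == 0 and R == 0:
        e += 2 * eps - eps0  # fully folded state sits at -eps0
    return e


def degeneracy(S, R, n_s, n_r, nu):
    """Assumed degeneracy: binomial in each coordinate times nu^(S + R)."""
    return math.comb(n_s, S) * math.comb(n_r, R) * nu ** (S + R)


def partition_brute(beta, U, eps, eps0, nu, n_s, n_r):
    """Direct sum of degeneracy times Boltzmann factor over all states."""
    return sum(degeneracy(S, R, n_s, n_r, nu)
               * math.exp(-beta * energy(S, R, U, eps, eps0))
               for S in range(n_s + 1) for R in range(n_r + 1))


def partition_closed(beta, U, eps, eps0, nu, n_s, n_r):
    """Closed form obtained by factorizing the sum over S and R."""
    a = math.exp(beta * eps)
    b_s = (1 + nu * math.exp(-beta * U)) ** n_s - 1
    b_r = (1 + nu * math.exp(-beta * U)) ** n_r - 1
    return (math.exp(beta * eps0) - math.exp(2 * beta * eps)
            + (a + b_s) * (a + b_r))


# Hypothetical parameters chosen only to exercise the formulas.
params = dict(U=1.0, eps=1.0, eps0=3.0, nu=2.0, n_s=7, n_r=7)
q_direct = partition_brute(1.0, **params)
q_closed = partition_closed(1.0, **params)
```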
In the current study, we select parameters =4, =100, =1.5, U=1, and NR =
NS = 7 for our purpose of mimicking the non-Arrhenius folding kinetics. The
Potential of Mean Force (PMF) ($G = -\ln P(S, R)$) at a range of temperatures is
displayed in Figure 85. The PMF plots suggest four metastable macrostates, shown in
Figure 85 separated by black dashed lines (the state decomposition is discussed
below): the folded state where S = R = 0 (state 1), the unfolded state where
S > 0 and R > 0 (state 4), and two intermediate states where either S = 0 (state 2) or
R = 0 (state 3).
Figure 85. Potential of Mean Force (PMF) for the simple potential at β (1/kT) = a. 0.995, b. 0.652, and c.
0.456. In part a, the four metastable macrostates are separated by the dashed black lines and labeled.
As expected, the folded state becomes less stable as we increase the
temperature, while the opposite is true of the unfolded state. This is also shown in
Figure 86, where the equilibrium populations of the four macrostates are plotted as a
function of β = 1/kT. Intermediate states 2 and 3 have low
populations at both low and high temperatures, but reach their maximum values at
medium temperatures with β ≈ 0.65.
Figure 86. Populations of the four macrostates as a function of β = 1/kT.
The potential was equipped with a discrete-time, Metropolis Hastings Monte
Carlo dynamics, where the proposal probabilities are proportional to the state
degeneracy for states where at least one of S and R change by 1, and zero for all
others. A Markovian transition probability matrix T was computed at each
temperature, from which we obtained evidence for non-Arrhenius behavior and
metastability. The non-Arrhenius behavior can be seen in Figure 87, where we plot the
folding and unfolding rates as a function of temperature, computed as the inverse of
the mean first passage times between the folded and unfolded states. The mean first
passage times are computed using the method described by Singhal et al. (37). The
unfolding rate increases with temperature. However, the folding rate decreases with
temperature due to the high entropic barriers for refolding at high temperatures.
Metastability for this system is confirmed by the large gap between the third and
fourth timescales implied by T as shown in Figure 88. At all temperatures, the third
largest timescale is at least a factor of 5 greater than the fourth implied timescale.
Therefore, we confirm that there is a separation of timescales for this system, and it
has four metastable macrostates. The first 3 implied timescales correspond to the
transitions between macrostates, while other shorter implied timescales correspond to
transitions within macrostates. The state decomposition can be obtained with the
spectral clustering algorithm Perron Cluster Cluster Analysis (PCCA) (45), and the
resulting definitions of the four metastable states are shown in Figure 85 (a).
Figure 87. Folding (black) and unfolding (red) rates are plotted as a function of β = 1/kT.
COMPARING EFFICIENCY OF ASM AND GE USING THE SIMPLE POTENTIAL.
To test our hypothesis that GE algorithms, in particular Simulated Tempering (ST),
would exhibit a slower rate of convergence for equilibrium statistics than ASM, we
simulated 1000 trajectories of 6 × 10^6 steps using each method. An optimal list of 10
temperatures with β = 1.1, 0.995, 0.939, 0.89, 0.827, 0.652, 0.554, 0.519, 0.491, and
0.456 is selected for ST to obtain acceptance ratios larger than 40% between all
neighbouring temperatures. The weights (gi) are chosen analytically from the partition
function (177) to enable the system to uniformly sample every temperature:
$$g_i = -\ln Q(\beta_i) \qquad \text{(H13)}$$
An equal number of trajectories was started from each temperature, with
temperature change proposals done every 10 steps of simulation. Two independent
sets of ST simulations are performed with initial state 0 and 4 respectively.
Figure 88. Logarithms of the implied timescales as a function of β for the 2D potential.
The three slowest timescales are plotted using up triangles, down triangles, and crosses,
respectively.
For ASM, we simulated 250 trajectories from each of the 4 macrostates at a
constant temperature of β = 0.995, at which the folded state is the dominant state, in
order to mimic the situation at physiological temperatures.
The convergence of the equilibrium populations from ST was analyzed in the
following way. For a set number of trajectories, we take a window of 50,000 steps,
and compute the fraction of the configurations at a certain metastable state and
temperature β = 0.995 within this window. By bootstrapping this estimator 100 times,
we can determine the distribution of the state populations as a function of simulation
time (see Figure 89). Populations obtained from the two independent sets of ST
simulations converge between 2.5 × 10^5 and 3 × 10^5 steps.
Figure 89. Populations computed from Simulated Tempering (ST) simulations for the four metastable
states are plotted as a function of the length of the simulation. The reference population is shown
by the solid lines, and 1000 trajectories are used for this calculation. The error bars are the standard
deviations obtained from bootstrapping 100 times with replacement.
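The bootstrap estimate of a state population and its spread can be sketched as follows. The sketch assumes that the trajectories have already been reduced to sequences of metastable state labels at the temperature of interest, and that whole trajectories are the resampling unit; neither detail is specified above, and the function names are hypothetical.

```python
import random


def state_fraction(trajectories, state, start, stop):
    """Fraction of frames in the window [start, stop) that are in `state`."""
    hits = 0
    total = 0
    for traj in trajectories:
        window = traj[start:stop]
        hits += sum(1 for s in window if s == state)
        total += len(window)
    return hits / total


def bootstrap_population(trajectories, state, start, stop, n_boot=100, rng=random):
    """Mean and standard deviation of the state fraction over bootstrap samples."""
    estimates = []
    for _ in range(n_boot):
        # Resample trajectories with replacement.
        resampled = rng.choices(trajectories, k=len(trajectories))
        estimates.append(state_fraction(resampled, state, start, stop))
    mean = sum(estimates) / n_boot
    variance = sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
    return mean, variance ** 0.5


# Toy data: half of the trajectories sit in state 1, half in state 0.
trajs = [[1] * 10, [0] * 10] * 20
mean, std = bootstrap_population(trajs, state=1, start=0, stop=10,
                                 rng=random.Random(0))
```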
Similarly for ASM, we obtain a distribution for the equilibrium populations
with different trajectory lengths for a certain number of trajectories, which is computed
by a Bayesian method (251). As shown in Figure 90, it only takes about 4 × 10^4 steps
for ASM to converge to the correct populations, which is much more efficient than
ST. The populations in Figure 90 are computed using a lag time of 1/3 of the trajectory
length. However, we show in Figure 91 that the populations are almost invariant to the
lag time if it is longer than about 1/8 of the trajectory. We note that one has to choose a
proper lag time in order to get a good estimate of the populations. A good lag time has
to be small enough that there are enough transition counts, but not so small that many
of the transition counts are correlated. In our RNA hairpin example, we use a small lag
time but only a few transition counts are taken from each trajectory to make sure we
only consider independent transition events. In that case, we can still estimate
thermodynamic properties accurately even though the model is not Markovian under
the lag time used.
Figure 90. Populations computed from the Adaptive Seeding Method (ASM) for the four metastable
states are plotted as a function of the length of the simulation. The reference population is shown by
the solid lines, and 1000 trajectories are used for this calculation. The lag time is selected as 1/3 of
the length of the simulation. The error bars are the standard deviations obtained from a Bayesian
method (see section 2.5.3 for details).
To compare the efficiency of ASM and ST as a function of the length and number
of trajectories, we define a criterion for convergence as follows: the probability
that the estimated populations for all states are within 5% of the actual equilibrium
populations is bigger than 80%. The population distributions are computed the same
way as in Figure 89 for ST and in Figure 90 for ASM. As shown in Figure 92, ASM is
much more efficient than ST, and can reach the convergence using 4-7 times shorter
simulations than ST. In addition, the efficiency of ST will not increase with the
number of trajectories after 200, while the efficiency of ASM keeps increasing with
number of trajectories up to 600. We think ideally the length of the seeding
simulations should lie in the major gap of the implied timescales, such that they are
longer than the slowest intra-macrostate equilibration time to minimize the model
error due to non-Markovian effects. In the current system, the minimum length of the
simulations (~5 × 10^4) is indeed between the 4th (1.61 × 10^3) and 3rd (9.58 × 10^4)
slowest implied timescales. There is evidence from the RNA hairpin example and previous
work on a water dewetting transition in a carbon nanotube (7) that these requirements
for the lag time may be relaxed for real systems, where the separation of timescales is
less evident than in the model system studied here. Additionally, the number of
seeding simulations has to be big enough to reduce the statistical error to a satisfactory
level.
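The convergence criterion used for this comparison can be written as a short check over sampled population vectors (bootstrap samples for ST or posterior samples for ASM). The sketch interprets "within 5%" as an absolute difference of 0.05 in population, which is an assumption, and the function name is hypothetical.

```python
def is_converged(samples, reference, tolerance=0.05, confidence=0.80):
    """True if more than `confidence` of the sampled population vectors have
    every state within `tolerance` of the reference populations."""
    within = 0
    for populations in samples:
        if all(abs(p - r) <= tolerance for p, r in zip(populations, reference)):
            within += 1
    return within / len(samples) > confidence


reference = [0.5, 0.5]
mostly_close = [[0.52, 0.48]] * 9 + [[0.9, 0.1]]
often_far = [[0.52, 0.48]] * 7 + [[0.9, 0.1]] * 3
```

Applying this check to estimates computed from increasingly long trajectories gives the number of steps needed to reach convergence for a given number of trajectories.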
Figure 91. Populations computed from ASM simulations for four metastable states as a function of lag
time.
Figure 92. Number of steps taken to reach convergence as a function of the number of trajectories.
BIBLIOGRAPHY
1. Schütte C, Fischer A, Huisinga W, & Deuflhard P (1999) A direct approach to conformational dynamics based on hybrid Monte Carlo. J Comput Phys 151:146–168.
2. Bowman GR, Huang X, & Pande VS (2010) Network models for molecular kinetics and their initial applications to human health. Cell Res 20:622-630.
3. Noe F & Fischer S (2008) Transition networks for modeling the kinetics of conformational change in macromolecules. Curr Opin Struct Biol 18:154-162.
4. Bowman GR, Beauchamp KA, Boxer G, & Pande VS (2009) Progress and challenges in the automated construction of Markov state models for full protein systems. J Chem Phys 131:124101.
5. Noe F, Schutte C, Vanden-Eijnden E, Reich L, & Weikl TR (2009) Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proc Natl Acad Sci U S A 106:19011-19016.
6. Chodera JD, Singhal N, Pande VS, Dill KA, & Swope WC (2007) Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J Chem Phys 126:155101.
7. Sriraman S, Kevrekidis IG, & Hummer G (2005) Coarse nonlinear dynamics and metastability of filling-emptying transitions: Water in carbon nanotubes. Phys. Rev. Lett. 95:130603.
8. Gfeller D, De Los Rios P, Caflisch A, & Rao F (2007) Complex network analysis of free-energy landscapes. Proc Natl Acad Sci U S A 104:1817-1822.
9. Schutte C (1999) Conformational Dynamics: Modeling, Theory, Algorithm, and Application to Biomolecules. (thesis, Freie Universitat Berlin).
10. Bowman GR, Huang X, & Pande VS (2009) Using generalized ensemble simulations and Markov state models to identify conformational states. Methods 49:197-201.
11. Sriraman S, Kevrekidis IG, & Hummer G (2005) Coarse master equation from Bayesian analysis of replica molecular dynamics simulations. J Phys Chem B 109:6479-6484.
12. Huang X, Bowman GR, Bacallado S, & Pande VS (2009) Rapid equilibrium sampling initiated from nonequilibrium data. Proc Natl Acad Sci U S A 106:19765-19769.
13. Huang X, et al. (2010) Constructing multi-resolution Markov state models (MSMs) to elucidate RNA hairpin folding mechanisms. Pac Symp Biocomput 15:228-239.
14. Noe F, Horenko I, Schutte C, & Smith JC (2007) Hierarchical analysis of conformational dynamics in biomolecules: transition networks of metastable states. J Chem Phys 126:155102.
15. Sarich M, Noe F, & Schutte C (2010) On the approximation quality of Markov state models. SIAM Multiscale Model Simul, in press.
16. Bowman GR & Pande VS (2010) Protein folded states are kinetic hubs. Proc Natl Acad Sci U S A 107:10890-10895.
17. Rao F & Caflisch A (2004) The protein folding network. J Mol Biol 342:299-306.
18. Bowman GR, Ensign DL, & Pande VS (2010) Enhanced modeling via network theory: adaptive sampling of Markov state models. J Chem Theory Comput 6:787-794.
19. Hinrichs NS & Pande VS (2007) Calculation of the distribution of eigenvalues and eigenvectors in Markovian state models for molecular dynamics. J Chem Phys 126:244101.
20. Roblitz S (2008) Statistical error estimation and grid-free hierarchical refinement in conformation dynamics. (thesis, Freie Universitat Berlin).
21. Mitsutake A, Sugita Y, & Okamoto Y (2001) Generalized-ensemble algorithms for molecular simulations of biopolymers. Biopolymers 60:96-123.
22. Hansmann UH & Okamoto Y (1999) New Monte Carlo algorithms for protein folding. Curr. Opin. Struct. Biol. 9:177-183.
23. Sugita Y & Okamoto Y (1999) Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. 314:141-151.
24. Lyubartsev AP, Martsinovski AA, Shevkunov SV, & Vorontsov-Velyaminov PN (1992) New approach to Monte Carlo calculation of the free energy: Method of expanded ensembles. J. Chem. Phys. 96:1776-1783.
25. Marinari E & Parisi G (1992) Simulated Tempering: a New Monte Carlo Scheme. Europhys. Lett. 19:451-458.
26. Zhou R, Berne BJ, & Germain R (2001) The free energy landscape for beta hairpin folding in explicit water. Proc. Natl. Acad. Sci. USA 98:14931-14936.
27. Rhee YM & Pande VS (2003) Multiplexed-replica exchange molecular dynamics method for protein folding simulation. Biophysical journal 84:775-786.
28. Nymeyer H & Garcia AE (2003) Simulation of the folding equilibrium of alpha-helical peptides: a comparison of the generalized Born approximation with explicit solvent. Proc. Natl. Acad. Sci. USA 100:13934-13939.
29. Zhou R (2003) Trp-cage: folding free energy landscape in explicit water. Proc. Natl. Acad. Sci. USA 100:13280-13285.
30. Krivov SV & Karplus M (2004) Hidden complexity of free energy surfaces for peptide (protein) folding. Proc. Natl. Acad. Sci. U.S.A. 101:14766-14770.
31. Karpen ME, Tobias DJ, & Brooks CL, 3rd (1993) Statistical clustering techniques for the analysis of long molecular dynamics trajectories: analysis of 2.2-ns trajectories of YPGDV. Biochemistry 32:412-420.
32. Shao JY, Tanner SW, Thompson N, & Cheatham TE (2007) Clustering molecular dynamics trajectories: 1. Characterizing the performance of different clustering algorithms. J. Chem. Theory Comp. 3:2312-2334.
33. Buchete NV & Hummer G (2008) Coarse master equations for peptide folding dynamics. J Phys Chem B 112:6057-6069.
34. Swope WC, Pitera JW, & Suits F (2004) Describing protein folding kinetics by molecular dynamics simulations. 1. Theory. J Phys Chem B 108:6571-6581.
35. Frauenfelder H, Sligar SG, & Wolynes PG (1991) The energy landscapes and motions of proteins. Science 254:1598-1603.
36. Yang WY & Gruebele M (2004) Detection-dependent kinetics as a probe of folding landscape microstructure. J Am Chem Soc 126:7758-7759.
37. Singhal N, Snow CD, & Pande VS (2004) Using path sampling to build better Markovian state models: predicting the folding rate and mechanism of a tryptophan zipper beta hairpin. J. Chem. Phys. 121:415-425.
38. Elmer S, Park S, & Pande VS (2005) Foldamer dynamics expressed via Markov State Models: 2. Explicit solvent molecular dynamics simulations in acetonitrile, chloroform, methanol, and water. J. Chem. Phys. 122:124908.
39. Jayachandran G, Vishal V, & Pande VS (2006) Folding Simulations of the Villin Headpiece in All-Atom Detail. J. Chem. Phys. 124:164902.
40. Kelley NW, Vishal V, Krafft GA, & Pande VS (2008) Simulating oligomerization at experimental concentrations and long timescales: A Markov state model approach. J Chem Phys 129:214707.
41. Gonzalez T (1985) Clustering to minimize the maximum intercluster distance. Theo. Comp. Sci. 38:293-306.
42. Dasgupta S & Long PM (2005) Performance guarantees for hierarchical clustering. J. Comput. System Sci. 70:555-569.
43. Lindahl E, Hess B, & van der Spoel D (2001) GROMACS 3.0: a package for molecular simulation and trajectory analysis. J. Mol. Modeling 7:306-317.
44. Deuflhard P & Weber M (2005) Robust Perron cluster analysis in conformation dynamics. Lin. Alg. Appl. 398:161-184.
45. Deuflhard P, Huisinga W, Fischer A, & Schütte C (2000) Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains. Lin. Alg. Appl. 315:39-59.
46. Anfinsen CB, Haber E, Sela M, & White FH, Jr. (1961) The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc Natl Acad Sci USA 47:1309-1314.
47. Klein WL, Stine WB, Jr., & Teplow DB (2004) Small assemblies of unmodified amyloid beta-protein are the proximate neurotoxin in Alzheimer's disease. Neurobiol Aging 25:569-580.
48. Simons KT, Kooperberg C, Huang E, & Baker D (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268:209-225.
49. Bowman GR & Pande VS (2009) Simulated tempering yields insight into the low-resolution Rosetta scoring functions. Proteins 74:777-788.
50. Bolhuis PG, Dellago C, & Chandler D (2000) Reaction coordinates of biomolecular isomerization. Proc Natl Acad Sci U S A 97:5877-5882.
51. Du R, Pande VS, Grosberg AY, Tanaka T, & Shakhnovich ES (1998) On the transition coordinate for protein folding. J Chem Phys 108:334-350.
52. Bowman GR, et al. (2008) Structural insight into RNA hairpin folding intermediates. J Am Chem Soc 130:9676-9678.
53. Dill KA, Ozkan SB, Shell MS, & Weikl TR (2008) The protein folding problem. Annu Rev Biophys 37:289-316.
54. Chodera JD, Swope WC, Pitera JW, & Dill KA (2006) Long-timescale protein folding dynamics from short-time molecular dynamics simulations. Multi Mod Simul 5:1214–1226.
55. Yang S, Banavali NK, & Roux B (2009) Mapping the conformational transition in Src activation by cumulating the information from multiple molecular dynamics trajectories. Proc Natl Acad Sci U S A 106:3776-3781.
56. Kubelka J, Chiu TK, Davies DR, Eaton WA, & Hofrichter J (2006) Sub-microsecond protein folding. J Mol Biol 359:546-553.
57. Chiu TK, et al. (2005) High-resolution x-ray crystal structures of the villin headpiece subdomain, an ultrafast folding protein. Proc Natl Acad Sci USA 102:7517-7522.
58. Ensign DL, Kasson PM, & Pande VS (2007) Heterogeneity even at the speed limit of folding: large-scale molecular dynamics study of a fast-folding variant of the villin headpiece. J Mol Biol 374:806-816.
59. Berendsen HJC, van der Spoel D, & van Drunen R (1995) Gromacs - a Message-Passing Parallel Molecular-Dynamics Implementation. Computer Physics Communications 91:43-56.
60. Wang JM, Cieplak P, & Kollman PA (2000) How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? Journal of computational chemistry 21:1049-1074.
61. Ryckaert JP, Ciccotti G, & Berendsen HJC (1977) Numerical Integration of the Cartesian Equations of Motion of a System with Constraints: Molecular Dynamics of n-Alkanes. J. Comp. Phys. 23:327-341.
62. Miyamoto S & Kollman PA (1992) Settle - an Analytical Version of the Shake and Rattle Algorithm for Rigid Water Models. Journal of computational chemistry 13:952-962.
63. Hoover W (1985) Canonical dynamics: Equilibrium phase-space distributions. Phys. Rev. A 31:1695-1697.
64. Nose S & Klein ML (1983) Constant Pressure Molecular-Dynamics for Molecular-Systems. Molecular Physics 50:1055-1076.
65. Nose S (1984) A Molecular-Dynamics Method for Simulations in the Canonical Ensemble. Molecular Physics 52:255-268.
66. Parrinello M & Rahman A (1981) Polymorphic Transitions in Single-Crystals - a New Molecular-Dynamics Method. Journal of Applied Physics 52:7182-7190.
67. Humphrey W, Dalke A, & Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14:33-38.
68. Schultheis V, Hirschberger T, Carstens H, & Tavan P (2005) Extracting Markov Models of Peptide Conformational Dynamics from Simulation Data. JCTC 1:515-526.
69. Bolhuis PG, Chandler D, Dellago C, & Geissler PL (2002) Transition path sampling: throwing ropes over rough mountain passes, in the dark. Annu Rev Phys Chem 53:291-318.
70. Dill KA, Ozkan SB, Weikl TR, Chodera JD, & Voelz VA (2007) The protein folding problem: when will it be solved? Curr Opin Struct Biol 17:342-346.
71. Plaxco KW, Simons KT, & Baker D (1998) Contact order, transition state placement and the refolding rates of single domain proteins. J Mol Biol 277:985-994.
72. Yang WY & Gruebele M (2003) Folding at the speed limit. Nature 423:193-197.
73. Kubelka J, Hofrichter J, & Eaton WA (2004) The protein folding 'speed limit'. Curr Opin Struct Biol 14:76-88.
74. Udgaonkar JB (2008) Multiple routes and structural heterogeneity in protein folding. Annu Rev Biophys 37:489-510.
75. Pitera JW & Swope W (2003) Understanding folding and design: replica-exchange simulations of "Trp-cage" miniproteins. Proc Natl Acad Sci U S A 100:7587-7592.
76. Zagrovic B, Snow CD, Shirts MR, & Pande VS (2002) Simulation of folding of a small alpha-helical protein in atomistic detail using worldwide-distributed computing. J Mol Biol 323:927-937.
77. Ensign DL & Pande VS (2009) The Fip35 WW domain folds with structural and mechanistic heterogeneity in molecular dynamics simulations. Biophys J 96:L53-55.
78. Horng JC, Moroz V, & Raleigh DP (2003) Rapid cooperative two-state folding of a miniature alpha-beta protein and design of a thermostable variant. J Mol Biol 326:1261-1270.
79. Shirts M & Pande VS (2000) COMPUTING: Screen Savers of the World Unite! Science 290:1903-1904.
80. Friedrichs MS, et al. (2009) Accelerating molecular dynamic simulation on graphics processing units. J Comput Chem 30:864-872.
81. Onufriev A, Bashford D, & Case DA (2004) Exploring protein native states and large-scale conformational changes with a modified generalized born model. Proteins 55:383-394.
82. Shell MS, Ritterson R, & Dill KA (2008) A test on peptide stability of AMBER force fields with implicit solvation. J Phys Chem B 112:6878-6886.
83. Hoffman DW, et al. (1994) Crystal structure of prokaryotic ribosomal protein L9: a bi-lobed RNA-binding protein. EMBO J 13:205-212.
84. Shirts MR & Pande VS (2001) Mathematical analysis of coupled parallel simulations. Phys Rev Lett 86:4983-4987.
85. Ensign DL & Pande VS (2009) Bayesian single-exponential kinetics in single-molecule experiments and simulations. J Phys Chem B 113:12410-12423.
86. Panchenko AR, Luthey-Schulten Z, & Wolynes PG (1996) Foldons, protein structural modules, and exons. Proc Natl Acad Sci U S A 93:2008-2013.
87. Metzner P, Schutte C, & Vanden-Eijnden E (2009) Transition Path Theory for Markov Jump Processes. Multiscale Modeling & Simulation 7:1192-1219.
88. Weikl TR (2008) Loop-closure principles in protein folding. Archives of Biochemistry and Biophysics 469:67-75.
89. Snow CD, Rhee YM, & Pande VS (2006) Kinetic definition of protein folding transition state ensembles and reaction coordinates. Biophys J 91:14-24.
90. Uversky VN (2009) Intrinsic disorder in proteins associated with neurodegenerative diseases. Front Biosci 14:5188-5238.
91. Bowman GR & Pande VS (2009) The roles of entropy and kinetics in structure prediction. PLoS One 4:e5840.
92. Ozkan SB, Wu GA, Chodera JD, & Dill KA (2007) Protein folding by zipping and assembly. Proc Natl Acad Sci U S A 104:11987-11992.
93. Voelz VA, Bowman GR, Beauchamp KA, & Pande VS (2010) Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39). J Am Chem Soc 132:1526-1528.
94. Jackson SE & Fersht AR (1991) Folding of chymotrypsin inhibitor 2. 1. Evidence for a two-state transition. Biochemistry 30:10428-10435.
95. Bryngelson JD, Onuchic JN, Socci ND, & Wolynes PG (1995) Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins 21:167-195.
96. Barrick D (2009) What have we learned from the studies of two-state folders, and what are the unanswered questions about two-state protein folding? Phys Biol 6:15001.
97. Spudich GM, Miller EJ, & Marqusee S (2004) Destabilization of the Escherichia coli RNase H kinetic intermediate: switching between a two-state and three-state folding mechanism. J Mol Biol 335:609-618.
98. Radford SE, Dobson CM, & Evans PA (1992) The folding of hen lysozyme involves partially structured intermediates and multiple pathways. Nature 358:302-307.
99. Kamagata K, Sawano Y, Tanokura M, & Kuwajima K (2003) Multiple parallel-pathway folding of proline-free Staphylococcal nuclease. J Mol Biol 332:1143-1153.
100. Ma H & Gruebele M (2006) Low barrier kinetics: dependence on observables and free energy surface. J Comput Chem 27:125-134.
101. Wales DJ & Scheraga HA (1999) Global optimization of clusters, crystals, and biomolecules. Science 285:1368-1372.
102. Wetlaufer DB (1973) Nucleation, rapid folding, and globular intrachain regions in proteins. Proc Natl Acad Sci U S A 70:697-701.
103. Myers JK & Oas TG (2001) Preorganized secondary structure as an important determinant of fast protein folding. Nat Struct Biol 8:552-558.
104. Krishna MM, Maity H, Rumbley JN, Lin Y, & Englander SW (2006) Order of steps in the cytochrome C folding pathway: evidence for a sequential stabilization mechanism. J Mol Biol 359:1410-1419.
105. Volk M, et al. (1997) Peptide Conformational Dynamics and Vibrational Stark Effects Following Photoinitiated Disulfide Cleavage. J Phys Chem B 101:8607.
106. Sabelko J, Ervin J, & Gruebele M (1999) Observation of strange kinetics in protein folding. Proc Natl Acad Sci U S A 96:6031-6036.
107. Liu F & Gruebele M (2007) Tuning lambda6-85 towards downhill folding at its melting temperature. J Mol Biol 370:574-584.
108. Liu F, et al. (2009) A one-dimensional free energy surface does not account for two-probe folding kinetics of protein alpha(3)D. J Chem Phys 130:061101.
109. Ghosh K & Dill KA (2007) The ultimate speed limit to protein folding is conformational searching. J Am Chem Soc 129:11920-11927.
110. Betancourt MR & Onuchic JN (1995) Kinetics of protein like models: The energy landscape factors that determine folding. J Chem Phys 103:773.
111. Cho SS, Levy Y, & Wolynes PG (2006) P versus Q: structural reaction coordinates capture protein folding on smooth landscapes. Proc Natl Acad Sci U S A 103:586-591.
112. Leopold PE, Montal M, & Onuchic JN (1992) Protein folding funnels: a kinetic approach to the sequence-structure relationship. Proc Natl Acad Sci U S A 89:8721-8725.
113. Nettels D, Gopich IV, Hoffmann A, & Schuler B (2007) Ultrafast dynamics of protein collapse from single-molecule photon statistics. Proc Natl Acad Sci U S A 104:2655-2660.
114. Waldauer SA, et al. (2008) Ruggedness in the folding landscape of protein L. HFSP J 2:388-395.
115. Voelz VA, Singh VR, Wedemeyer WJ, Lapidus LJ, & Pande VS (2010) Unfolded state dynamics and structure of protein L characterized by simulation and experiment. J Am Chem Soc 132:4702-4709.
116. Watts DJ & Strogatz SH (1998) Collective dynamics of 'small-world' networks. Nature 393:440-442.
117. Barabasi AL & Albert R (1999) Emergence of scaling in random networks. Science 286:509-512.
118. Dill KA & Chan HS (1997) From Levinthal to pathways to funnels. Nat Struct Biol 4:10-19.
119. Milgram S (1967) The small world problem. Psychol Today 1:61-67.
120. Chung HS, Louis JM, & Eaton WA (2009) Experimental determination of upper bound for transition path times in protein folding from single-molecule photon-by-photon trajectories. Proc Natl Acad Sci U S A 106:11837-11844.
121. Fersht AR (2002) On the simulation of protein folding by short time scale molecular dynamics and distributed computing. Proc Natl Acad Sci U S A 99:14122-14125.
122. Saven JG, Wang J, & Wolynes PG (1994) Kinetics of Protein-Folding - the Dynamics of Globally Connected Rough Energy Landscapes with Biases. J Chem Phys 101:11037-11043.
123. Wang J, Saven JG, & Wolynes PG (1996) Kinetics in a globally connected, correlated random energy model. J Chem Phys 105:11276-11284.
124. Du R, Pande VS, Grosberg AY, Tanaka T, & Shakhnovich ES (1999) On the role of conformational geometry in protein folding. J Chem Phys 111:10375.
125. Andrec M, Felts AK, Gallicchio E, & Levy RM (2005) Protein folding pathways from replica exchange simulations and a kinetic network model. Proc Natl Acad Sci U S A 102:6801-6806.
126. Kim PS & Baldwin RL (1990) Intermediates in the folding reactions of small proteins. Annu Rev Biochem 59:631-660.
127. Shan B, Eliezer D, & Raleigh DP (2009) The unfolded state of the C-terminal domain of the ribosomal protein L9 contains both native and non-native structure. Biochemistry 48:4707-4719.
128. Kuzmenkina EV, Heyes CD, & Nienhaus GU (2005) Single-molecule Forster resonance energy transfer study of protein dynamics under denaturing conditions. Proc Natl Acad Sci U S A 102:15471-15476.
129. McLeish TC (2005) Protein folding in high-dimensional spaces: hypergutters and the role of nonnative interactions. Biophys J 88:172-183.
130. Pabo CO & Lewis M (1982) The operator-binding domain of lambda repressor: structure and DNA recognition. Nature 298:443-447.
131. Clarke ND, Beamer LJ, Goldberg HR, Berkower C, & Pabo CO (1991) The DNA binding arm of lambda repressor: critical contacts from a flexible region. Science 254:267-270.
132. Huang GS & Oas TG (1995) Submillisecond folding of monomeric lambda repressor. Proc Natl Acad Sci U S A 92:6878-6882.
133. Burton RE, Huang GS, Daugherty MA, Calderone TL, & Oas TG (1997) The energy landscape of a fast-folding protein mapped by Ala-->Gly substitutions. Nat Struct Biol 4:305-310.
134. Ghaemmaghami S, Word JM, Burton RE, Richardson JS, & Oas TG (1998) Folding kinetics of a fluorescent variant of monomeric lambda repressor. Biochemistry 37:9179-9185.
135. Liu F, Gao YG, & Gruebele M (2010) A survey of lambda repressor fragments from two-state to downhill folding. J Mol Biol 397:789-798.
136. Larios E, Pitera JW, Swope W, & Gruebele M (2006) Correlation of early orientational ordering of engineered λ6–85 structure with kinetics and thermodynamics. Chem Phys 323:45-53.
137. Yang WY & Gruebele M (2004) Folding lambda-repressor at its speed limit. Biophys J 87:596-608.
138. Allen LR, Krivov SV, & Paci E (2009) Analysis of the free-energy surface of proteins from reversible folding simulations. PLoS Comput Biol 5:e1000428.
139. Yang WY, Larios E, & Gruebele M (2003) On the extended beta-conformation propensity of polypeptides at high temperature. J Am Chem Soc 125:16220-16227.
140. Hoffmann A, et al. (2007) Mapping protein collapse with single-molecule fluorescence and kinetic synchrotron radiation circular dichroism spectroscopy. Proc Natl Acad Sci U S A 104:105-110.
141. DeCamp SJ, Naganathan AN, Waldauer SA, Bakajin O, & Lapidus LJ (2009) Direct observation of downhill folding of lambda-repressor in a microfluidic mixer. Biophys J 97:1772-1777.
142. Ma H & Gruebele M (2005) Kinetics are probe-dependent during downhill folding of an engineered lambda6-85 protein. Proc Natl Acad Sci U S A 102:2283-2287.
143. Munoz V & Serrano L (1994) Elucidating the folding problem of helical peptides using empirical parameters. Nat Struct Biol 1:399-409.
144. Portman J, Takada S, & Wolynes PG (1998) Variational Theory for Site Resolved Protein Folding Free Energy Surfaces. Phys Rev Lett 81:5237-5240.
145. Burton RE, Myers JK, & Oas TG (1998) Protein Folding Dynamics: Quantitative Comparison between Theory and Experiment. Biochemistry 37:5337–5343.
146. Pande VS (2010) A simple theory of protein folding kinetics. Phys Rev Lett, in submission.
147. Liu F, et al. (2008) An experimental survey of the transition between two-state and downhill protein folding scenarios. Proc Natl Acad Sci U S A 105:2369-2374.
148. He Y, Yeh DC, Alexander P, Bryan PN, & Orban J (2005) Solution NMR structures of IgG binding domains with artificially evolved high levels of sequence identity but different folds. Biochemistry 44:14055-14061.
149. Rhee YM & Pande VS (2006) On the role of chemical detail in simulating protein folding kinetics. Chem Phys 323:66-77.
150. Bradley P, Misura KM, & Baker D (2005) Toward high-resolution de novo structure prediction for small proteins. Science 309:1868-1871.
151. Das R, et al. (2007) Structure prediction for CASP7 targets using extensive all-atom refinement with Rosetta@home. Proteins 69:118-128.
152. Klepeis JL, Lindorff-Larsen K, Dror RO, & Shaw DE (2009) Long-timescale molecular dynamics simulations of protein structure and function. Curr Opin Struct Biol 19:120-127.
153. Geyer CJ (1992) Practical Markov Chain Monte Carlo. Stat. Sci. 7:473-511.
154. King RD, et al. (2009) The automation of science. Science 324:85-89.
155. Pande VS, et al. (2003) Atomistic protein folding simulations on the submillisecond time scale using worldwide distributed computing. Biopolymers 68:91-109.
156. Faradjian AK & Elber R (2004) Computing time scales from reaction coordinates by milestoning. J Chem Phys 120:10880-10889.
157. Rogal J & Bolhuis PG (2008) Multiple state transition path sampling. J Chem Phys 129:224107.
158. MacKay DJC (2003) Information theory, inference, and learning algorithms (Cambridge University Press, Cambridge, UK ; New York) p 34.
159. Shell MS (2008) The relative entropy is fundamental to multiscale and inverse thermodynamic problems. J. Chem. Phys. 129:144108.
160. Cover TM & Thomas JA (2006) Elements of information theory (Wiley-Interscience, Hoboken, N.J.) 2nd Ed pp xxiii, 748 p.
161. Singhal N & Pande VS (2005) Error analysis and efficient sampling in Markovian state models for molecular dynamics. J Chem Phys 123:204909.
162. Baker D (2006) Prediction and design of macromolecular structures and interactions. Philos Trans R Soc Lond B Biol Sci 361:459-463.
163. Misura KM & Baker D (2005) Progress and challenges in high-resolution refinement of protein structure models. Proteins 59:15-29.
164. Schueler-Furman O, Wang C, Bradley P, Misura K, & Baker D (2005) Progress in modeling of protein structures and interactions. Science 310:638-642.
165. Kuhlman B, et al. (2003) Design of a novel globular protein fold with atomic-level accuracy. Science 302:1364-1368.
166. Kortemme T, et al. (2004) Computational redesign of protein-protein interaction specificity. Nat Struct Mol Biol 11:371-379.
167. Ashworth J, et al. (2006) Computational redesign of endonuclease DNA binding and cleavage specificity. Nature 441:656-659.
168. Nauli S, Kuhlman B, & Baker D (2001) Computer-based redesign of a protein folding pathway. Nat Struct Biol 8:602-605.
169. Nauli S, et al. (2002) Crystal structures and increased stabilization of the protein G variants with switched folding pathways NuG1 and NuG2. Protein Sci 11:2924-2931.
170. Qian B, et al. (2007) High-resolution structure prediction and the crystallographic phase problem. Nature 450:259-264.
171. Rothlisberger D, et al. (2008) Kemp elimination catalysts by computational enzyme design. Nature 453:190-195.
172. Simons K, et al. (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 34:82–95.
173. Shortle D, Simons K, & Baker D (1998) Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci USA 95:11158–11162.
174. Lee M, Tsai J, Baker D, & Kollman PA (2001) Molecular dynamics in the endgame of protein structure prediction. J Mol Biol 313:417–430.
175. Chivian D, et al. (2005) Prediction of CASP6 structures using automated robetta protocols. Proteins 61:157–166.
176. Rohl C, Strauss C, Misura K, & Baker D (2004) Protein structure prediction using rosetta. Meth Enzymol 383:66–93.
177. Huang X, Bowman GR, & Pande VS (2008) Convergence of folding free energy landscapes via application of enhanced sampling methods in a distributed computing environment. J Chem Phys 128:205106.
178. McGuffin LJ, Bryson K, & Jones DT (2000) The PSIPRED protein structure prediction server. Bioinformatics 16:404-405.
179. Meiler J, Muller M, Zeidler A, & Schmaschke F (2001) Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks. Journal of Molecular Modeling 7:360-369.
180. Karplus K & Hu BR (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics 17:713-720.
181. Ouali M & King RD (2000) Cascaded multiple classifiers for secondary structure prediction. Protein Science 9:1162-1176.
182. Bystroff C, Simons KT, Han KF, & Baker D (1996) Local sequence-structure correlations in proteins. Current Opinion in Biotechnology 7:417-421.
183. Engh RA & Huber R (1991) Accurate Bond and Angle Parameters for X-Ray Protein-Structure Refinement. Acta Crystallographica Section A 47:392-400.
184. Neria E, Fischer S, & Karplus M (1996) Simulation of activation free energies in molecular systems. Journal of Chemical Physics 105:1902-1921.
185. Dunbrack RL & Cohen FE (1997) Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Science 6:1661-1681.
186. Lazaridis T & Karplus M (1999) Effective energy function for proteins in solution. Proteins-Structure Function and Genetics 35:133-152.
187. Kortemme T, Morozov AV, & Baker D (2003) An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J Mol Biol 326:1239-1259.
188. Morozov AV, Kortemme T, Tsemekhman K, & Baker D (2004) Close agreement between the orientation dependence of hydrogen bonds observed in protein structures and quantum mechanical calculations. Proc Natl Acad Sci USA 101:6946-6951.
189. Park S & Pande VS (2007) Choosing weights for simulated tempering. Phys Rev E Stat Nonlin Soft Matter Phys 76:016703.
190. Shirts M & Chodera J (2008) Statistically optimal analysis of samples from multiple equilibrium states. J Chem Phys 129:124105.
191. Kumar S, Bouzida D, Swendsen RH, Kollman PA, & Rosenberg JM (1992) The Weighted Histogram Analysis Method for Free-Energy Calculations on Biomolecules .1. The Method. J Comp Chem 13:1011-1021.
192. Noble MEM, Musacchio A, Saraste M, Courtneidge SA, & Wierenga RK (1993) Crystal-Structure of the Sh3 Domain in Human Fyn - Comparison of the 3-Dimensional Structures of Sh3 Domains in Tyrosine Kinases and Spectrin. Embo Journal 12:2617-2624.
193. Derrick JP & Wigley DB (1994) The third IgG-binding domain from streptococcal protein G. An analysis by X-ray crystallography of the structure alone and in a complex with Fab. J Mol Biol 243:906-918.
194. Cornilescu G, Marquardt JL, Ottiger M, & Bax A (1998) Validation of protein structure from anisotropic carbonyl chemical shifts in a dilute liquid crystalline phase. Journal of the American Chemical Society 120:6836-6837.
195. Heurgue-Hamard V, et al. (2006) The zinc finger protein Ynr046w is plurifunctional and a component of the eRF1 methyltransferase in yeast. Journal of Biological Chemistry 281:36140-36148.
196. Yang JS, Chen WW, Skolnick J, & Shakhnovich EI (2007) All-atom ab initio folding of a diverse set of proteins. Structure 15:53-63.
197. Yang JS, Wallin S, & Shakhnovich EI (2008) Universality and diversity of folding mechanics for three-helix bundle proteins. Proc Natl Acad Sci U S A 105:895-900.
198. Moult J (2005) A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 15:285-289.
199. Das R & Baker D (2008) Macromolecular modeling with rosetta. Annu Rev Biochem 77:363-382.
200. Shmygelska A & Levitt M (2009) Generalized ensemble methods for de novo structure prediction. Proc Natl Acad Sci U S A.
201. Sugita Y, Kitao A, & Okamoto Y (2000) Multidimensional replica-exchange method for free-energy calculations. J Chem Phys 113:6042-6051.
202. Neale C, Rodinger T, & Pomès R (2008) Equilibrium exchange enhances the convergence rate of umbrella sampling. Chem Phys Lett 460:375–381.
203. Rao F & Caflisch A (2003) Replica exchange molecular dynamics simulations of reversible folding. J Chem Phys 119:4035-4042.
204. Clarke ND, Kissinger CR, Desjarlais J, Gilliland GL, & Pabo CO (1994) Structural studies of the engrailed homeodomain. Protein Sci 3:1779-1787.
205. Tsai CJ, Maizel JV, & Nussinov R (2000) Anatomy of protein structures: Visualizing how a one-dimensional protein chain folds into a three-dimensional shape. Proc Natl Acad Sci USA 97:12038-12043.
206. Haspel N, Tsai CJ, Wolfson H, & Nussinov R (2003) Reducing the computational complexity of protein folding via fragment folding and assembly. Protein Sci 12:1177-1187.
207. Kifer I, Nussinov R, & Wolfson HJ (2008) Constructing templates for protein structure prediction by simulation of protein folding pathways. Proteins 73:380-394.
208. Uhlenbeck OC (1990) Tetraloops and RNA folding. Nature 346:613-614.
209. Jucker FM, Heus HA, Yip PF, Moors EH, & Pardi A (1996) A network of heterogeneous hydrogen bonds in GNRA tetraloops. J Mol Biol 264:968-980.
210. Woese CR, Winker S, & Gutell RR (1990) Architecture of ribosomal RNA: constraints on the sequence of "tetra-loops". Proc Natl Acad Sci USA 87:8467-8471.
211. Varani G (1995) Exceptionally stable nucleic acid hairpins. Annual review of biophysics and biomolecular structure 24:379-404.
212. Marino JP, Gregorian RS, Csankovszki G, & Crothers DM (1995) Bent helix formation between RNA hairpins with complementary loops. Science 268:1448-1454.
213. Pley HW, Flaherty KM, & McKay DB (1994) Model for an RNA tertiary interaction from the structure of an intermolecular complex between a GAAA tetraloop and an RNA helix. Nature 372:111-113.
214. Glück A, Endo Y, & Wool IG (1992) Ribosomal RNA identity elements for ricin A-chain recognition and catalysis. Analysis with tetraloop mutants. J Mol Biol 226:411-424.
215. Ansari A & Kuznetsov SV (2005) Is hairpin formation in single-stranded polynucleotide diffusion-controlled? The journal of physical chemistry B 109:12982-12989.
216. Roth A, et al. (2007) A riboswitch selective for the queuosine precursor preQ1 contains an unusually small aptamer domain. Nat Struct Mol Biol 14:308-317.
217. Sorin EJ, Rhee YM, & Pande VS (2005) Does water play a structural role in the folding of small nucleic acids? Biophys J 88:2516-2524.
218. Kannan S & Zacharias M (2007) Folding of a DNA hairpin loop structure in explicit solvent using replica-exchange molecular dynamics simulations. Biophys J 93:3218-3228.
219. Garcia AE & Paschek D (2008) Simulation of the pressure and temperature folding/unfolding equilibrium of a small RNA hairpin. J Am Chem Soc 130:815-817.
220. Ansari A, Kuznetsov SV, & Shen Y (2001) Configurational diffusion down a folding funnel describes the dynamics of DNA hairpins. Proc Natl Acad Sci USA 98:7771-7776.
221. Jung J & Van Orden A (2006) A three-state mechanism for DNA hairpin folding characterized by multiparameter fluorescence fluctuation spectroscopy. J Am Chem Soc 128:1240-1249.
222. Ma H, Wan C, Wu A, & Zewail AH (2007) DNA folding and melting observed in real time redefine the energy landscape. Proc Natl Acad Sci USA 104:712-716.
223. Ma H, et al. (2006) Exploring the energy landscape of a small RNA hairpin. J Am Chem Soc 128:1523-1530.
224. Hagen M, Kim B, Liu P, Friesner RA, & Berne BJ (2007) Serial replica exchange. J Phys Chem B 111:1416-1423.
225. Menger M, Eckstein F, & Porschke D (2000) Dynamics of the RNA hairpin GNRA tetraloop. Biochemistry 39:4500-4507.
226. Zhao L & Xia T (2007) Direct revelation of multiple conformations in RNA by femtosecond dynamics. J Am Chem Soc 129:4118-4119.
227. Singh G, Mémoli F, & Carlsson G (2007) Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Eurographics Symposium on Point-Based Graphics.
228. Yao Y, et al. (2009) Topological methods for exploring low-density states in biomolecular folding pathways. J Chem Phys 130:144115.
229. Kim J, Doose S, Neuweiler H, & Sauer M (2006) The initial step of DNA hairpin folding: a kinetic analysis using fluorescence correlation spectroscopy. Nucleic Acids Res 34:2516-2527.
230. Pitera JW, Haque I, & Swope WC (2006) Absence of reptation in the high-temperature folding of the trpzip2 beta-hairpin peptide. The Journal of chemical physics 124:141102.
231. Zhang W & Chen SJ (2002) RNA hairpin-folding kinetics. Proc Natl Acad Sci U S A 99:1931-1936.
232. Mohanty S & Hansmann UH (2006) Folding of proteins with diverse folds. Biophys J 91:3573-3578.
233. Liu P, Huang X, Zhou R, & Berne BJ (2006) Hydrophobic aided replica exchange: an efficient algorithm for protein folding in explicit solvent. J Phys Chem B 110:19018-19022.
234. Im W & Brooks CL (2004) De novo folding of membrane proteins: An exploration of the structure and NMR properties of the fd coat protein. Journal of Molecular Biology 337:513-519.
235. Roitberg AE, Okur A, & Simmerling C (2007) Coupling of replica exchange simulations to a non-Boltzmann structure reservoir. J Phys Chem B 111:2415-2418.
236. Pitera JW, Swope WC, & Abraham FF (2008) Observation of noncooperative folding thermodynamics in simulations of 1BBL. Biophysical journal 94:4837-4846.
237. Zhang W, Wu C, & Duan Y (2005) Convergence of replica exchange molecular dynamics. J Chem Phys 123:154105.
238. Periole X & Mark AE (2007) Convergence and sampling efficiency in replica exchange simulations of peptide folding in explicit solvent. J Chem Phys 126:014903.
239. Nymeyer H (2008) How efficient is replica exchange molecular dynamics? An analytic approach. J. Chem. Theory Comput. 4:626-636.
240. Zuckerman DM & Lyman E (2006) A Second Look at Canonical Sampling of Biomolecules Using Replica Exchange Simulation. J. Chem. Theory Comput. 2:1200-1202.
241. Zheng W, Andrec M, Gallicchio E, & Levy RM (2008) Simple continuous and discrete models for simulating replica exchange simulations of protein folding. J Phys Chem B 112:6083-6093.
242. Zheng W, Andrec M, Gallicchio E, & Levy RM (2007) Simulating replica exchange simulations of protein folding with a kinetic network model. Proc Natl Acad Sci U S A 104:15340-15345.
243. Sanbonmatsu KY & Garcia AE (2002) Structure of Met-enkephalin in explicit aqueous solution using replica exchange molecular dynamics. Proteins 46:225-234.
244. Nadler W & Hansmann UH (2007) Dynamics and optimal number of replicas in parallel tempering simulations. Phys Rev E Stat Nonlin Soft Matter Phys 76:065701.
245. Nadler W & Hansmann UH (2007) Optimizing replica exchange moves for molecular dynamics. Phys Rev E Stat Nonlin Soft Matter Phys 76:057102.
246. Hummer G & Kevrekidis IG (2003) Coarse molecular dynamics of a peptide fragment: Free energy, kinetics, and long-time dynamics computations. J Chem Phys 118:10762-10773.
247. Ytreberg FM & Zuckerman DM (2008) A black-box re-weighting analysis can correct flawed simulation data. Proc Natl Acad Sci U S A 105:7982-7987.
248. Levitt M (1972) Folding of nucleic acids. Ciba Found Symp 7:147-171.
249. Schutte C & Huisinga W (2003) Biomolecular conformations can be identified as metastable sets of molecular dynamics. Handbook of numerical analysis:699-744.
250. Schutte C & Huisinga W (2000) Biomolecular conformations as metastable sets of Markov chains. Proceedings of the 18th Annual Allerton Conference on Communication, Control, and Computing:1106-1115.
251. Noe F (2008) Probability distributions of molecular observables computed from Markov models. J Chem Phys 128:244103.
252. Zwanzig R (1995) Simple-Model of Protein-Folding Kinetics. Proceedings of the National Academy of Sciences of the United States of America 92:9801-9804.
253. Brzezniak Z & Zastawniak T (1999) Basic stochastic processes : a course through exercises (Springer, London ; New York) pp x, 225 p.
254. Bacallado S, Chodera JD, & Pande V (2009) Bayesian comparison of Markov models of molecular dynamics with detailed balance constraint. J Chem Phys 131:045106.
255. Van der Spoel D, et al. (2005) GROMACS: Fast, flexible, and free. Journal of computational chemistry 26:1701-1718.
256. Still WC, Tempczyk A, Hawley RC, & Hendrickson T (1990) Semianalytical Treatment of Solvation for Molecular Mechanics and Dynamics. Journal of the American Chemical Society 112:6127-6129.
257. Lovell SC, et al. (2003) Structure validation by C alpha geometry: phi,psi and C beta deviation. Proteins-Structure Function and Genetics 50:437-450.
258. Berezhkovskii A, Hummer G, & Szabo A (2009) Reactive flux and folding pathways in network models of coarse-grained protein dynamics. J Chem Phys 130:205102.
259. Fersht AR (1997) Nucleation mechanisms in protein folding. Current Opinion in Structural Biology 7:3-9.
260. Karplus M & Weaver DL (1976) Protein-Folding Dynamics. Nature 260:404-406.
261. Weber M & Kube S (2005) Robust Perron Cluster Analysis for various applications in computational life science. Computational Life Sciences, Proceedings 3695:57-66.
262. Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, & Kollman PA (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 117:5179-5197.
263. Jorgensen WL, Chandrasekhar J, Madura JD, Impey RW, & Klein ML (1983) Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79:926-935.
264. Darden T, York D, & Pedersen L (1995) A smooth particle mesh Ewald potential. J. Chem. Phys. 103:3014-3021.
265. Hess B, Bekker H, Berendsen HJC, & Fraaije JGEM (1997) LINCS: a linear constraint solver for molecular simulations. J. Comput. Chem. 18:1463-1472.
266. Macke TJ & Case DA (1998) Modeling unusual nucleic acid structures. Molecular Modeling of Nucleic Acids 682:379-393.
267. Duan Y, et al. (2003) A Point-Charge Force Field for Molecular Mechanics Simulations of Proteins Based on Condensed-Phase Quantum Mechanical Calculations. J. Comp. Chem. 24:1999-2012.