
MARKOV STATE MODELS FOR PROTEIN AND RNA FOLDING

A DISSERTATION

SUBMITTED TO THE PROGRAM IN BIOPHYSICS

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Gregory R. Bowman

July 2010

http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/ky974bm1455

© 2010 by Gregory Ross Bowman. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Vijay Pande, Primary Adviser


Russ Altman


Daniel Herschlag

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

ABSTRACT

Understanding the molecular bases of human health could greatly augment our ability

to prevent and treat diseases. For example, a deeper understanding of protein folding

would serve as a reference point for understanding, preventing, and reversing protein

misfolding in diseases like Alzheimer’s. Unfortunately, the small size and tremendous

flexibility of proteins and other biomolecules make it difficult to simultaneously

monitor their thermodynamics and kinetics with sufficient chemical detail. Atomistic

Molecular Dynamics (MD) simulations can provide a solution to this problem in some

cases; however, they are often too short to capture biologically relevant timescales

with sufficient statistical accuracy. We have developed a number of methods to

address these limitations. In particular, our work on Markov State Models (MSMs)

now makes it possible to map out the conformational space of biomolecules by

combining many short simulations into a single statistical model. Here we describe our

use of MSMs to better understand protein and RNA folding. We chose to focus on

these folding problems because of their relevance to misfolding diseases and the fact

that any method capable of describing such drastic conformational changes should

also be applicable to less dramatic but equally important structural rearrangements like

allostery. One of the key insights from our folding simulations is that protein native

states are kinetic hubs. That is, the unfolded ensemble is not one rapidly mixing set of

conformations. Instead, there are many non-native states that can each interconvert

more rapidly with the native state than with one another. In addition to these general

observations, we also demonstrate how MSMs can be used to make predictions about

the structural and kinetic properties of specific systems. Finally, we explain how

MSMs and other enhanced sampling algorithms can be used to drive efficient

sampling.

ACKNOWLEDGMENTS

Thanks to my family and my God for giving me the passion, intellect, and opportunity

to do this work. It is difficult to imagine life without the love, support, and training my

parents, brother, and wife have given me. Graduate school—and life in general—have

been much more enjoyable with the companionship of my beautiful wife Angela.

Thanks to my advisor, Vijay Pande, for being such a superb guide, for creating

such an intellectually invigorating environment, and for being so generous with

resources of all kinds. My lab-mates have also been great. I’m especially indebted to

Xuhui Huang for helping to jump-start my progress by working so closely with me

during my rotation and the early years of my PhD. Sergio Bacallado, Kyle

Beauchamp, John Chodera, Dan Ensign, Imran Haque, Peter Kasson, Yu-Shan Lin,

Paul Novick, and Vince Voelz were all great collaborators. Thanks to Jason Wagoner

and Del Lucent for all the conversations about science, religion, politics, and

philosophy.

Thanks to my committee members, Russ Altman and Dan Herschlag, for

making time to help me along the way. Dan was especially generous in including me

in his group and getting me into the wet-lab. Seb Doniach has also been like a co-

advisor.

Table of Contents

List of tables ........................................................................................................................x

List of figures .....................................................................................................................xi

Introduction .........................................................................................................................1

Chapter 1: Using generalized ensemble simulations and Markov state models to

identify conformational states .......................................................................................6

Abstract..........................................................................................................................6

Introduction ...................................................................................................................6

Description of Method.................................................................................................10

Conclusions .................................................................................................................17

Chapter 2: Progress and challenges in the automated construction of Markov state

models for full protein systems ...................................................................................19

Abstract........................................................................................................................19

Introduction .................................................................................................................20

Materials & Methods...................................................................................................24

Results & Discussion...................................................................................................29

Conclusions .................................................................................................................45

Chapter 3: Molecular simulation of ab initio protein folding for a millisecond

folder NTL9(1-39).......................................................................................................47

Abstract........................................................................................................................47

Introduction .................................................................................................................48



Conclusions .................................................................................................................55

Chapter 4: Protein folded states are kinetic hubs ..............................................................56

Abstract........................................................................................................................56

Introduction .................................................................................................................57


Conclusions .................................................................................................................71

Chapter 5: Atomistic folding simulations of the five helix bundle protein

Lambda_{6-85} ........................................................................................................75

Abstract........................................................................................................................75

Introduction .................................................................................................................76


Conclusions .................................................................................................................84

Chapter 6: Enhanced modeling via network theory: Adaptive sampling of Markov

state models .................................................................................................................86

Abstract........................................................................................................................86

Introduction .................................................................................................................86

Theoretical Underpinnings ..........................................................................................89


Conclusions ...............................................................................................................107

Chapter 7: Simulated tempering yields insight into the low-resolution Rosetta

scoring functions .......................................................................................................108

Abstract......................................................................................................................108

Introduction ...............................................................................................................109

Methods .....................................................................................................................111

Results .......................................................................................................................119

Discussion..................................................................................................................128

Conclusions ...............................................................................................................132

Chapter 8: The roles of entropy and kinetics in structure prediction ..............................133

Abstract......................................................................................................................133

Introduction ...............................................................................................................134

Results & Discussion.................................................................................................136

Conclusions ...............................................................................................................144

Materials & Methods.................................................................................................145

Chapter 9: Structural insight into RNA hairpin folding intermediates............................148

Abstract......................................................................................................................148

Introduction ...............................................................................................................148


Conclusions ...............................................................................................................156

Chapter 10: Rapid equilibrium sampling initiated from non-equilibrium data ...............157

Abstract......................................................................................................................157

Introduction ...............................................................................................................158


Conclusions ...............................................................................................................170

Materials & Methods.................................................................................................171

Appendix A: Estimating transition matrices and equilibrium distributions....................172

Appendix B: The possibility of longer timescales than the implied timescales..............175

Appendix C: Supporting information for chapter 3 ........................................................177

Molecular dynamics simulation ................................................................................177

Markov State Model (MSM) construction ................................................................178

Transition Pathway Theory (TPT) analysis...............................................................178

Structural analysis of macrostate ensembles .............................................................179

Analysis of states along folding pathways: comparison between secondary

structure formation and reaction progress (p_fold) ............................................180

How does NTL9 fold in our simulations? .................................................................181

Appendix D: Supporting information for chapter 4 ........................................................187

Villin MSM ...............................................................................................................187

Simple models ...........................................................................................................191

Appendix E: Supporting information for chapter 5.........................................................199

Simulation Details .....................................................................................................199

MSM Construction and Analysis ..............................................................................199

Appendix F: Supporting information for chapter 6.........................................................210

Appendix G: Supporting information for chapter 9 ........................................................211

Serial Replica Exchange (SREMD) ..........................................................................211

Simulation Details .....................................................................................................211

Topological Method (Mapper) for Pathway Analysis...............................................212

PEDFs........................................................................................................................213

Melting Curves ..........................................................................................................214

Appendix H: Supporting information for chapter 10 ......................................................216

Initial Configurations.................................................................................................216

The Convergence of Weights in Simulated Tempering (ST)....................................216

Molecular Dynamics (MD) Simulation Details ........................................................220

Hierarchical K-medoids clustering algorithm ...........................................................220

Markov State Models ................................................................................................221

A simple model of non-Arrhenius, metastable dynamics .........................................226

Bibliography ....................................................................................................................236

LIST OF TABLES

Number Page

Table 1. Exponential fits, MFPT’s, and lag phases (all in units of steps) for

transitioning from the unfolded state(s) to the native state in the three

simple models. ...............................................................................................198

Table 2. Convergence of the weights is shown for representative temperatures Δg

= g_j − g_i obtained from distributed computing simulations starting from

a helical structure (third column) and a coil structure (fourth column) at

different temperature pairs. Differences between free energy

differences Δf_ji = g_j/β_j − g_i/β_i obtained from simulations starting from a

helical structure and a coil structure are displayed in the fifth column.

KT at temperature i is shown in the sixth column. Δf_ji(Helical) −

Δf_ji(coil) (KJ/mol) is smaller than KT (KJ/mol) at all temperature pairs. .....221

Table 3. Metastability (Q) and average self-transition probability <P_ii> between

metastable states for the MSMs built from ST simulations and seeding

simulations.....................................................................................................225

LIST OF FIGURES

Number Page

Figure 1. Schematic of the steps required for building an MSM and obtaining

representative conformations for each state. First, GE data represented

by points are grouped into microstates represented by circles, with

darker circles for more highly populated microstates. Kinetically

related microstates are then lumped together into macrostates, or

metastable states, represented by amorphous shapes. Finally,

representative conformations are obtained by extracting the most

probable conformation from each macrostate. ................................................10

Figure 2. Implied timescales as a function of the lag time. There are two probable

gaps in the implied timescales. If gap one were selected then a

macrostate MSM with four states would be constructed whereas if gap

two were selected a higher resolution MSM with 6 states would be

constructed.......................................................................................................14

Figure 3. Scatter plots of the free energy of each microstate (in kcal/mol) versus its

RMSD. A) The initial 10,000 state model, B) the 30,000 state model,

C) the final 10,000 state model, and D) the final 10,000 state model

except that the average RMSD across five structures in each state is

used instead of the RMSD of the state center..................................................31

Figure 4. Top ten implied timescales for the initial 10,000 state model. ..........................31

Figure 5. Three representative structures for A) the lowest RMSD state in the final

model and B) the most probable state in the final model overlaid with

the crystal structure (red). The phenylalanine core is shown explicitly

for each molecule. ...........................................................................................35

Figure 6. Top ten implied timescales for the final model. A) The implied

timescales at intervals of one ns. B) The implied timescales with error

bars obtained by doing five iterations of bootstrapping at an interval of

five ns. .............................................................................................................38

Figure 7. The average RMSD of each state in the final model versus its left

eigenvector component in the longest timescale transition showing that

this transition corresponds to folding. .............................................................39

Figure 8. Comparison between the time evolution of the native population in the

MSM (blue) and the raw data (black) for the entire dataset. The error

bars represent the standard error......................................................................40

Figure 9. Comparison between the time evolution of the RMSD in the MSM

(blue), the reduced representation (yellow), and the raw data (black) for

A) an example of good agreement and B) an example of the worst case

scenario. The error bars represent one standard deviation in the RMSD. .......42

Figure 10. Improved agreement between the MSM and raw data for the example of

poor agreement from Figure 6B obtained by building the transition

probability matrix from simulations started from this starting structure

alone. The error bars represent one standard deviation in the RMSD.............44

Figure 11. (a) Distributions of RMSD-Cα for native-state simulations of

NTL9(1-39) after 10 µs. The arrows indicate thresholds defined for the native

basin at 3.5Å and 4Å. (b) The number of parallel simulations M(t)

started from unfolded states at 370K that reach time t. (c) Posterior

predictions of the folding rate given the amount of simulation time and

observed folding events for 3.5Å (dashed) and 4Å (solid) thresholds,

using uniform (black) and Jeffreys (gray) priors, using methods from

(85). In red is a Gaussian distribution representing the experimental rate

mean and standard deviation. ..........................................................................50

Figure 12. (a) A snapshot from a folding trajectory (dark blue) achieves an

RMSD-Cα of 3.1Å compared to the native state (cyan). (b) Non-native (top)

and native-like (bottom) hydrophobic core arrangements observed in

low-RMSD conformations of folding trajectories. Highlighted are

sidechains of residues F5 (magenta), V3,V9,V21 (tan), and L30,L35

(pink). ..............................................................................................................51

Figure 13. A 2000-state Markov State Model (MSM) was built using a lag time of

12 ns. Shown is the superposition of the top 10 folding fluxes,

calculated by a greedy backtracking algorithm (see Appendix C). These

pathways account for only about 25% of the total flux, and transit only

14 of the 2000 macrostates (shown labeled a-n, for convenient

discussion). The visual size of each state is proportional to its free

energy, and arrow size is proportional to the inter-state flux. .........................52

Figure 14. The 14 macrostates involved in the top ten folding pathways, plotted

along structural and kinetic reaction coordinates. The balance between

native-like helix and sheet structure is quantified by Q_α – (Q_β12 +

Q_β13)/2 (vertical axis), and progress along the folding reaction is

quantified by the p_fold (committor) value (horizontal axis). It can be

seen that the “unfolded” state (a) contains residual native-like helical

propensity, and that pathways involving various ordering of native-like

helix and sheet formation are possible. ...........................................................54

Figure 15. Q-values, which capture the extent of native-like structures, plotted

versus p_fold (committor) values. The lines are to guide the eye. ..................54

Figure 16. Three representative networks each having unfolded state(s) (U and

U_i), intermediates (I_i), and a native state (N). S has a single pathway, P

has parallel pathways, and H has a heterogeneous unfolded state. .................61

Figure 17. Distributions of the first folding times for the simple networks S, P, and

H are shown in panels A, B, and C respectively. The blue lines are

exponential fits to the data after the initial lag phase. .....................................62

Figure 18. Relaxation of villin from 500 state model. Distributions of the MFPTs

from (A) unfolded states to the native state and (B) between unfolded

states. (C) Relaxation kinetics with a 10:1 signal-noise ratio (black

curve with Gaussian noise) and a single exponential fit (blue curve with

τ≈810 ns). ........................................................................................................64

Figure 19. Schematic diagrams of funnel and native hub models having unfolded

states (U), intermediates (I), and native states (N). (A) A network

description of a folding funnel with nodes corresponding to individual

conformations and a bottleneck near the native state. (B) A native hub

model with metastable nodes. The size of each node in (B) is correlated

with its equilibrium probability and the connectivity falls off as one

moves away from the native state. ..................................................................67

Figure 20. Distance between the final villin MSM and MSMs constructed from

subsets of the data (varying trajectory length and number of

trajectories). Distance is measured by a relative entropy metric (see

Appendix D for details). Black lines are contours of equal amounts of

data. No data was available for the upper-right portion of the graph..............70

Figure 21. (A) The crystal structure of the λ_{1-92} dimer bound to DNA (PDB code

1LMB). (B) A model of λ_{6-85} with the Trp22-Tyr33 pair monitored in

T-jump experiments space-filled. ....................................................................77

Figure 22. One of the 10 millisecond timescale pathways labeled with p_fold values

(the probability of reaching state H before state A). .......................................80

Figure 23. The 500 most populated macrostates with sizes proportional to their

free energies and connections between states if transitions between

them occurred in our simulations. The native state (green state with

green connections) is a hub. The crystallographic state from Figure 22H

is blue, the compact β-sheet state from Figure 22A is red, and the

remaining states are yellow. All of these states have smaller

equilibrium populations and fewer connections than the native state. ............83

Figure 24. Distributions of mean first passage times (MFPTs) between sets of

microstates (A) without weighting the distribution and (B) weighting

each MFPT by the equilibrium probability of the starting state. The

solid line is the distribution of MFPTs from non-native to native

microstates and the dashed line is the distribution of MFPTs between

non-native states. The average MFPT from non-native states to native

ones is about 10 times faster than that between non-native states in (A)

and the difference is even greater in (B). Native microstates were

defined as those in the most populated macrostate. All other microstates

were considered non-native. ............................................................................83

Figure 25. Scaling for adaptive sampling of villin as the number of parallel

simulations (N) used during each round is varied. (A) Wall-clock time

scaling as N is varied. The black line is a best fit to the linear portion of

the data (circles), which extends up to 5,000 simulations per iteration.

(B) Computer time required to achieve a given model quality (relative

entropy) for various sampling schemes. L refers to one long trajectory

and the numbers refer to the number of parallel simulations used in

each iteration of adaptive sampling. All results come from averaging

over ten independent runs. Each step equates to 15 ns....................................98

Figure 26. (A) The two models, S and P. (B) Distance from the true model

(measured via the relative entropy) as a function of wall-clock time for

adaptive sampling versus one long simulation of S (assuming 5

steps/day to mimic 5 nanoseconds/day in protein folding simulations).

The lines are one long simulation (dashed line) and adaptive sampling

with 10 simulations of 20 steps (solid line), 10 simulations of 200 steps

(dotted line), 100 simulations of 20 steps (dash-dot line), and 1000

simulations of 20 steps (black squares) per iteration.....................................100

Figure 27. Relative entropy (top) and free energy of each state in kcal/mol

(bottom) as a function of the adaptive sampling iteration on model S..........102

Figure 28. Distance from the true model (measured via the relative entropy) as a

function of the number and length of simulations averaged over 10

independent samples. (A) Reference distribution for S, (B) adaptive

sampling of S, (C) reference distribution for P, and (D) adaptive

sampling of P. All simulations for the reference distributions started

from state 1. The first 10 simulations for adaptive sampling started

from state 1 and subsequent batches of simulations started from the

state contributing most to uncertainty in the slowest process. Black

lines are contours of equal amounts of data. .................................................103

Figure 29. Scaling for adaptive sampling of our simple models as the number of

parallel simulations (N) used during each round is varied. (A) and (B)

Wall-clock time scaling as N is varied for simple models S and P

respectively. The black line is a best fit to the linear portion of the data

(circles). (C) and (D) Computer time required to achieve a given model

quality (relative entropy) for various sampling schemes applied to S

and P respectively. L refers to one long trajectory and the numbers refer

to the number of parallel simulations used in each iteration of adaptive

sampling. All results come from averaging over ten independent runs. .......105

Figure 30. Flow chart showing the order the scoring functions are used in and

giving brief descriptions of each. After score5, Rosetta returns to

score2 five times before progressing to score3. The first six scoring

functions constitute the low-resolution de novo structure prediction

phase. .............................................................................................................113

Figure 31. Score versus RMSD (Å) for an SH3 domain (PDB code 1shf). Each

diamond represents the lowest scoring structure for a single run. Data

for ST is shown in blue while data for standard Rosetta is shown in red.

The black “+” symbols represent models obtained by idealizing and

relaxing the crystal structure in low-resolution mode. ..................................120

Figure 32. Score versus RMSD (Å) for protein G (PDB code 1igd). Each diamond

represents the lowest scoring structure for a single run. Data for ST is

shown in blue while data for standard Rosetta is shown in red. Panel

(A) shows results from the low-resolution phase. The black “+”

symbols represent models obtained by idealizing and relaxing the

crystal structure in low-resolution mode. Panel (B) shows results from

the full-atom phase. The yellow circles represent models obtained by

idealizing and relaxing the crystal structure in full-atom mode. The

black “*” symbols are full-atom models obtained by relaxing the

low-resolution structures depicted by “+” symbols in (A) using the

full-atom scoring functions. ...........................................................................121

Figure 33. Evolution of the score4 weights for protein G. The dashed line is the

difference between the weights of the highest two temperatures: 10 and

20 kT. The solid line is the difference between the weights of the

lowest two temperatures: 0.1 and 0.25 kT. The first points come from

constant temperature runs and subsequent points represent each

iteration of refining the weights. Δg = g_j − g_i, where j > i. ...........................123

Figure 34. Projections of the free energy landscape onto score versus RMSD (Å)

for protein G in score4 using: (A) standard Rosetta runs starting from

an extended chain, (B) standard Rosetta runs starting from the native

state, (C) ST runs at 0.1 kT starting from an extended chain, (D) ST

runs at 0.1 kT starting from the native state, (E) ST runs at 2 kT starting

from the native state, (F) ST runs at 20 kT starting from the native

state. Each white plus-sign corresponds to the lowest scoring structure

for a single run. The lowest scoring structures from each run were

sorted by RMSD and only every twentieth point is shown so as to give

the entire range without obscuring the underlying plot.................................124

Figure 35. Projections of the free energy landscape onto score versus RMSD (Å)

for protein G. Each white plus-sign corresponds to the lowest scoring

structure for a single run. The lowest scoring structures from each run

were sorted by RMSD and only every twentieth point is shown so as to

give the entire range without obscuring the underlying plot. (A), (D),

(G), and (J) show data from standard Rosetta runs with frequent

recovery of the lowest scoring structure in score1, score2, score5, and

score3 respectively. (B), (E), (H), and (K) show data from standard

Rosetta runs without frequent recovery of the lowest scoring structure

in score1, score2, score5, and score3 respectively. (C), (F), (I), and (L)

show data from ST runs at 0.1 kT without frequent recovery of the

lowest scoring structure in score1, score2, score5, and score3,

respectively....................................................................................................127

Figure 36. Time evolution of the Cα RMSD of the current umbrella center for five
representative simulations demonstrating the presence of reversible
folding............................................................................................................137

Figure 37. Average energy (<∆E>), conformational entropy (<∆S>), and free

energy (<∆F>) as a function of Cα RMSD for protein G and engrailed
homeodomain (EH). ......................................................................................138

Figure 38. Average free energies (<∆F>) as a function of Cα RMSD for
temperatures of 0.5 and 0.1 for protein G and engrailed homeodomain
(EH). The black lines are the hypothesized free energy at the given
temperature and the dash-dot lines are the free energy at temperature
0.8 shown for reference. ................................................................................140

Figure 39. (A) The native structure of protein G and (B) the 5.7 Å starting

structure used for comparing the ST and Standard Rosetta variants.............142

Figure 40. Distribution of the minimum Cα RMSD values reached by 100
Simulated Tempering (ST) and 100 standard Rosetta runs started from
a 5.7 Å structure. Results for both the low temperature and standard
Rosetta variants were identical so only a single plot is shown......................142

Figure 41. Relative magnitude of the average hydrogen bonding energy (solid line)

versus the total average energy (dash-dot line) as a function of Cα

RMSD for protein G and engrailed homeodomain (EH). .............................143

Figure 42. (A) NMR structure of the GCAA tetraloop. (B) Contact map for the

native state. Bases are numbered from 5’ to 3’ and native base-pair

contacts (dotted lines) are numbered 1-4.......................................................149

Figure 43. The probability of a given number of native contacts during (A)

unfolding and (B) refolding. (C) The probability of each contact when a

given number of contacts are present during unfolding and refolding

with the arrows representing the direction of movement between the

unfolded state (U) and the folded state (F). ...................................................153

Figure 44. Contact maps representing the cluster centers from independent

clustering of the unfolding (A) and refolding data (B). The grey lines

represent the connectivity of the states. The blue lines represent native

contacts with a probability of 0.6 or greater within the cluster.

Intermediate structures are labeled A-D........................................................153

Figure 45. Representative full-atom structures for the intermediate states with

labels (A)-(D) corresponding to the labels A-D in Figure 3. ........................155

Figure 46. A schematic free energy landscape with three representative seeding

trajectories started from each basin and a projection of this free energy

landscape onto a 2D plane showing the division into metastable states. .......161

Figure 47. Schematic of the adaptive seeding scheme. The top arrow represents

our ST trajectories, which are split into equilibration (green) and

production (light blue) phases. The light red and light yellow boxes

encompass our long and short adaptive seeding schemes respectively.

For each adaptive seeding scheme, the dotted lines demark the portion

of the ST data used to identify the dominant thermodynamic, or

metastable, states by building an MSM (S). Constant temperature (or

canonical, NVT) simulations are then started from each state and used

to build a new MSM (E) that captures the equilibrium distribution.

Both the light yellow and red boxes also encompass a portion of the

original ST data that is equivalent to the amount of sampling used in

the adaptive seeding scheme. An MSM is also built for this data and

used as a baseline for judging the efficiency of the adaptive seeding

scheme. ..........................................................................................................163

Figure 48. Population of each state (bar graphs correspond to the mean values, and

error bars stand for standard deviations) for (A) the long adaptive

seeding scheme (lag time t=4.5 ns) and (B) the short adaptive seeding

scheme (lag time t=4.5 ns).............................................................................165

Figure 49. Population of each state for the long adaptive seeding scheme as the lag

time is varied. ................................................................................................166

Figure 50. Representative structure for each of the six metastable states. The

numbering is the same as in Figures 48 and 49.............................................170

Figure 51. Graph depiction of the model system defined in Appendix B with edges

labeled by A) their probability and B) their average timescale under a

two-state assumption. ....................................................................................176

Figure 52. (a) Implied timescales for a series of 100,000-microstate Markov State

Models (MSMs) built at lag times between 1 and 32 ns. As the longest

timescale levels off beyond a lag time of 10 ns, a lag time of 12 ns was

chosen to build subsequent MSMs. The spectral gap present at all lag

times indicates apparent two-state folding kinetics. (b) The implied

timescales for a 2000-macrostate model built by lumping states from

the microstate MSM show a similar spectral gap and leveling off of

time scales. The faster implied timescales of the macrostate model at

short lag times are due to lumping effects. (c) The 10 slowest implied

timescales for the 2000 state models, with error analysis from a

bootstrapping procedure. Error bars represent the standard deviation

from the bootstrap analysis............................................................................183

Figure 53. A scatter plot of the 2000 macrostates obtained by lumping the

100,000-state MSM calculated from the simulation data at 370K. The

RMSD-to-native is calculated using the peptide backbone residues,

with respect to the native starting state. The free energy of each
microstate i is computed as -kT ln(p_i/p_0), where p_i is the equilibrium
probability of the microstate, and p_0 is an arbitrary reference (in this
case, max(p_i)). Shown in red are the 14 macrostates transited by the top
ten pathway fluxes, labeled with the same letters as in Figure 13. In this
mesoscopic view, we find that 1) the macrostates are diffuse collections
of conformational states, 2) there are multiple folding pathways along
these metastable states, and 3) we can identify highly populated
“native” (state n) and “unfolded” (state a) macrostates that dominate
the observed relaxation rates. The red arrow is meant to guide the eye in
illustrating a “mesoscopic” view of the transition state barrier: the
“unfolded” state (a) and “native” state (n) are at free energy minima,
while intermediate RMSD values have macrostates with higher free
energies..........................................................................................................184

Figure 54. Contact profile subspaces used to calculate Q_12, Q_13, and a third Q
value, which quantify the extent of native-like structuring for pairing of
beta-strands 1 and 2, pairing of beta-strands 1 and 3, and helix formation,
respectively....................................................................................................184

Figure 55. Here, contact profiles (see definition above) for the 14 macrostates

involved in the top ten folding pathways are plotted in a similar fashion

to Figure 55. For clarity, the pathway arrows have been removed. Each

contact profile is a 39 x 39 matrix of inter-residue contacts, showing the

contact fraction on a linear grayscale from 0 (white) to 1 (black). ...............185

Figure 56. Here, values of Q_12 (yellow), Q_13 (red), and a third Q value (blue) are
plotted in a bar graph for each of the 14 macrostates involved in the top ten
folding pathways. The layout is in a similar fashion to Figure 56.............................185

Figure 57. Macrostates l, m and n (the “native” state) have very similar structural

ensembles and similar p_fold values (p_fold > ~0.93). To examine the
subtle differences in their macrostate contact profiles, we computed

difference contact profiles for (l-m), (n-l) and (n-m) transitions. These

difference maps reveal that these states differ mostly in their hairpin

registrations and packing of the hairpin loop. ...............................................186

Figure 58. Implied timescales for the villin macrostate MSM........................................194

Figure 59. Distribution of MFPTs between all pairs of non-native states for villin

(A) on a linear scale to demonstrate the peak does not shift significantly

relative to the distribution shown in Figure 18B and (B) on a log scale

to highlight that the tail of the distribution does extend to about 60 ns. .......194

Figure 60. Distributions of the MFPTs (A) from each non-native state to the native

state and (B) between every pair of non-native states for our 2,000 state

NTL9(1-39) model. As discussed in Ref (93), further refinement of

this model is likely necessary. However, we do not expect the

qualitative trend of long timescales (relative to folding) for

transitioning between unfolded states to change. ..........................................195

Figure 61. Two conformations from different unfolded basins demonstrating the

structural heterogeneity of non-native states (especially in their non-

native contacts) that, in combination with the vastness of

conformational space, result in slow transitions between unfolded

states. The structures are colored red to blue from the N-terminus to

the C-terminus. Atoms for residues Arg 14, Trp 23, and Lys 32 are

shown to highlight that 23 and 32 are in contact on the left while the

chain has rearranged such that 14 and 32 are in contact on the right.

These images were made with VMD (67).....................................................195

Figure 62. Relaxation of the fraction folded starting from equally populated

unfolded states (black is data and blue is single exponential fit with

τ≈810 ns). The beginning of the curve is dominated by single

exponential relaxation but deviations from this apparent two-state

behavior become apparent later.....................................................................196

Figure 63. Relaxation of the fraction unfolded for a villin model at the microstate

level (thick black line) and a biexponential fit (thin blue line) with time

constants of ~60 and ~415 ns, at least qualitatively consistent with time

constants of ~70 and ~720 ns from experiment (56). We hope to

explain this behavior in a future work on villin. As in Ref. (4), the

native state was defined as all microstates with an average Cα RMSD to
the crystal structure less than 3 Å..................................................................197

Figure 64. The distance to the gold-standard model, measured via the relative

entropy, for 40,000 trajectories up to 400 nanoseconds in length. The

black lines are contours of equal amounts of data. Again, there was

insufficient data to resolve the upper right-hand corner of the plot. .............198

Figure 65. Implied timescales for the full 370 K dataset. ...............................................202

Figure 66. Implied timescales for the 300 K dataset. ......................................................202

Figure 67. Implied timescales for ¾ of the 370 K dataset selected at random................203

Figure 68. A coarse-grained view of the slowest transition with state sizes

proportional to the free energy and arrow widths proportional to the

flux (see key in figure). .................................................................................203

Figure 69. Another coarse-grained view of the slowest transition with state sizes

proportional to the free energy and arrow widths proportional to the

flux (see key in figure). Here the states are laid out in terms of the

average number of β-sheet residues (calculated from 100 random

conformations from each state) and the p_fold (probability of reaching the
crystallographic state in L before the compact β-sheet state in A)................204

Figure 70. Free energy projections of the microstate MSM onto typical order

parameters like the radius of gyration (Rg), the Cα RMSD to the crystal
structure, and the distance between the Trp22 and Tyr33 residues.
Differences between the two panels highlight the difficulty in
interpreting such projections. ........................................................................206

Figure 71. Free energy projection of the microstate MSM onto Pfold and the

distance between the Trp22 and Tyr33 residues. Obtaining projections

onto kinetic order parameters like Pfold is greatly simplified with

MSMs. In this case Pfold refers to the probability of reaching the

crystallographic state before reaching the compact β-sheet state (i.e. the

slow transition from Figure 21). Unlike the projections in, this one

hints that D14A may not be well described by a simple two- or three-

state model or that the Trp22-Tyr33 distance is not a good reaction

coordinate, since there are a broad range of Pfold values possible for a

given Trp-Tyr distance. Indeed, analysis of the MSM reveals that

D14A is best described by a native hub. .......................................................206

Figure 72. The ten most populated macrostates with their equilibrium probabilities. ....207

Figure 73. Relaxation of the fraction unfolded with different observables and

observation times. The thick black curves come from the MSM and the

thin blue curves from biexponential fits to the MSM relaxation. The top

row shows relaxation of the fraction unfolded measured by the Trp22-

Tyr33 distance (A) starting from all states being equally populated and

(B) starting from all non-native states being equally populated. The

bottom row shows relaxation of the fraction unfolded measured by the

Cα RMSD to the crystal structure (C) starting from all states being
equally populated and (D) starting from all non-native states being
equally populated. Fitting parameters are given in the figure (in units of
microseconds). In this case, the fitting parameters are relatively
independent of the observable and starting distribution................................207

Figure 74. Relaxation of the fraction unfolded with different observables and

observation times from an MSM built without the trajectories started

from β-sheet structures. The thick black curves come from the MSM

and the thin blue curves from biexponential fits to the MSM relaxation.

The top row shows relaxation of the fraction unfolded measured by the

Trp22-Tyr33 distance (A) starting from all states being equally

populated and (B) starting from all non-native states being equally

populated. The bottom row shows relaxation of the fraction unfolded

measured by the Cα RMSD to the crystal structure (C) starting from all
states being equally populated and (D) starting from all non-native
states being equally populated. Fitting parameters are given in the
figure (in units of microseconds). In this case the fitting parameters are
more dependent on the observable, consistent with the experimental
observation of probe dependent kinetics. ......................................................208

Figure 75. Projection of the free energy onto pfold (A) from the compact β-sheet

state in Figure 22A to the native state in Figure 22H, (B) from the

extended state in Figure 22E to the native state in Figure 22H, and (C)

from the extended state in Figure 22E to the native state in Figure 22G.

None are purely downhill, though some may be consistent with

incipient downhill folding (i.e. have sufficiently low barriers that there

is a reasonable population at the barrier top that can fold in a downhill

manner in addition to activated folding across the barrier). ..........................209

Figure 76. The helicity of each residue predicted from Agadir (143). The purple,

numbered bars show where the five helices are (the extra purple block

between helices 4 and 5 is a turn)..................................................................209

Figure 77. Uncertainty in the log base 10 of the relative entropies averaged over 10

independent samples of (A) reference simulations of M1 and (B)

adaptive sampling of M1. Black lines are contours of equal amounts of

data. ...............................................................................................................210

Figure 78. Uncertainty in the log base 10 of the relative entropies averaged over 10

independent samples of (A) reference simulations of M2 and (B)

adaptive sampling of M2. Black lines are contours of equal amounts of

data. ...............................................................................................................210

Figure 79. (a) Potential Energy Distribution Functions (PEDFs) generated from

Folding@home data at each of the 56 temperatures used. (b). The

convergence measure averaged over all temperatures as a function of

time. Triangles correspond to using P_final as the reference distribution
and circles correspond to using P_initial as the reference. ................................214

Figure 80. Native contacts melting curve. Only every third temperature is

displayed for clarity. ......................................................................................215

Figure 81. The two initial structures used in this study: A) A near-native

conformation and B) a random coil conformation. .......................................216

Figure 82. Amount of sampling at different temperatures for ST simulations
started from the native (top row) and coil (bottom row) configurations,
computed from different segments of simulation time: 0-0.3 ns, 1.2-1.5 ns,
2.7-3.0 ns, and 8.7-9.0 ns. Uniform sampling is

reached for both sets of ST simulations indicating the weights are

converged. .....................................................................................................220

Figure 83. Three example structures from a single microstate. ......................................224

Figure 84. The largest one hundred implied timescales as a function of the lag time

for (a) ST simulations starting from the coil initial configuration. (b)

The long adaptive seeding microstate MSM. ................................................225

Figure 85. Potential of Mean Force (PMF) for the simple potential at β (1/kT)
values of (a) 0.995, (b) 0.652, and (c) 0.456. In part (a), four metastable
macrostates are separated by the dashed black lines and labeled. ............................................228

Figure 86. Populations of four macrostates as a function of β=1/kT. ................................229

Figure 87. Folding (black) and unfolding (red) rates are plotted as a function of
β=1/kT. ..........................................................................................................230

Figure 88. Logarithms of the implied timescales as a function of β for the 2D
potential. The three slowest timescales are plotted using
up-triangle, down-triangle, and cross points, respectively..............................231

Figure 89. Populations computed from Simulated Tempering (ST) simulations
for the four metastable states are plotted as a function of the length of
the simulation. The reference population is shown by the solid lines and
1000 trajectories are used for this calculation. The error bars are the
standard deviation obtained from bootstrapping 100 times with
replacement....................................................................................................232

Figure 90. Populations computed from the Adaptive Seeding Method (ASM) for the
four metastable states are plotted as a function of the length of the
simulation. The reference population is shown by the solid lines and
1000 trajectories are used for this calculation. The lag time is selected
as 1/3 of the length of the simulation. The error bars are the standard
deviation obtained from a Bayesian method (see section 2.5.3 for
details). ..........................................................................................................233

Figure 91. Populations computed from ASM simulations for four metastable states

as a function of lag time. ...............................................................................234

Figure 92. Number of steps taken to reach convergence as a function of the
number of trajectories......................................................................................235

INTRODUCTION

Molecular kinetics plays fundamental roles in human health and disease. For example,

conformational changes in the ribosome drive translation and many drugs work by

inducing allosteric conformational changes in G protein-coupled receptors. Many

neurological diseases, like Alzheimer’s, are also hypothesized to result from protein

misfolding. Therefore, a deeper understanding of molecular kinetics is crucial for our

ability to comprehend and control human health.

Protein folding is a classic grand-challenge in molecular biophysics because it

is such a dramatic example of molecular kinetics and has important medical

implications. With the recent discovery of structured RNAs, RNA folding has also

become of interest. Folding is the process by which a disordered chain of residues

(either amino acids or nucleotides) spontaneously self-assembles into a specific three-

dimensional shape. The fact that folding happens at all is astounding given the

enormous number of possible conformations a protein or RNA can adopt. For

example, a hypothetical protein with 100 residues, each of which could adopt two

conformations, could fold into over 1,000,000,000,000,000,000,000,000,000,000

different structures. If such a protein visited one conformation/second then reaching

all of them would take over 1,000,000,000,000 times longer than the age of the

universe. Moreover, real proteins have many more degrees of freedom and sometimes

many more residues. Despite this, proteins can often fold in a matter of milliseconds to

seconds and RNA folding is only moderately slower. Therefore, it is reasonable to

conclude that there must be one or more pathways guiding a biomolecule to its

native—or most probable—state. Because folding is such a dramatic conformational

change, any method that could map out the pathways by which protein and RNA

molecules fold would likely be a powerful means of understanding less drastic but

equally important structural rearrangements like allostery, all of which fall into the

general category of molecular kinetics. In addition, accurate models for protein folding

would serve as a reference point for understanding, preventing, and reversing
misfolding diseases.
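The back-of-the-envelope estimate above (100 residues, two conformations each, one conformation visited per second) can be checked with a few lines of arithmetic. This is only an illustrative sketch: the age of the universe (taken here as roughly 13.8 billion years) is an assumed value that does not appear in the text.

```python
# Assumed value, not from the text: age of the universe ~13.8 billion years.
AGE_OF_UNIVERSE_YEARS = 13.8e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600

n_residues = 100
states_per_residue = 2

# Number of distinct conformations: 2**100 (exact integer arithmetic).
n_conformations = states_per_residue ** n_residues

# Time for an exhaustive search at one conformation per second, in years.
search_time_years = n_conformations / SECONDS_PER_YEAR

# How many times longer than the (assumed) age of the universe this is.
ratio = search_time_years / AGE_OF_UNIVERSE_YEARS

print(f"conformations: {n_conformations:.3e}")
print(f"search time / age of universe: {ratio:.2e}")
```

Under these assumptions both quoted bounds hold: the number of conformations exceeds 10^30 and the search time exceeds 10^12 times the age of the universe.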

Many experimental techniques have been developed to probe folding.

Unfortunately, biomolecules are extremely sensitive to their underlying chemical

details and no current experimental method can simultaneously describe the atomic

details of a molecule’s thermodynamics and kinetics. For example, x-ray

crystallography can provide atomistic snapshots of a protein’s structure but gives little

information about its kinetics. FRET, on the other hand, can provide information about

a protein’s structure and dynamics by reporting on the changing distance between two

probes attached to a molecule but is blind to the rest of that molecule’s structure.

Heterogeneity also complicates the interpretation of much experimental data.

Molecular dynamics (MD) simulations are a powerful means of simultaneously

modeling a biomolecule's thermodynamics and kinetics with atomic resolution. In an

MD simulation, one explicitly represents every atom and the bonds between them.

One can then iteratively update the position and velocity of each atom based on the

force exerted on it by the rest of the simulated system. The resulting trajectory is like a

movie taken by zooming in on a single protein (or some other biomolecule).
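The iterative update described above can be sketched for a toy system. The example below is a minimal illustration, not the integrator used in this work: it assumes the velocity Verlet scheme (a common choice that the text does not name) and replaces the force field with a hypothetical one-dimensional harmonic spring in reduced units.

```python
def harmonic_force(x, k=1.0):
    """Force from a hypothetical harmonic spring, F = -k x (stand-in for a force field)."""
    return -k * x


def velocity_verlet(x, v, force, mass=1.0, dt=0.01, n_steps=1000):
    """Propagate one particle in 1D; returns the list of positions and the final velocity."""
    trajectory = [x]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update with the old force
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # recompute the force at the new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update with the new force
        trajectory.append(x)
    return trajectory, v


trajectory, v_final = velocity_verlet(x=1.0, v=0.0, force=harmonic_force)

# Total energy (kinetic + potential) should stay close to its initial value of 0.5.
energy = 0.5 * v_final ** 2 + 0.5 * trajectory[-1] ** 2
print(f"frames: {len(trajectory)}, final energy: {energy:.6f}")
```

The list of positions plays the role of the "movie" described above; in a real simulation each frame would hold the coordinates of every atom.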

Unfortunately, MD has many of its own challenges. First and foremost among

these is the sampling problem. Atomistic MD simulations must take very small

timesteps (on the order of femtoseconds) to avoid unphysical phenomena like atoms

passing through one another. Therefore, a typical computer can only simulate ~5

nanoseconds/day even for a small protein and would take over 500,000 years to

simulate one second. In addition, molecular kinetics are stochastic, so generating a

single long simulation is inadequate for truly understanding processes like protein

folding. Instead, one must witness numerous events to characterize the entire

distribution of pathways by which they can occur. Moreover, even if one could run a

sufficient number of long simulations, the task of analyzing this data and making a

direct connection with experiments would still remain. And, of course, the validity of

the results of any simulation depends on the accuracy of the approximations and

parameters (together referred to as the force field) used to describe the interactions

between atoms. Unfortunately, testing a force field requires obtaining sufficient

sampling and comparing the results to a large body of experimental data, so selecting

(or developing) a good force field is non-trivial at best.
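The sampling estimate quoted above follows directly from the stated throughput. As a short check, assuming only the ~5 nanoseconds/day figure from the text and a 365.25-day year:

```python
NS_PER_DAY = 5.0          # throughput quoted in the text
DAYS_PER_YEAR = 365.25    # assumed length of a year
TARGET_NS = 1e9           # one second expressed in nanoseconds

days_needed = TARGET_NS / NS_PER_DAY
years_needed = days_needed / DAYS_PER_YEAR

print(f"years to simulate one second: {years_needed:,.0f}")
```

The result is a little over half a million years, consistent with the figure of over 500,000 years.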

Networks called Markov state models (MSMs) are one potential solution to

these problems (1-3). An MSM is essentially a map of a molecule’s conformational

space built from MD simulations. That is, like a road map with cities labeled with

populations connected by roads labeled with speed limits, MSMs give the probability

that a protein or other molecule will be in a certain set of conformations (called a

metastable state) connected by edges describing where it can go next and how quickly.

MSMs are typically constructed from simulation trajectories (3-8). Because of

the temporal relationship between conformations in a trajectory, it is possible to group

conformations that can interconvert rapidly into states and then determine the

connectivity between states by counting the number of times a simulation went from

one state to another. By employing these kinetic definitions, one ensures that the

system’s dynamics can be modeled reasonably well by assuming stochastic transitions

between states (1, 3-6, 9-12). Thus, it is possible to perform analyses, such as

identifying the most probable conformations at equilibrium or modeling the relaxation

of some experimental observable, and make a quantitative comparison to (or

predictions of) experiments. In addition, one can naturally vary the temporal and

spatial resolution of an MSM by changing the definition of what it means to

interconvert rapidly or slowly (4, 5, 10, 13, 14), much like zooming in and out on a

Google map. By choosing a long timescale cutoff, one can obtain humanly

comprehensible models with just a few metastable (or long-lived) states that capture

large conformational changes, like folding. Such coarse-grained models are useful for

gaining an intuition for a system. With a short timescale cutoff, on the other hand, one

can obtain a model with many states. By using such high resolution models, one

sacrifices ease of comprehension for more quantitative agreement with experiments (4,

5, 15). Regardless of the resolution, one can also draw on network theory to analyze

MSMs and gain important insights into processes like folding (16, 17). Thus, MSMs

are a powerful way of analyzing simulation data sets.
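The counting procedure described above can be illustrated with a small sketch. It assumes the conformations have already been assigned to discrete states, uses simple row normalization of the counts (which does not enforce detailed balance), and finds the equilibrium populations by repeated propagation. The function names and toy trajectories are hypothetical and are not part of the MSMBuilder package.

```python
def count_transitions(trajectories, n_states, lag=1):
    """Count observed transitions i -> j separated by `lag` frames (sliding window)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for i, j in zip(traj[:-lag], traj[lag:]):
            counts[i][j] += 1
    return counts


def row_normalize(counts):
    """Turn counts into transition probabilities; assumes every state has outgoing counts."""
    return [[c / sum(row) for c in row] for row in counts]


def stationary_distribution(T, n_iter=1000):
    """Propagate a uniform distribution until it stops changing (power iteration)."""
    n = len(T)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        p = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
    return p


# Two toy state-assignment trajectories over three states (hypothetical data).
trajectories = [
    [0, 0, 1, 1, 2, 2, 1, 0, 0, 1],
    [2, 2, 1, 1, 0, 0, 0, 1, 2, 2],
]

counts = count_transitions(trajectories, n_states=3)
T = row_normalize(counts)
populations = stationary_distribution(T)
print(counts)
print([round(p, 4) for p in populations])
```

In the road-map analogy, `populations` gives the city sizes and `T` gives the roads. Changing `lag` changes the time resolution at which transitions are counted.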

MSMs also provide a statistical approach to molecular simulation—and

potentially other problems exhibiting metastability (18). Rather than attempting to

generate one realization of an entire process, one instead decomposes conformational

space into multiple metastable states and seeks to gather statistics on each step of the

process independently and in parallel (e.g. by running many short simulations from

each state and then combining them into a single MSM). Adaptive sampling

algorithms for MSM construction take this statistical approach a step further (12, 18-

20). In adaptive sampling, one first obtains an initial model of the entire process of

interest by any means possible. One then iteratively calculates the contribution of

each step of the process to uncertainties in some observable of interest via Bayesian

statistics and runs numerous parallel simulations of the steps that can lead to the

greatest increases in precision until the desired level of statistical certainty is achieved.

Such an approach was recently shown to lead to dramatic reductions in the statistical

uncertainty in the observable of interest relative to other refinement schemes (19).

More recently, we have shown that it leads to efficient improvement of the global

model quality (18). Once a converged sampling is obtained, MSMs at varying

resolutions can be used to assess the validity of the underlying force field by making

quantitative comparisons to existing data and predictions of new experiments.

Therefore, one can gain new insight into processes like protein folding, or at least

understand and correct errors in the force field.
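The idea of directing new simulations to the least certain steps can be sketched as follows. This is a deliberately simplified stand-in, not the algorithm of the cited adaptive sampling work: it assumes a Dirichlet posterior with a uniform prior over each state's outgoing transition probabilities and scores each state by the summed posterior variance of those probabilities, rather than by its contribution to the uncertainty in a specific observable. The count matrix is hypothetical.

```python
def row_uncertainty(counts_row, prior=1.0):
    """Summed variance of a Dirichlet posterior over one state's transition probabilities.

    For Dirichlet parameters a_j with a0 = sum(a_j), Var[p_j] = a_j (a0 - a_j) / (a0^2 (a0 + 1)).
    """
    alphas = [c + prior for c in counts_row]  # assumed uniform prior pseudocount
    a0 = sum(alphas)
    return sum(a * (a0 - a) for a in alphas) / (a0 ** 2 * (a0 + 1.0))


def pick_next_state(counts, prior=1.0):
    """Return the index of the state whose outgoing probabilities are least certain."""
    scores = [row_uncertainty(row, prior) for row in counts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical transition counts: state 2 has been sampled far less than states 0 and 1.
counts = [
    [90, 10, 0],
    [5, 40, 5],
    [0, 2, 3],
]

next_state, scores = pick_next_state(counts)
print(f"start new simulations from state {next_state}")
```

In an iterative scheme, the new trajectories started from `next_state` would be added to the counts and the selection repeated until the uncertainty falls below a chosen threshold.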

Here we describe how MSMs can be used to understand protein folding

(and related problems in molecular kinetics) and connect to experiments. We begin

with an introduction to MSMs and a software package we developed to automate the

construction of these models from simulation data sets. Next, we describe initial

applications of this software to small model systems (a 35 residue mutant of the villin

headpiece and a 39 residue fragment of NTL9) to test this methodology. We then

describe new insights into protein folding obtained from MSMs and their application

to larger, more biologically relevant systems like λ repressor (an 80-residue protein).

This discussion is followed by an explanation of how MSMs can be used to solve the

sampling problem using adaptive sampling and other enhanced sampling algorithms.

Within this discussion of sampling, we also describe some of the initial applications of

MSMs to RNA folding.

CHAPTER 1: USING GENERALIZED ENSEMBLE SIMULATIONS AND

MARKOV STATE MODELS TO IDENTIFY CONFORMATIONAL STATES

This chapter was taken from: Bowman GR, Huang X, & Pande VS (2009) Using
generalized ensemble simulations and Markov state models to identify conformational
states. Methods 49:197-201.

ABSTRACT

Part of understanding a molecule’s conformational dynamics is mapping out the

dominant metastable, or long-lived, states that it occupies. Once identified, the rates

for transitioning between these states may then be determined in order to create a

complete model of the system’s conformational dynamics. Here we describe the use of

the MSMBuilder package (now available at https://simtk.org/home/msmbuilder/) to

build Markov State Models (MSMs) to identify the metastable states from Generalized

Ensemble (GE) simulations, as well as other simulation datasets. Besides building

MSMs, the code also includes tools for model evaluation and visualization.

INTRODUCTION

Molecular Dynamics (MD) and Monte Carlo (MC) computer simulations have the

potential to complement experiments by elucidating the chemical details underlying

the conformational dynamics of biological macromolecules like proteins and RNA.

Such simulations sample a system’s free energy landscape, which is characterized by

long-lived, or metastable, states separated by large free energy barriers. Thus,

understanding a system’s conformational dynamics can be broken down into two


steps: 1) identifying the long lived, or metastable, states visited by the system and 2)

determining the rates of transitioning between these states. Unfortunately, it is

extremely difficult to adequately sample the conformational space accessible to

biomolecules. Furthermore, even if adequate sampling can be achieved, the resulting

datasets are often quite large and, therefore, difficult to analyze and interpret.

A popular approach to the first step is to use Generalized Ensemble (GE)

algorithms (21-25) to sample the accessible space and then to generate projections of

the free energy landscape onto some set of order parameters to identify the dominant

thermodynamic states (26-29). GE algorithms, such as the Replica Exchange Method

(REM) (22, 23) and Simulated Tempering (ST) (24, 25), achieve broad sampling at

the temperature of interest by performing a random walk in temperature space. Broad

sampling is possible because an energy barrier that is difficult to cross at the

temperature of interest will be flattened out and, therefore, more easily crossed at

higher temperatures. GE algorithms also maintain canonical sampling at every

temperature. Thus, they are a suitable way to sample the accessible space.

Projections of the free energy landscape onto a few order parameters are

frequently used to make sense of the resulting dataset (26-29). Such projections may

be meaningful if an appropriate set of order parameters is chosen; however, this is

quite difficult so there is always the danger of being misled by projections because

meaningful information along other order parameters may be completely lost (3, 30).

For example, structures that fall within the same basin in some projection may have

little structural or kinetic similarity. Thus, choosing a representative conformation for

that basin may be impossible.

Clustering methods, on the other hand, do not have these issues because the

dominant order parameters do not need to be specified in advance. However, most

clustering algorithms group conformations together based solely on their structural

similarity (31, 32), so they may fail to capture important kinetic properties. To

illustrate the importance of integrating kinetic information into the clustering of


simulation trajectories, one can imagine two people standing on either side of a wall.

Geometrically these two individuals may be very close but kinetically speaking it

could be extremely difficult for one to get to the other. Similarly, two conformations

from a simulation dataset may be geometrically close but kinetically distant and,

therefore, a clustering based solely on a geometric criterion would be inadequate for

describing the system’s dynamics.

Here we describe the use of Markov State Models (MSMs) to identify

metastable states in GE datasets, though we note that the MSMBuilder package we

introduce to build MSMs may be applied to any simulation dataset. An MSM may be

thought of as a form of clustering that incorporates kinetic information by grouping

conformations that can interconvert rapidly into the same state and conformations that

cannot interconvert rapidly into different states (3, 6, 9, 11, 33, 34). Thus,

conformations in the same metastable state, which may be thought of as a large free

energy basin, will be grouped together while conformations separated by large free

energy barriers will not.

A biomolecular folding free energy landscape may be thought of as a hierarchy

of basins (35, 36). Since larger basins may contain numerous smaller local minima our

use of the phrase free energy basin above is somewhat ambiguous. To determine what

constitutes a distinct free energy basin an MSM may be represented as a transition

probability matrix where the entry at row i and column j gives the probability of

transitioning from state i to state j during a time Δt, called the lag time. Based on this

matrix one may obtain a series of implied timescales for transitioning between various

regions of phase space and use this information to determine an appropriate number of

metastable states, as explained below. The number of metastable states to be

constructed controls the resolution of the model by determining how large a barrier

must be in order to divide phase space into multiple states.
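
As a minimal sketch of the transition probability matrix described above, the following counts observed transitions at a chosen lag time and normalizes each row. The function name, the trajectories, and the simple row normalization are assumptions made for illustration; they are not taken from MSMBuilder.

```python
def transition_matrix(trajectories, n_states, lag=1):
    """Row-normalized transition probabilities estimated at a given lag time."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        # Count every observed i -> j transition separated by `lag` frames.
        for t in range(len(traj) - lag):
            counts[traj[t]][traj[t + lag]] += 1.0
    matrix = []
    for row in counts:
        total = sum(row)
        # A state never observed as a starting point gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical state-index trajectories, invented for illustration.
trajs = [[0, 0, 1, 1, 0], [1, 1, 1, 0]]
T = transition_matrix(trajs, n_states=2, lag=1)
```

Entry T[i][j] is then the estimated probability of being in state j one lag time after being in state i.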

In the past, MSMs have generally been used to model kinetics and, therefore,

have been built from constant temperature data. For example, MSMs have been used


to model numerous small systems (33, 37, 38) and a few larger ones (39, 40). Since

GE simulations perform a random walk in temperature space they do not have

physical kinetics. However, GE simulations contain the desired canonical ensemble

and therefore the desired free energy barriers. These barriers may be flattened or

distorted at higher temperatures but the barriers at the temperature of interest should

still be sufficient to provide the desired separation of timescales, that is, fast intrastate

transitions and slower interstate transitions. Thus, the pseudo-kinetics of GE

simulations are still sufficient to identify the dominant metastable states.

In the following sections we describe the use of the MSMBuilder package

(now available at https://simtk.org/home/msmbuilder/) to identify the dominant

metastable states in GE datasets, though we note the method may be applied as is to

datasets generated with other algorithms and is easily extensible to completely

different problems. There are four major steps in the procedure: 1) dividing the data

into small sets called microstates based on their structural similarity, 2) lumping

kinetically related microstates together into metastable states (also called macrostates),

3) extracting representative conformations for each state, and optionally 4) calculating

populations of each state to judge convergence. Steps 1-3 are depicted schematically

in Figure 1. The conformations extracted with this method represent the space

explored by the system and thus give insights into its dynamics. The pseudo-kinetics

of the GE simulations may give some indication of the connectivity of these states but

cannot give conclusive results due to the random walk in temperature space. However,

this method may serve as a basis for obtaining both accurate thermodynamics and

kinetics (Huang et al. in preparation).



Figure 1. Schematic of the steps required for building an MSM and obtaining

representative conformations for each state. First, GE data represented by points

are grouped into microstates represented by circles, with darker circles for more

highly populated microstates. Kinetically related microstates are then lumped

together into macrostates, or metastable states, represented by amorphous shapes.

Finally, representative conformations are obtained by extracting the most probable

conformation from each macrostate.

DESCRIPTION OF METHOD

1. DIVIDING THE DATA INTO MICROSTATES

The first step in building an MSM is to divide the data into thousands of microstates

based on their structural similarity (6). For conformational dynamics we measure

structural similarity by the RMSD for some subset of the atoms. While the RMSD

may not be very meaningful for large distances, it does have a kinetic interpretation

for small distances. That is, conformations with very small RMSDs should be able to

interconvert rapidly. Thus, if a microstate is small enough that every member has a

very small RMSD to every other member then one may assume that their structural

similarity implies a kinetic similarity.


However, one must also take care not to generate microstates that are too small

because it is important to see a sufficient number of transitions between them. For

example, if every conformation were put into its own microstate no pair of trajectories

would ever visit the same microstate. Thus, the most meaningful grouping of

microstates would be to group every conformation in the same trajectory together and

no new insight would be gained.

One method for determining an appropriate size for each microstate is to

measure the average RMSD between every pair of temporally adjacent conformations

in each trajectory and to ensure that the diameter of each microstate is no more than

this value (Sun et al. in preparation). Thus, any pair of conformations within a given

microstate will tend to be within one MD step of each other. However, this method

may be overly stringent. We have found that using microstates with an all-heavy-atom

RMSD radius of about 3.0 Å allows us to capture the true equilibrium distribution for

an 8 nucleotide RNA hairpin. Preliminary work in our lab shows that radii on the

order of 2-2.5 Å seem more appropriate for protein systems.
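
The size heuristic above can be sketched as follows. This is an illustration only: Euclidean distance between coordinate tuples stands in for the RMSD, and the function name and the toy trajectories are hypothetical.

```python
import math


def mean_adjacent_distance(trajectories, distance=math.dist):
    """Average distance between temporally adjacent conformations."""
    total, n_pairs = 0.0, 0
    for traj in trajectories:
        # Pair each frame with the frame that follows it in the same trajectory.
        for a, b in zip(traj, traj[1:]):
            total += distance(a, b)
            n_pairs += 1
    return total / n_pairs


# Hypothetical 2-D "conformations"; real data would use an RMSD metric.
trajs = [[(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)], [(1.0, 1.0), (1.0, 2.0)]]
max_diameter = mean_adjacent_distance(trajs)
```

The returned value would serve as an upper bound on the microstate diameter under this heuristic.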

One can use the doFastGromacsClustering executable provided by the

Clusterer component of the MSMBuilder package to divide a dataset into microstates.

At present the Clusterer code is capable of using an approximation of the k-centers

clustering algorithm (41, 42) to divide simulation datasets generated with the Gromacs

software package (43) into some number of microstates. However, it is written in

object-oriented C++, so it is straightforward to add new clustering algorithms,

data types to cluster, distance metrics, and other components.

The approximate k-centers clustering algorithm was chosen as the default

clustering method because it is deterministic, simple, fast, and creates clusters with

approximately equal radii (42). The algorithm works as follows: 1) every point is

initially infinitely far from any cluster center, 2) choose an arbitrary point as the first

cluster center, 3) compute the distance between every point and the new cluster center,

4) assign points to this new cluster center if they are closer to it than the cluster center


they are currently assigned to, 5) declare the point that is furthest from its assigned cluster

center to be the next new cluster center, and 6) repeat steps 3-5 until the desired

number of clusters has been generated. Thus, the algorithm has complexity O(kN)

where k is the number of clusters to be generated and N is the number of data points to

be clustered. An order of magnitude speedup is also made possible by using the

triangle inequality to avoid unnecessary distance computations (Sun et al. in

preparation). This fast version of the algorithm is used by default, though the original

version described above is also available. Besides the cluster definitions, this program

also gives the radius of each microstate and the average and standard deviation of the

RMSD from every member of the microstate to the cluster center.
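
The steps above can be sketched in a few lines. This is a simplified illustration rather than the Clusterer implementation: Euclidean distance stands in for the RMSD, the triangle-inequality speedup is omitted, and the function name and sample points are hypothetical.

```python
import math


def k_centers(points, k, first=0):
    """Approximate k-centers: returns (center indices, cluster index per point)."""
    n = len(points)
    dist = [math.inf] * n      # step 1: every point starts infinitely far away
    assign = [-1] * n
    centers = []
    new_center = first         # step 2: arbitrary first cluster center
    for _ in range(k):
        centers.append(new_center)
        cluster_id = len(centers) - 1
        for i, p in enumerate(points):
            d = math.dist(p, points[new_center])   # step 3: distance to new center
            if d < dist[i]:                        # step 4: reassign if closer
                dist[i] = d
                assign[i] = cluster_id
        # step 5: the point furthest from its assigned center becomes the next center
        new_center = max(range(n), key=lambda i: dist[i])
    return centers, assign


# Hypothetical 2-D points; real input would be conformations compared by RMSD.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers, assign = k_centers(points, k=3)
```

Each pass over the data costs one distance evaluation per point, which is where the O(kN) complexity comes from.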

The arbitrary starting point used by this approximate k-centers clustering

algorithm would be of some concern for small k or if the microstates were our primary

interest. However, we have found that the clustering results are insensitive to the

starting point for large k (e.g. k > 1000). In addition, we are mainly concerned with the

macrostates generated by lumping kinetically related microstates together. The

lumping algorithm described in the next section is fairly insensitive to the exact

boundaries between microstates as long as each microstate is sufficiently small, so the

arbitrary starting point is acceptable for building MSMs.

An attractive feature of this approximate k-centers algorithm is that it yields

clusters of approximately equal volume (as judged by using the maximal distance

between the cluster center and any other point in the cluster as the radius of a sphere)

(42). This property is of value because it means that the population of a cluster is

approximately proportional to its density in phase space. However, we note that

exploiting this interpretation requires some caution as it is unclear how to compute

exact volumes in a high dimensional phase space and, therefore, difficult to measure

densities in phase space precisely. Regardless, this property also allows the boundaries

between metastable states to be well-resolved. Clustering algorithms that do not have

this guarantee may create large clusters in sparse regions of phase space and small

clusters in dense regions. The large clusters in sparse regions of phase space are prone


to violate the assumption that conformations within a microstate are kinetically

related. Therefore, various conformations in the microstate may be most kinetically

related to different metastable states, in which case it will be unclear which macrostate

to group the microstate with.

2. LUMPING MICROSTATES INTO MACROSTATES

Conceivably, one could extract a representative conformation for each microstate to

get an idea of the conformational space explored by the system of interest. However,

this would only be a slight improvement upon examining the raw data itself. Instead, it

is valuable to lump kinetically related microstates together into metastable states, also

called macrostates. The tools for lumping together microstates, as well as for

extracting representative conformations and determining state populations, may be

found in the PythonTools component of the MSMBuilder package.

The first step in generating a set of macrostates is to determine how many of

them to create (6). This task may be accomplished with the

BuildMSMsAsVaryLagTime.py script. This script builds a microstate MSM for

each of a series of lag times. A microstate MSM is just a transition probability matrix

where the entry in row i and column j is the probability that a simulation will be in

microstate j at time t+Δt given that it was in microstate i at time t. A series of implied

timescales are then calculated and printed to a file for each microstate MSM based on

the eigenvalues of the transition probability matrix. These implied timescales

correspond to the timescales for transitioning between different sets of microstates. An

appropriate number of macrostates to build can be determined based on the location of

the major gap in the implied timescales, which should correspond to the largest

separation of timescales within the system. The implied timescales for multiple lag

times are examined because the location of this gap is normally sensitive to the lag

time. Ideally the implied timescales will level out as the lag time increases (34) and

obvious gaps that are robust with respect to the lag time will be apparent, as indicated

in Figure 2. An appropriate number of macrostates is then one more than the number


of implied timescales above the major gap (3, 6). In non-ideal cases the number of

implied timescales above the gap will not level off. In such cases we recommend

erring on the side of having too many macrostates rather than too few. If too many

macrostates are generated then some of the representative conformations may be

redundant (only separated by small barriers), whereas if too few are constructed

important regions of phase space may not be identified.
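
A small sketch of this counting rule follows. It assumes the standard relation between an eigenvalue λ of the transition probability matrix and its implied timescale, τ = -Δt / ln λ, and it assumes that the "major gap" can be taken as the largest ratio between neighboring timescales. Both function names and the eigenvalues are hypothetical, and the script may choose the gap differently (in practice it is usually judged by eye from a plot like Figure 2).

```python
import math


def implied_timescale(eigenvalue, lag_time):
    """Standard relation tau = -lag_time / ln(eigenvalue), for 0 < eigenvalue < 1."""
    return -lag_time / math.log(eigenvalue)


def n_macrostates_from_gap(timescales):
    """One more than the number of implied timescales above the largest gap."""
    ordered = sorted(timescales, reverse=True)
    # Assumption: the major gap is the largest ratio between neighboring timescales.
    ratios = [ordered[i] / ordered[i + 1] for i in range(len(ordered) - 1)]
    n_above = ratios.index(max(ratios)) + 1
    return n_above + 1


# Hypothetical non-unit eigenvalues of a microstate MSM at a lag time of 1 step.
eigenvalues = [0.999, 0.998, 0.995, 0.90, 0.85]
timescales = [implied_timescale(ev, lag_time=1.0) for ev in eigenvalues]
n_states = n_macrostates_from_gap(timescales)
```

Repeating the calculation for several lag times shows whether the chosen gap is robust.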

Figure 2. Implied timescales as a function of the lag time. There are two probable gaps in the implied

timescales. If gap one were selected then a macrostate MSM with four states would be constructed

whereas if gap two were selected a higher resolution MSM with six states would be constructed.

A macrostate MSM with the appropriate number of states may then be built

using the BuildMacroMSM.py script. First, this script uses the Perron Cluster

Cluster Analysis (PCCA) algorithm (44, 45) to lump together kinetically related

microstates. The PCCA algorithm identifies kinetic relationships based on the

eigenvalue/eigenvector structure of the microstate MSM and will not be described in

detail here. This initial lumping is then refined using simulated annealing to maximize

the metastability (6), which is defined as


\[ Q = \sum_{i=1}^{N} T(i,i) \tag{1} \]

where N is the number of macrostates and T is the macrostate MSM transition

probability matrix. In words, the metastability is the sum of the self-transition

probabilities of each macrostate. Thus, the metastability may range from 0 to N.

Maximizing the metastability is a heuristic for maximizing the separation of

timescales (6). During each simulated annealing step a randomly selected microstate is

reassigned to a randomly selected macrostate, the resulting change in metastability is

calculated, and the move is either accepted or rejected based on the Metropolis

criterion.
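
The metastability and a single annealing move can be sketched as below. This is an illustration under stated assumptions, not the BuildMacroMSM.py implementation: the macrostate matrix is formed by summing microstate transition counts, the acceptance probability for a move that lowers Q is taken as exp(ΔQ / temperature), and all names and counts are hypothetical.

```python
import math
import random


def metastability(counts, mapping, n_macro):
    """Sum of macrostate self-transition probabilities, Q = sum_i T(i, i)."""
    macro = [[0.0] * n_macro for _ in range(n_macro)]
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            macro[mapping[i]][mapping[j]] += c
    q = 0.0
    for a in range(n_macro):
        total = sum(macro[a])
        if total:  # an empty macrostate contributes nothing
            q += macro[a][a] / total
    return q


def anneal_step(counts, mapping, n_macro, temperature, rng):
    """Reassign one random microstate; accept by the Metropolis criterion."""
    trial = list(mapping)
    trial[rng.randrange(len(mapping))] = rng.randrange(n_macro)
    dq = metastability(counts, trial, n_macro) - metastability(counts, mapping, n_macro)
    # Q is maximized: improvements are always accepted, losses only sometimes.
    if dq >= 0 or rng.random() < math.exp(dq / temperature):
        return trial
    return mapping


# Hypothetical microstate counts: 0 and 1 exchange rapidly, as do 2 and 3.
counts = [[90, 10, 0, 0], [10, 89, 1, 0], [0, 1, 89, 10], [0, 0, 10, 90]]
good = metastability(counts, [0, 0, 1, 1], 2)
bad = metastability(counts, [0, 1, 0, 1], 2)
```

In this toy case the lumping that keeps rapidly exchanging microstates together gives the higher metastability.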

We recommend using a lag time of one step to build the MSM to maximize the

use of all the data. The resulting state definitions and a longer lag time may then be

used to obtain populations and transition rates. A lag time within the implied timescale

gap should yield a strongly Markovian model. That is, one with a sufficiently large

separation of timescales that the assumption that the state at time t+Δt depends only on

the state at time t is valid.

The main outputs of the BuildMacroMSM.py script are a mapping from

microstates to macrostates and the metastability of this lumping. The mapping from

microstates to macrostates may be used to determine which macrostate each data point

is in using the WriteMacroAssignments.py script or the

doFastGromacsAssign program. In general, the

WriteMacroAssignments.py script should be used as it is faster. Both methods

allow the user to specify a temperature range and will only print out assignments for

conformations within this range. This feature is useful for calculating populations of

states at a given temperature. The mapping may also be used by the

getMacroStateCenters program to get information about each macrostate, such

as the most geometrically central microstate and the average and the standard

deviation of the RMSD between that microstate’s center and the center of every other


microstate in that macrostate. Such information is useful for getting an idea of the size

of each macrostate.

3. EXTRACTING REPRESENTATIVE CONFORMATIONS

There are a number of ways of extracting representative conformations for each

macrostate. A simple way of getting a single conformation is to use the

getMacroStateCenters program as discussed above. However, one must

remember that conformations selected in this manner represent the geometric center of

each macrostate and not necessarily the most probable member of each macrostate.

To understand the distribution of conformations in each macrostate one may

identify the central conformation of each microstate in a given macrostate using the

GetMicroCentersByMacroState.py script. The conformations for a given

macrostate may then be overlaid in a viewer for visual analysis. Such an approach may

be cumbersome if there are too many microstates in each macrostate. One alternative

is to randomly select a reduced number of conformations from each macrostate using

the GetRandomConfsFromEachState.py script. A major shortcoming of these

methods is that they select conformations with a more or less uniform distribution

across the macrostate.

Probably the best way of extracting representative conformations is to use the

GetDensityInfo.py script. This script outputs a list of the microstates in each

macrostate ordered from densest to sparsest. That is, the most probable to the least

probable. Any number of the most probable structures in a given macrostate may then

be selected and overlaid in a viewer to get an idea of the distribution of conformations

within the state.
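
A minimal sketch of such an ordering is given below. It assumes that microstate population is an acceptable proxy for density, which relies on the microstates having approximately equal volume, and it is not the GetDensityInfo.py implementation. The function name and the data are hypothetical.

```python
from collections import Counter, defaultdict


def microstates_by_density(assignments, mapping):
    """For each macrostate, list its microstates from most to least populated."""
    populations = Counter(assignments)
    ranked = defaultdict(list)
    for micro, macro in enumerate(mapping):
        ranked[macro].append(micro)
    for macro in ranked:
        # Ties are broken by microstate index to keep the order deterministic.
        ranked[macro].sort(key=lambda m: (-populations[m], m))
    return dict(ranked)


# Hypothetical data: one microstate index per conformation, and a
# microstate -> macrostate mapping for four microstates.
assignments = [0, 0, 0, 1, 2, 2, 3, 3, 3, 3]
mapping = [0, 0, 1, 1]
ranking = microstates_by_density(assignments, mapping)
```

The first few entries for each macrostate identify the microstates whose central conformations would be overlaid.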

4. JUDGING CONVERGENCE

Unfortunately there is no analytic way of checking that a single set of simulations has

explored the entire accessible space for a given system and, therefore, yielded


representative conformations that accurately describe the conformational dynamics.

To the best of our knowledge, the most effective way to ensure that the entire space

has been explored is to run two distinct sets of simulations started from very different

initial configurations. The populations for each state may then be calculated for each

dataset. If they agree then one can be relatively sure that the entire space has been

explored because the thermodynamics found are independent of the starting

conformation.

One practical consideration is that the same state definition must be used for

both datasets because it is unclear how to compare different MSMs. A common state

definition may be obtained by building a single MSM based on both datasets. The

WriteMacroAssignments.py script or doFastGromacsAssign program

may then be used to independently assign each dataset to this common state definition,

preferably restricting the assignments to the temperature range of interest for GE

datasets, so that the population of each state may be determined. Of course, due to the

stochastic nature of conformational dynamics the two sets of populations are unlikely

to agree exactly. To make a valid comparison the GetMacroMSMPopStats.py

script may be used to obtain error bars on the populations from each dataset. This

script uses a bootstrapping algorithm to approximate the variation in the populations.

If the populations agree within error then the two simulations may be considered to

have converged to the true equilibrium distribution and one may be relatively sure that

the entire accessible space has been explored. Thus, the conformations extracted in

step 3 will provide an accurate depiction of the conformational dynamics of the

system.
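
The comparison can be sketched as follows. This is one possible bootstrapping scheme, assumed for illustration: whole trajectories are resampled with replacement, the standard deviation of the resampled populations is used as the error bar, and two datasets agree if every population differs by no more than the sum of the two error bars. The GetMacroMSMPopStats.py script may differ in all of these choices, and the names and data here are hypothetical.

```python
import random


def bootstrap_populations(trajectories, n_states, n_resamples=200, seed=0):
    """Mean and standard deviation of state populations over resampled data."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_resamples):
        # Resample whole trajectories with replacement.
        chosen = rng.choices(trajectories, k=len(trajectories))
        counts = [0] * n_states
        for traj in chosen:
            for state in traj:
                counts[state] += 1
        total = sum(counts)
        samples.append([c / total for c in counts])
    means = [sum(s[i] for s in samples) / n_resamples for i in range(n_states)]
    stds = [
        (sum((s[i] - means[i]) ** 2 for s in samples) / (n_resamples - 1)) ** 0.5
        for i in range(n_states)
    ]
    return means, stds


def agree_within_error(pop_a, err_a, pop_b, err_b):
    """True if every pair of populations differs by at most the summed errors."""
    return all(abs(a - b) <= ea + eb
               for a, ea, b, eb in zip(pop_a, err_a, pop_b, err_b))


# Hypothetical macrostate assignments for one dataset.
trajs = [[0, 0, 1], [1, 1, 0], [0, 1, 1, 1]]
means, stds = bootstrap_populations(trajs, n_states=2)
```

The same function would be applied to the second dataset, assigned to the common state definition, before calling agree_within_error.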

CONCLUSIONS

Using the MSMBuilder to analyze GE simulations and other datasets will allow

researchers to quickly map out the conformational space explored by biological

macromolecules like RNA, which is the first step to understanding conformational


dynamics. The MSMBuilder may also be used to determine the rates of transitioning

between states in microcanonical and canonical simulations, resulting in a complete

Markov state model for the system’s conformational dynamics. While more

sophisticated algorithms for building MSMs exist (6), they are not likely to provide

much improvement for analyzing GE datasets due to the distortion resulting from high

temperature data. However, the highly extensible object-oriented design of the code should

allow such algorithms to be incorporated easily for use with other datasets.

Incorporating other data types, clustering methods, distance metrics, and analysis tools

should also be straightforward. In particular, this software serves as a foundation for

automating adaptive sampling algorithms (19), which promise to allow the maximal

use of one’s computing resources by focusing sampling on regions of uncertainty.

Finally, the results of applying this method to GE datasets may be used as a basis for

determining the rates of transitioning between states (Huang et al. in preparation),

thereby giving a complete picture of a system’s dynamics.


CHAPTER 2: PROGRESS AND CHALLENGES IN THE AUTOMATED

CONSTRUCTION OF MARKOV STATE MODELS FOR FULL PROTEIN

SYSTEMS

This chapter was taken from: Bowman GR, Beauchamp KA, Boxer G, & Pande VS

(2009) Progress and challenges in the automated construction of Markov state models

for full protein systems. Journal of Chemical Physics 131:124101.

ABSTRACT

Markov State Models (MSMs) are a powerful tool for modeling both the

thermodynamics and kinetics of molecular systems. In addition, they provide a

rigorous means to combine information from multiple sources into a single model and

to direct future simulations/experiments to minimize uncertainties in the model.

However, constructing MSMs is challenging because doing so requires decomposing

the extremely high dimensional and rugged free energy landscape of a molecular

system into long-lived states, also called metastable states. Thus, their application has

generally required significant chemical intuition and hand tuning. To address this

limitation we have developed a toolkit for automating the construction of MSMs

called MSMBuilder (available at https://simtk.org/home/msmbuilder). In this work we

demonstrate the application of MSMBuilder to the villin headpiece (HP-35 NleNle),

one of the smallest and fastest folding proteins. We show that the resulting MSM

captures both the thermodynamics and kinetics of the original molecular dynamics of

the system. As a first step towards experimental validation of our methodology we

show that our model provides accurate structure prediction and that the longest

timescale events correspond to folding.


INTRODUCTION

For a molecular system, the distribution of conformations and the dynamics between

them are determined by the underlying free energy landscape. Thus, the ability to map

out a molecule’s free energy landscape would yield solutions to many outstanding

biophysical questions. For example, structure prediction could be accomplished by

identifying the free energy minimum (46), leading to insights into catalytic

mechanisms of proteins that are difficult to crystallize. Intermediate states, such as

those currently thought to be the primary toxic elements in Alzheimer’s disease (47),

could also be identified by locating local minima. As a final example, protein folding

mechanisms could be understood by examining the rates of transitioning between all

the relevant states.

Unfortunately, the free energy landscapes of solvated biomolecules are

extremely high dimensional and there is no analytical means to identify all the relevant

features, especially when one is concerned with molecules in which small molecular

changes yield significant perturbations of the system, such as amino acid mutations in

proteins. Therefore, a theoretical treatment requires sampling the potential, generally

using Monte Carlo (MC) or Molecular Dynamics (MD), and then inferring

information about the states in the free energy landscape from the sampled

configurations. Moreover, if one is interested in kinetic properties, one must go further

and sample kinetic quantities (e.g. rates) of interconversion between these

thermodynamic states.

Mapping out a molecule’s free energy landscape can be broken down into

three stages: 1) identifying the relevant states and, in particular, the native state, 2)

quantifying the thermodynamics of the system, and 3) quantifying the kinetics of

transitioning between the states. Each of these stages builds upon the preceding stages.

In fact, this hierarchy of objectives is evident in the literature. For example, in the

structure prediction community it is common to plot the free energy as a function of

the RMSD to the native state (48). Such representations allow researchers to quickly


assess whether or not their potential accurately captures the most experimentally

verifiable state, the native state. However, they provide little information on the

presence of other states, their relative probabilities, or the kinetics of moving between

them (49). Projections of the free energy landscape onto multiple order parameters, on

the other hand, may capture multiple states and their thermodynamics (30, 49). The

main limitation of these representations is that they depend heavily upon the order

parameters selected (30). If the order parameters are not good reaction coordinates,

then important features may be distorted or even completely obscured (30, 50).

Furthermore, barring the selection of a perfect set of reaction coordinates, such

projections only yield limited information about the system’s kinetics due to loss of

information about other important degrees of freedom (51).

Clustering techniques are a promising means of overcoming these limitations

as they allow the automatic identification of the relevant degrees of freedom (52).

However, most clustering techniques are based solely on geometric criteria (31, 32) so

they may fail to capture important kinetic properties. To illustrate the importance of

integrating kinetic information into the clustering of simulation trajectories, one can

imagine two people standing on either side of a wall. Geometrically these two

individuals may be very close but kinetically speaking it could be extremely difficult

for one to get to the other. Similarly, two conformations from a simulation dataset may

be geometrically close but kinetically distant and, therefore, a clustering based solely

on a geometric criterion would be inadequate for describing the system’s dynamics.

Markov State Models (MSMs) fit nicely into this progression as they provide a

natural means to achieve a complete understanding of a molecule’s free energy

landscape—a map of all the relevant states with their correct thermodynamics and

kinetics (3, 6, 9, 10, 53). The critical distinction between MSMs and other clustering

techniques is that an MSM constitutes a kinetic clustering of one’s data (3, 6, 9, 10).

That is, conformations that can interconvert rapidly are grouped into the same state

while conformations that can only interconvert slowly are grouped into separate states.

Such a kinetic clustering ensures that equilibration within a state, and therefore loss of


memory of the previous state, occurs more rapidly than transitions between states. As

a result, the model satisfies the Markov property—the identity of the next state

depends only on the identity of the current state and not any of the previous states.
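
The Markov property can be illustrated with a small numerical sketch. Assuming a hypothetical two-state transition probability matrix (the values below are invented), the populations one lag time later follow from the current populations alone, so repeated multiplication by the matrix propagates the model forward in time.

```python
def propagate(populations, T, n_steps):
    """Advance state populations by n_steps lag times: p(t + dt) = p(t) T."""
    p = list(populations)
    for _ in range(n_steps):
        # Only the current populations enter the update; no earlier history is used.
        p = [sum(p[i] * T[i][j] for i in range(len(p))) for j in range(len(p))]
    return p


# Hypothetical row-stochastic transition matrix for two states.
T = [[0.9, 0.1], [0.2, 0.8]]
one_step = propagate([1.0, 0.0], T, 1)
long_time = propagate([1.0, 0.0], T, 200)
```

After many steps the populations approach the equilibrium distribution of the model, here 2/3 and 1/3, regardless of the starting state.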

MSMs are better able to capture the stochastic nature of processes like protein

folding than traditional analysis techniques, allowing more quantitative comparisons

with and predictions of experimental observables. Thus, they will allow researchers to

move beyond the traditional view of MD simulations as molecular microscopes. An

MSM also provides a natural means of varying the resolution of one’s model. For

example, consider a protein folding process that occurs on a 10 μs timescale. Using a

cutoff of 1 ns to distinguish a fast transition from a slow one would yield a high

resolution model that may be difficult to interpret by eye. Using a cutoff of 1 μs,

however, would likely yield a high-level model capturing the essence of the process in

a human readable form. MSMs provide a rigorous means to combine data from

multiple sources and can be used to extract information about long timescale events

from short simulations (11, 54, 55). Finally, there are a number of ways of exploiting

MSMs to minimize the amount of computation that must be performed to achieve a

good model for a given system (12, 19, 20).

Unfortunately, constructing MSMs is a difficult task because it requires

dividing the rugged and high dimensional free energy landscape of a system into

metastable states (6). A good set of states will tend to divide phase space along the

highest free energy barriers. More specifically, none of the states will have significant

internal barriers. Such a partitioning ensures the separation of timescales discussed

above—intrastate transitions are fast relative to interstate transitions—and, therefore,

that the model is Markovian. States with high internal barriers break the separation of

timescales and introduce memory. To illustrate this situation, imagine a state divided

in half by a single barrier that is higher than any barrier between states. Besides

breaking the separation of timescales by causing transitions within this state to be slow

relative to transitions between states, trajectories that enter the state to the left of the

internal barrier will also tend to leave to the left while trajectories that enter on the

right will tend to leave to the right. Thus, the probability of any possible new state will

depend on the identities of both the current state and the previous state, breaking the

Markov property. Avoiding such internal barriers has generally required a great deal

of chemical insight and hand tuning (33, 39); thus, the application of MSMs has been

limited.

To facilitate the more widespread use of MSMs we have developed an open

source software package called MSMBuilder that automates their construction (now

available at https://simtk.org/home/msmbuilder) (10). MSMBuilder builds on previous

automated methods (6) by incorporating new geometric and kinetic clustering

algorithms. It also provides a command-line interface built on top of an object-oriented

structure that should allow for the rapid incorporation of new advances. In summary,

MSMBuilder works as follows: 1) group conformations into very small states called

microstates and assume the high degree of structural similarity within a state implies a

kinetic similarity, 2) validate that this state decomposition is Markovian, and

optionally 3) lump the microstates into some number of macrostates based on kinetic

criteria and ensure that this macrostate model is Markovian. There are also a number

of tools for analyzing and visualizing the model at both the microstate and macrostate

levels.

In this work we demonstrate that MSMBuilder is able to construct MSMs for

full protein systems in an automated fashion by applying it to the villin headpiece

(HP-35 NleNle) (56, 57). Unlike the peptides that have been studied with automated

methods in the past (6), villin has all the hallmarks of a protein, such as a hydrophobic

core and tertiary contacts. It is also fast folding, so it is possible to carry out

simulations on timescales comparable to the folding time (58).

Our hope is that this work will serve as a guide for future users of

MSMBuilder. Thus, we will discuss failed models, the insights these models gave us,

and how these insights led to the final model. We will also discuss some of the

remaining limitations in the automated construction of MSMs. In addition, we will

demonstrate that our model yields accurate structure prediction and that the longest

timescales correspond to folding. However, our main emphasis will be on the

methodology of building MSMs that faithfully represent the raw simulation data. In

particular, we will focus on the microstate level as this is the finest resolution and

bounds the performance of lower resolution models. The full biophysical implications

of the model and their relation to experimental results will be discussed more

thoroughly in a later work.

MATERIALS & METHODS

SIMULATION DETAILS

The data set used in this study was taken from Ensign et al. (58) and is described

briefly below. It consists of ~450 simulations ranging from 35 ns to 2 μs in length and

is publicly available at the SimTK website (https://simtk.org/home/foldvillin).

First, the crystal structure (PDB ID 2F4K) (56) was relaxed with a

steepest descent algorithm in GROMACS (43, 59) using the AMBER03 force field

(60). The resulting structure was placed in an octahedral box of dimensions 4.240

nm×4.969 nm×4.662 nm and solvated with 1306 TIP3P water molecules. Nine 10 ns

high temperature simulations (at 373 K), each with different initial velocities drawn

from a Maxwell–Boltzmann distribution, were run from this solvated structure. The

final structures from each of these unfolding simulations were then used as the initial

points for ~450 folding simulations at 300 K.

Folding simulations were preceded by 10 ns equilibration simulations at

constant volume with the protein coordinates fixed. For all MD simulations, the

SHAKE (61) and SETTLE (62) algorithms were used with the default GROMACS 3.3

parameters to constrain bond lengths. Periodic boundary conditions were employed.

To control temperature, protein and solvent were coupled separately to a Nosé–

Hoover thermostat (63, 64) with an oscillation period of 0.5 ps. The system was

coupled to a Parrinello–Rahman barostat (65, 66) at 1 bar, with a time constant of 10

ps, assuming a compressibility of $4.5 \times 10^{-5}$ bar$^{-1}$. Velocities were assigned randomly

from a Maxwell–Boltzmann distribution. The linear center-of-mass motion of the

protein and solvent groups was removed every ten steps. A cutoff at 0.8 nm was

employed for both the Coulombic and van der Waals interactions. During these

simulations, the long-range electrostatic forces were treated with a reaction field

assuming a continuum dielectric of 78, and the van der Waals interactions were treated with a

switch from 0.7 nm to 0.8 nm. The neighborlist was set to 0.7 nm for computational

performance.

MARKOV STATE MODEL CONSTRUCTION

All the MSMs used in this paper were constructed with MSMBuilder (10), the relevant

components of which are reviewed below. A significant modification of the code was

the introduction of sparse matrix types, which allows the construction of MSMs with

many more states than previously possible by making more efficient use of the

available memory. Sparse matrices will be included in the next release of

MSMBuilder.

CLUSTERING

An approximate k-centers clustering algorithm was used to generate the microstates in

all the MSMs used in this study (41, 42). The algorithm works as follows: 1) choose

an arbitrary point as the first cluster center, 2) compute the distance between every

point and the new cluster center, 3) assign points to this new cluster center if they are

closer to it than the cluster center they are currently assigned to, 4) declare the point

that is furthest from every cluster center to be the next new cluster center, and 5)

repeat steps 2-4 until the desired number of clusters have been generated. The

computational complexity of this algorithm is O(kN) where k is the number of clusters

and N is the number of data points to be clustered. The algorithm is intended to give

clusters with approximately equal radii, where the radius of a cluster is defined as the

maximum distance between the cluster center and any other data point in the cluster.
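
The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the MSMBuilder implementation: the function name `k_centers`, the use of Euclidean distance on coordinate tuples in place of the heavy-atom RMSD, and the choice of the first point as the arbitrary starting center are all assumptions made for the example.

```python
import math

def k_centers(points, k, dist=math.dist):
    """Approximate k-centers clustering (hypothetical helper, for illustration).

    Returns (centers, assignments): centers holds indices into points, and
    assignments[i] is the position in centers of the center that point i
    belongs to. Assumes 1 <= k <= len(points).
    """
    n = len(points)
    centers = [0]                      # step 1: arbitrary first center
    assignments = [0] * n
    dist_to_center = [math.inf] * n    # distance to the currently assigned center
    while True:
        newest = len(centers) - 1
        # steps 2-3: reassign points that are closer to the newest center
        for i in range(n):
            d = dist(points[i], points[centers[newest]])
            if d < dist_to_center[i]:
                dist_to_center[i] = d
                assignments[i] = newest
        if len(centers) == k:
            break
        # step 4: the point furthest from its own center becomes the next center
        centers.append(max(range(n), key=dist_to_center.__getitem__))
    return centers, assignments
```

Each pass over the data costs O(N) distance evaluations and there are k passes, which matches the O(kN) complexity quoted above.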

Given that MD simulations are Markovian (9), it should be possible to generate

a Markov model for simulation dynamics by constructing sufficiently small (or

numerous) states. However, the size of a given data set will limit how many clusters

can be generated because reducing the number of conformations in each state will

eventually result in an unacceptable level of statistical uncertainty.

Based on the Boltzmann relationship, we can calculate the free energy of a

state as $-kT \log(p)$, where p is the probability of being in the state. Though small

variations in the radii of microstates may imply quite large variations in their volumes

due to the high dimensionality of the phase space of biomolecules, empirically we find

that assuming the clusters have equal volume is useful. In particular, we find that

interpreting lower free energy microstates as having higher densities and evaluating

models based on the correlation between the free energy and RMSD of each

microstate agrees with other measures of the validity of an MSM, such as implied

timescales plots, as discussed below. Because this relationship is not guaranteed to

hold, the correlation between microstate free energy and RMSD should never be used

as the sole assessment of a model. As discussed in the Results & Discussion, however, it is quite

useful for identifying potential shortcomings of a given model. These issues are not a

concern at the macrostate level.
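
As a small sketch of the Boltzmann relationship used here, the snippet below converts state populations into free energies. It assumes that populations are estimated from raw conformation counts, that the logarithm is the natural logarithm, and that energies are reported in kcal/mol at 300 K; the function name is hypothetical.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def microstate_free_energies(counts, temperature=300.0):
    """Free energy -kT ln(p) of each state, with p taken from raw counts.

    Every state is assumed to contain at least one conformation.
    """
    total = sum(counts)
    kT = K_B * temperature
    return [-kT * math.log(c / total) for c in counts]
```

Only differences between these values are meaningful: a state with four times the population of another lies lower in free energy by kT ln 4.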

All clustering in this work was based on the heavy-atom RMSD between pairs

of conformations. However, we note that pairs of atoms in the same side chain that are

indistinguishable with respect to symmetry operations were excluded from the RMSD

computations.

Representative conformations from some clusters are shown using VMD (67).

TRANSITION PROBABILITY MATRICES

Transition probability matrices are at the heart of MSMs (9). Row normalized

transition probability matrices are used in this study. The element in row i and column

j of such a matrix gives the probability of transitioning from state i to state j in a

certain time interval called the lag time (τ).

The transition probability matrix serves many purposes. For example, a vector

of state probabilities may be propagated forward in time by multiplying it by the

transition probability matrix.

$$p(t + \tau) = p(t)\,T(\tau) \qquad (1.1)$$

where t is the current time, τ is the lag time, p(t) is a row vector of state probabilities at

time t, and T(τ) is the row normalized transition probability matrix with lag time τ.
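
To make Equation 1.1 concrete, the following sketch estimates a row normalized transition probability matrix from state-index trajectories and propagates a probability vector by one lag time. The sliding-window counting scheme and all function names are assumptions for illustration; the text does not specify how MSMBuilder counts transitions.

```python
def count_transitions(trajectories, n_states, lag):
    """Count i -> j transitions separated by `lag` frames (sliding window)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for start, end in zip(traj[:-lag], traj[lag:]):
            counts[start][end] += 1
    return counts

def row_normalize(counts):
    """Convert counts to a row normalized transition probability matrix."""
    matrix = []
    for row in counts:
        total = sum(row)
        # rows with no observed transitions are left as zeros
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def propagate(p, T):
    """One application of Equation 1.1: p(t + tau) = p(t) T(tau)."""
    n = len(T)
    return [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
```

Because p is a row vector, the product sums over the rows of T, so element j of the result collects the probability flowing into state j from every state i.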

The eigenvalue/eigenvector spectrum of a transition probability matrix gives

information about aggregate transitions between subsets of the states in the model and

what timescales these transitions occur on (9). More specifically, the eigenvalues are

related to an implied timescale for a transition, which can be calculated as

$$k = -\frac{\tau}{\ln \mu} \qquad (1.2)$$

where τ is the lag time and μ is an eigenvalue. The corresponding left eigenvector

specifies which states are involved in the aggregate transition. That is, states with

positive eigenvector components are transitioning with those with negative

components and the degree of participation for each state is related to the magnitude

of its eigenvector component (9).
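
Equation 1.2 can be evaluated directly once an eigenvalue is known. The sketch below uses a hypothetical two-state matrix, for which the non-unit eigenvalue follows from the trace; a model with many states would need a numerical eigensolver instead. Function names are illustrative assumptions.

```python
import math

def implied_timescale(lag_time, eigenvalue):
    """Equation 1.2: k = -tau / ln(mu), in the same units as lag_time."""
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("expected an eigenvalue strictly between 0 and 1")
    return -lag_time / math.log(eigenvalue)

def two_state_slow_eigenvalue(T):
    """Non-unit eigenvalue of a 2x2 row normalized matrix.

    The eigenvalues sum to the trace and one of them is always 1.
    """
    return T[0][0] + T[1][1] - 1.0
```

For example, a hypothetical eigenvalue of 0.7 at a lag time of 15 ns corresponds to an implied timescale of about 42 ns.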

IMPLIED TIMESCALES PLOTS

Implied timescales plots are one of the most sensitive indicators of whether or not a

model is Markovian (34). These plots are generated by graphing the implied

timescales of an MSM for a series of lag times. If the model is Markovian at a certain

lag time then the implied timescales should remain constant for any greater lag time.

The minimal lag time at which the implied timescales level off is the Markov time, or

the smallest time interval for which the model is Markovian. The implied timescales

for a non-Markovian model tend to increase with the lag time instead of leveling off.

Unfortunately, increasing the lag time decreases the amount of data and, therefore,

increases the uncertainty in the implied timescales. Thus, implied timescales plots can

be very difficult to interpret.
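
The expectation that the implied timescales level off can be made explicit. Assuming the model is exactly Markovian at lag time τ, propagating for n lag times is equivalent to applying the transition probability matrix n times. Each eigenvalue at lag time nτ is then the n-th power of the corresponding eigenvalue at τ, and Equation 1.2 returns the same timescale:

$$T(n\tau) = T(\tau)^n \;\Rightarrow\; \mu(n\tau) = \mu(\tau)^n \;\Rightarrow\; k(n\tau) = -\frac{n\tau}{\ln \mu(\tau)^n} = -\frac{\tau}{\ln \mu(\tau)} = k(\tau)$$

Any systematic drift of k with the lag time therefore signals a departure from this idealized behavior, subject to the statistical uncertainty noted above.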

In this study error bars on implied timescales plots were obtained using a

bootstrapping procedure. Five subsets of the available trajectories

were drawn at random with replacement, and the averages and variances of the implied

timescales for each lag time were calculated across these subsets.
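
A bootstrapping procedure of this kind might look like the sketch below. It is written under two assumptions that the text does not confirm: each resample contains as many trajectories as the original data set, and `estimate` is a hypothetical callable standing in for the full pipeline that maps a list of trajectories to one implied timescale.

```python
import random
import statistics

def bootstrap_timescales(trajectories, estimate, n_iterations=5, seed=None):
    """Mean and sample variance of `estimate` over trajectory resamples."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_iterations):
        # draw whole trajectories with replacement
        sample = rng.choices(trajectories, k=len(trajectories))
        values.append(estimate(sample))
    return statistics.mean(values), statistics.variance(values)
```

Resampling whole trajectories rather than individual snapshots keeps the time ordering within each trajectory intact, which the transition counts depend on.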

TIME EVOLUTION OF OBSERVABLES

The time evolution of the mean and variance of any molecular observable can be

calculated from an MSM. Calculating the time evolution of an observable X requires

calculating the average of X in each state i ($X_i$) and the average of $X^2$ in each state ($X_i^2$). In this

study we took averages over five randomly selected conformations from each state.

An initial state probability vector may then be propagated in time as in Equation 1.1.

At each time step the mean and variance can be calculated as

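$$\langle X \rangle = \sum_{i=1}^{N} p_i(t)\, X_i \qquad (1.3)$$

$$\sigma^2 = \langle X^2 \rangle - \langle X \rangle^2 \qquad (1.4)$$

where N is the number of states, $p_i(t)$ is the probability of state i at time t, σ is the

standard deviation, and

$$\langle X^2 \rangle = \sum_{i=1}^{N} p_i(t)\, X_i^2 \qquad (1.5)$$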
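
Equations 1.3 to 1.5 translate into a short routine for a single time point. This is a sketch with hypothetical argument names; the per-state averages of X and of X squared are assumed to have been computed beforehand, and small negative variances caused by rounding are clipped to zero.

```python
import math

def observable_mean_and_std(p, x_mean, x2_mean):
    """Mean and standard deviation of an observable at one time point.

    p[i]       -- probability of state i at time t
    x_mean[i]  -- average of X within state i
    x2_mean[i] -- average of X**2 within state i
    """
    mean = sum(pi * xi for pi, xi in zip(p, x_mean))          # Equation 1.3
    mean_sq = sum(pi * x2i for pi, x2i in zip(p, x2_mean))    # Equation 1.5
    variance = max(mean_sq - mean ** 2, 0.0)                  # Equation 1.4
    return mean, math.sqrt(variance)
```

Calling this after every propagation step of Equation 1.1 yields the full time evolution of the mean and standard deviation.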

RESULTS & DISCUSSION

AN INITIAL MODEL

Given the computational cost of running extensive MD simulations an important

consideration in constructing an MSM is to maximize one’s use of the available data.

Of course, one’s hardware always sets hard upper limits on the amount of data that

may be used at each stage of building an MSM. In particular, it may not always be

possible to fit all of the available conformations into memory for the initial clustering

phase of constructing an MSM with MSMBuilder. A convenient way of overcoming

this bottleneck is to use a subset of the available data to generate a set of clusters. Data

that was left out during the clustering phase may then be assigned to these clusters.

To maximize the use of our data while satisfying the memory constraints of

our system we first sub-sampled our dataset by a factor of 10 and clustered the

resulting conformations into 10,000 states. Snapshots were stored every 50 ps during

our MD simulations, which will henceforth be referred to as the raw data. Thus, the

effective trajectories used during our clustering consisted of snapshots separated by

500 ps. The remaining 90% of the data was subsequently assigned to this 10,000 state

model. Fortunately, it is possible to parallelize this assignment phase because the

cluster definitions are never updated after the initial clustering.
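
The sub-sample, cluster, then assign workflow can be sketched as follows. The helper names and the use of Euclidean distance in place of the heavy-atom RMSD are assumptions for illustration only.

```python
import math

def subsample(trajectory, stride):
    """Keep every `stride`-th snapshot of a trajectory."""
    return trajectory[::stride]

def assign_to_centers(conformations, centers, dist=math.dist):
    """Index of the nearest cluster center for each conformation."""
    return [
        min(range(len(centers)), key=lambda c: dist(x, centers[c]))
        for x in conformations
    ]
```

Each conformation is assigned independently of all the others and the centers are fixed, so the input can be split into chunks and processed by separate workers without any communication between them.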

As discussed in the introduction, the first criterion for assessing the validity of

our model is whether or not it is capable of capturing the native state. The next

criterion is whether or not the thermodynamics of the model are correct. An initial

assessment of these two criteria may be obtained from a scatter plot of the free energy

of each state as a function of the RMSD of the state center from the native state.

There is some correlation between the free energy of a microstate and the

RMSD of its center from the crystal structure in this model, as shown in Figure 3A.

However, the most native-like RMSD of any of the state centers is 4.15 Å whereas the

simulations reach conformations with RMSD values as low as 0.52 Å. This

discrepancy is a first indication that there may be significant heterogeneity within the

states of this model. In particular, more near-native conformations must have been

absorbed into one or more other states. Highly heterogeneous states are likely to

violate the assumption that the degree of geometric similarity within a microstate

implies a kinetic similarity, preventing the construction of a valid MSM. This

conclusion is supported by the fact that the average distance between any

conformation and the nearest cluster center is over 4.5 Å.

Figure 3. Scatter plots of the free energy of each microstate (in kcal/mol) versus its RMSD. A) The

initial 10,000 state model, B) the 30,000 state model, C) the final 10,000 state model, and D) the

final 10,000 state model except that the average RMSD across five structures in each state is used

instead of the RMSD of the state center.

Final confirmation of the imperfections of the current 10,000 state model

comes from examining the implied timescales as a function of the lag time. If the

division into microstates were fine enough to ensure the absence of any large internal

barriers the largest implied timescales should be invariant with respect to the lag time

for any lag time greater than the Markov time (34). Figure 4 shows that the implied

timescales for this model continue to grow monotonically as the lag time is increased.

While the growth is not too severe it should be possible to improve upon this model

given the amount of sampling in the dataset.

Figure 4. Top ten implied timescales for the initial 10,000 state model.

Besides the structural and kinetic heterogeneity within states, the

monotonic growth of the implied timescales may also be due to the low number of

counts in some states and the resulting uncertainty in transition probabilities from

these states. For example, there are fewer than 10 data points in over 100 of the states at

the smallest lag time. Even for a state with ten data points no transition probability can

be resolved beyond a single significant digit. Increasing the lag time will reduce the

number of data points in every state, having particularly deleterious effects on

estimates of transition probabilities from states with low counts in the first place.

MORE STATES ARE NOT ALWAYS BETTER

As a first attempt at improving our original model we increased the number of states

from 10,000 to 30,000. Our objective in doing so was to avoid internal barriers by

dividing phase space into smaller states. In addition, we hoped to find more

near-native states by pulling low RMSD conformations into their own clusters.

Clustering the data into more states did indeed result in more near-native

states, as shown in Figure 3B. The most native-like state center in the 30,000 state

model has an RMSD of 3 Å and there is still a general correlation between low free

energy and low RMSD. The average distance between any conformation and its

nearest state center was also reduced from 4.5 Å to 3.5 Å.

However, increasing the number of states also had some negative effects on the

model. In the 10,000 state model about 1% of the states had 10 or fewer conformations

in them whereas in the new 30,000 state model 6% of the states have 10 or fewer

conformations. Thus, the uncertainty in the transition probabilities from many states

will be greater. In addition, while increasing the number of states did create a handful

of more near-native states, it also more than doubled the number of states with an

RMSD over 10 Å. These phenomena are consistent with the fact that the approximate

k-centers clustering algorithm used in this work tends to create clusters with

approximately equal radii (41, 42). When adding more clusters, this property will tend

to result in most of the new clusters appearing in large sparse regions of phase space in

the tails of the distribution of conformations. As a result of these shortcomings, the

30,000 state model was found to have monotonically increasing implied timescales

similar to those for the 10,000 state model and, therefore, is not significantly more

Markovian than the previous model (data not shown).

DISREGARDING OUTLIERS DURING CLUSTERING YIELDS A MARKOVIAN MODEL

One approach to dealing with outliers would be to use all the data during the clustering

phase and then discard those clusters that behave in unphysical ways, such as clusters

that act as sinks. However, such an approach could discard legitimate trapped states.

In addition, the tendency of our approximate k-centers algorithm to select outliers as

cluster centers could easily result in a large fraction of clusters being discarded.

To deal with the limitations of our clustering algorithm we reverted to using

10,000 states and increased the amount of sub-sampling at the clustering stage from a

factor of 10 to a factor of 100, which is equivalent to using trajectories with

conformations stored at a 5 ns interval for this data set. This change compensates for

the tendency of our approximate k-centers algorithm to select outliers as cluster

centers by reducing the number of available data points in the tails of the distribution

of conformations at the clustering stage. Thus, increasing the degree of sub-sampling

at our clustering stage focuses more clusters in dense regions of phase space where

more of the relevant dynamics are occurring. The remaining data can then be assigned

to these clusters, so no data is thrown out entirely. Incorporating the remaining data in

this manner will tend to enlarge clusters on the periphery of phase space because they

will absorb data points in the tails of the distribution of conformations. More central

clusters, on the other hand, will tend to stay approximately the same size. The number

of data points in every cluster should increase though, allowing better resolution of the

transition probabilities from each state.

A very simple kinetically inspired clustering scheme could be implemented by

sub-sampling to select N evenly spaced conformations (in time) as cluster centers. In

this case a large number of clusters would appear in dense regions of phase space

while there would be very few clusters in sparse regions. Our current approach is an

intermediate between such a kinetically inspired clustering and the purely

geometrically defined clustering used in our first two models. It is intended to have

some of the strengths of both approaches—i.e. fine resolution everywhere as in the

geometric approach but even more so in dense regions of phase space as in the kinetic

approach.

In fact, sub-sampling more at the approximate k-centers clustering stage and

then assigning the remaining data to these clusters does improve the structural,

thermodynamic, and kinetic properties of the model. Based on our experience with

this data set and a few others (RNA hairpins and small peptides, data not shown) a

good starting point is to sub-sample such that 10N conformations are used to generate

N clusters and conformations used during the clustering are separated by at least 100

ps. The remaining data should then be assigned to these clusters. The degree of sub-

sampling and number of clusters may then be adjusted to improve the model as

necessary as the optimal parameters will depend on the system. In particular, the

optimal strategy may be quite different for much smaller or larger systems.
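
The starting-point heuristic above can be turned into a small calculation of the sub-sampling stride. The function and parameter names are hypothetical, and the snapshot count used in the usage note is invented for illustration rather than taken from the villin data set.

```python
import math

def clustering_stride(n_snapshots, n_clusters, save_interval_ps,
                      per_cluster=10, min_separation_ps=100.0):
    """Sub-sampling stride (in snapshots) suggested by the rule of thumb."""
    # stride that leaves roughly per_cluster * n_clusters conformations
    by_count = n_snapshots // (per_cluster * n_clusters)
    # stride that keeps clustered conformations at least min_separation_ps apart
    by_time = math.ceil(min_separation_ps / save_interval_ps)
    return max(1, by_count, by_time)
```

With a hypothetical 10,000,000 snapshots saved every 50 ps and 10,000 clusters, the count criterion gives a stride of 100 and the time criterion a stride of 2, so the larger value of 100 is used.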

Structural agreement: Figure 3C shows that our new model has state centers

with RMSDs as low as 3.4 Å, which is somewhat higher than the 30,000 state model

but better than the original model. Examination of randomly selected structures from a

number of states revealed that the microstate center is not always a good

representative of the state. In particular, some near-native states have a dense pocket

of very low RMSD conformations and a handful of outliers. In such cases our

approximate k-centers clustering algorithm will select a conformation in between the

dense pocket of low RMSD conformations and the outliers (41) when really a structure from

the denser region would be more representative of the state. A further improvement in

the structural characterization of the model is made possible by calculating the average

RMSD over five randomly selected conformations from each state instead of just the

state center, as shown in Figure 3D. This analysis reveals that the most native-like

state has an average RMSD of about 1.8 Å. To illustrate the agreement between this

state and the crystal structure Figure 5A shows an overlay of three randomly selected

conformations from this state with the crystal structure. An interesting future direction

would be to further validate near-native states by comparing them directly with the

experimental data rather than the model thereof.

Figure 5. Three representative structures for A) the lowest RMSD state in the final model and B) the

most probable state in the final model overlaid with the crystal structure (red). The phenylalanine

core is shown explicitly for each molecule.

Thermodynamic agreement: As discussed in the introduction, we cannot

calculate the equilibrium distribution of villin analytically so we do not have an

absolute reference point to judge our model against. However, there are some

promising features of the thermodynamics of the model that lend it credibility. The

most populated state has about 4% of the total population and has an average RMSD

of 2.3 Å. Figure 5B illustrates the agreement between three random conformations

from this state and the crystal structure. The state with the lowest average RMSD also

has the fifth highest population, which is about 2% of the total population, and about

12% of the conformations are in states with average RMSD values less than 3 Å.

There is also a reasonable correlation between the RMSD and the free energy, as

shown in Figure 3D. Our results seem to be robust with respect to the method used for

calculating the equilibrium distribution as well, as discussed in Appendix A. Finally,

the populations from the MSM are consistent with those from averaging over the raw

data in successive windows of the simulation time, indicating that the MSM

thermodynamics are in agreement with the underlying potential if not experiment (data

not shown).

Here it is important to note that none of the simulations were started from the

native state. While this is not formally a blind prediction (since the crystal structure

has been previously reported (57)), it is promising that so many simulations folded

under the given potential, allowing one to not merely reach the folded state but predict

its structure ab initio. It will be interesting to see if this procedure can yield similar

results in a blind prediction, or at least when structural criteria are not used as a basis

for adjusting the model as in this work.

Kinetic agreement: Another promising feature of this model is that every state

contains at least 12 data points, indicating that this model may be able to

better resolve the transition probabilities for most states. In fact, the implied timescales

for this model do seem to level off as the lag time is increased. Figure 6A shows that

the longest timescales level off at a lag time of about 15 ns but increase moderately at

longer lag times. Figure 6B, however, shows that the implied timescales are level

within error from 15 to 60 ns. After about 35 ns there is an increase in the statistical

uncertainty in the implied timescales, explaining their apparent growth in Figure 6A.

After 60 ns the statistical uncertainty becomes enormous so implied timescales beyond

this point are not shown. Thus, this model appears to be Markovian at lag times of 15

ns and beyond.

Figure 6. Top ten implied timescales for the final model. A) The implied timescales at intervals of one

ns. B) The implied timescales with error bars obtained by doing five iterations of bootstrapping at

an interval of five ns.

The longest implied timescale for this model is about 8 μs. While this is quite

long relative to the experimentally measured folding time of 720 ns at 300 K (56), it is

consistent with previous simulation work suggesting that the experimental

measurements may be monitoring structural properties which relax faster than the

complete folding process (58). In that study, the authors found that a surrogate for the

experimental observable was consistent with the experimental measurements but that

longer timescales on the order of 4 μs were present when monitoring the relaxation of

a more global metric for folding. Ensign et al. also found timescales as high as ~50 μs

by applying a maximum likelihood estimator to a subset of the data with little folding.

While this timescale is much longer than any of the implied timescales in our MSM, it

is not inconsistent with our model because the rates for transitioning between some

states in an MSM, when fit using a two-state kinetics assumption, may be slower than

the implied timescales. Ensign et al. likely identified one of these slow rates by

focusing on a subset of the data. For a more detailed discussion of this topic with a

simple example see Appendix B.

The components of the left eigenvector corresponding to the longest timescale

give information about what is occurring on this timescale. That is, states with positive

eigenvector components are interchanging with states with negative components and

the degree of participation in this aggregate transition is given by the magnitude of the

components (9). Figure 7 demonstrates that the longest timescale in our model does

correspond to folding by showing that it corresponds to transitions between high and

low RMSD states. Numerous states do not participate strongly in this transition,

explaining the streak of points with eigenvector components near zero.

Figure 7. The average RMSD of each state in the final model versus its left eigenvector component in

the longest timescale transition showing that this transition corresponds to folding.

For further confirmation that the MSM is an accurate model of the simulation

data we compared the predicted time evolution of the population of the native state

with the raw simulation data, where the native state was defined as all microstates with

an average Cα RMSD to the crystal structure less than 3 Å. Figure 8 shows that there

is good agreement between the MSM and raw data.

Figure 8. Comparison between the time evolution of the native population in the MSM (blue) and the

raw data (black) for the entire dataset. The error bars represent the standard error.

While the time evolution of state populations is a good test of our MSM, often

we will want to compute the time evolution of some observable to make comparisons

with and predictions of experiments. As an example we compare the predicted time

evolution of the Cα RMSD to the actual time evolution of the RMSD in the raw data

for each of the nine initial configurations. The means by which we calculated the

RMSD from the MSM is described in the Methods section. Measuring the time

evolution of the RMSD from the raw data is simply a matter of measuring the average

RMSD over the simulations started from the given initial structure at every time point.

We also included a reduced representation of the raw data in this comparison. In the

reduced representation each trajectory is represented as a series of states rather than a

series of conformations. The average RMSD at a given time point is then calculated by

averaging the RMSD of the states each of the relevant trajectories is in. It is important

to note that we used the average RMSD across five randomly selected conformations

(and the variance thereof) for each state rather than the RMSD of the state centers in

these comparisons. Just using the RMSD of the state centers resulted in poor

comparisons since they are not truly representative of the state, as discussed above.
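
The reduced representation described above amounts to replacing each snapshot by a per-state value and averaging across trajectories at each time point. The sketch below uses hypothetical names and assumes that trajectories of unequal length contribute only to the time points they actually reach.

```python
def reduced_representation_average(state_trajectories, state_rmsd):
    """Average per-state RMSD across trajectories at each time point.

    state_trajectories -- lists of state indices, one list per trajectory
    state_rmsd[i]      -- representative (e.g. average) RMSD of state i
    """
    longest = max(len(t) for t in state_trajectories)
    averages = []
    for frame in range(longest):
        # only trajectories long enough to reach this frame contribute
        values = [state_rmsd[t[frame]] for t in state_trajectories if len(t) > frame]
        averages.append(sum(values) / len(values))
    return averages
```

Comparing this curve with the raw data isolates errors caused by the state definitions, while comparing it with the MSM prediction isolates errors caused by the Markov assumption.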

Very good agreement (i.e. within the uncertainties of the observables) was

found between all three representations for seven of the nine starting configurations,

an example of which is shown in Figure 9A. In these cases the MSM was found to

capture both the mean and variance of the time evolution of the RMSD to high

precision. The agreement was less strong for the two remaining starting

conformations, as shown in Figure 9B. In these cases the reduced representation

agreed well with the raw data, showing that our states are structurally sufficient to

capture the correct behavior. The mean RMSD from the MSM does not agree as well

with the other two representations, though the true mean is still within the variance of

the prediction from the MSM. Note that this variance and all the other variances

shown in Figure 9 are due only to the variance in the RMSD within each state and do

not include any of the statistical uncertainty in the model. Their large magnitude is an

indication of the heterogeneity of villin folding.

Figure 9. Comparison between the time evolution of the RMSD in the MSM (blue), the reduced

representation (yellow), and the raw data (black) for A) an example of good agreement and B) an

example of the worst case scenario. The error bars represent one standard deviation in the RMSD.

The discrepancy between the MSM predictions and the other two

representations for two of the starting structures indicates that our model still has some

subtle memory issues in a subset of the states. Interestingly, the two conformations

where the MSM agreed less well with the raw data were found to be faster folding

than the other seven initial configurations in a previous study (58). It would appear that

the slower folding trajectories are dominating the equilibrium distribution, causing all

the MSM predictions to level off at about 6 Å, which is too high for the two fast

folding initial configurations. Similar results were found with other observables, such

as the distance between the Trp23 and His27 residues that was previously used as a

surrogate for the experimental observable used to measure the folding time (58) (data

not shown).

REMAINING ISSUES

The most probable cause of any subtle memory issues in our model is the existence of

internal barriers within some states. As discussed previously, a state with a sufficiently

high internal barrier could cause transition probabilities from that state to depend on

the identity of the previous state. In particular, simulations started from one initial

configuration could tend to enter and exit a state in one way while simulations started

from a different initial configuration could tend to enter and exit the same state in a

completely different way.

To test for the existence of internal barriers we calculated independent MSMs

for each initial configuration. Each of these MSMs used the same state definitions;

however, only simulations started from the given starting conformation were used to

calculate the transition probabilities between states. All of these models agreed well

with the raw data. For example, Figure 10 shows good agreement for the starting

structure previously used as an example of the poorest agreement between the full

model and the raw data (shown in Figure 9B).


Figure 10. Improved agreement between the MSM and raw data for the example of poor agreement

from Figure 9B obtained by building the transition probability matrix from simulations started from

this starting structure alone. The error bars represent one standard deviation in the RMSD.

This improved agreement indicates that some states do indeed have internal

barriers. Moreover, the seven conformations for which the full model best reproduced

the raw data probably have the same behavior in these states while the two initial

configurations with poorer agreement between the full MSM and the raw data have a

different behavior in these states. The discrepancy then occurs because transition

probabilities for these states in the full MSM will be a weighted average of the two

types of behavior. The two starting conformations that contribute less heavily to this

weighted average are then captured less well by the full MSM.
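The weighted-average argument can be made concrete with a small sketch. It assumes transition probabilities are estimated by row-normalizing transition counts (a simple maximum-likelihood estimate, not necessarily the estimator used by MSMBuilder), and the two short state sequences are hypothetical. For any state, the row of the pooled model then equals the average of the per-start rows, weighted by how many transitions each start contributed from that state.

```python
import numpy as np

def count_matrix(dtraj, n_states, lag=1):
    """Count transitions i -> j separated by `lag` frames in one trajectory."""
    C = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        C[i, j] += 1
    return C

def row_normalize(C):
    """Row-normalize counts; rows without any counts are left as zeros."""
    totals = C.sum(axis=1, keepdims=True)
    return np.divide(C, totals, out=np.zeros_like(C), where=totals > 0)

# Hypothetical state assignments for two starting structures that leave
# state 0 in different ways (toward state 1 versus toward state 2).
traj_a = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
traj_b = [0, 2, 0, 2, 0]
C_a = count_matrix(traj_a, n_states=3)
C_b = count_matrix(traj_b, n_states=3)

T_a, T_b = row_normalize(C_a), row_normalize(C_b)
T_full = row_normalize(C_a + C_b)  # pooled ("full") model

# Fraction of all transitions out of state 0 contributed by start A.
w_a = C_a[0].sum() / (C_a[0].sum() + C_b[0].sum())
blend = w_a * T_a[0] + (1 - w_a) * T_b[0]  # equals T_full[0]
```

Because start A contributes most of the counts out of state 0 in this toy example, the pooled row resembles A's behavior far more than B's.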

In an attempt to address this problem we tried increasing the number of states

to 30,000. This model may have had some structural advantages and given a slightly

lower Markov time; however, it still suffered from the same subtle memory issues as

the 10,000 state version (data not shown). Models with even more states were not

attempted as they would greatly increase the number of states with very few counts

and, therefore, increase uncertainty in the model. These issues may be resolved by


identifying those states with internal barriers and splitting them further. However, such

hand-tuning is beyond the scope of this work, which focuses on the performance of

automated procedures for constructing MSMs.

CONCLUSIONS

Our analysis of the villin headpiece shows that the automated construction of MSMs

using MSMBuilder is now at a point where it can be applied to full protein systems, a

step beyond the small peptides that have been studied in the past(6, 68). This advance

was made possible by the proper application of our approximate k-centers clustering

algorithm. A naïve application of this algorithm to a molecular simulation dataset may

result in a mediocre state decomposition because outliers in sparse regions of phase

space are likely to be selected as cluster centers. To compensate for this tendency, one

can sub-sample at the clustering stage, effectively disregarding many of the outliers

and focusing the clusters in more relevant regions of conformational space. Data not

included in the clustering phases may then be assigned to the resulting model to

maximize the use of the available data. General guidelines for applying this result are

given in Section C of the Results & Discussion.
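The sub-sample, cluster, then assign strategy can be sketched as follows. This is a standard greedy k-centers procedure, not the approximate k-centers code in MSMBuilder, and it uses Euclidean distance between random 2D points as a stand-in for RMSD between conformations; the data, the sub-sampling stride of 10, and the choice of 20 clusters are all illustrative assumptions.

```python
import numpy as np

def k_centers(points, k, seed=0):
    """Greedy k-centers: repeatedly add the point farthest from all centers."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(points)))]
    # Distance from every point to its nearest center chosen so far.
    dist = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[centers]

def assign(points, centers):
    """Assign every point to its nearest cluster center."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(d, axis=1)

rng = np.random.default_rng(1)
data = rng.normal(size=(5000, 2))     # stand-in for all conformations
subsample = data[::10]                # cluster only every 10th point
centers = k_centers(subsample, k=20)
labels = assign(data, centers)        # then assign the full data set
```

Since the farthest-point rule seeks out extreme points, clustering a sub-sample reduces the chance that rare outliers are available to become centers, while the final assignment step still uses every data point.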

To demonstrate that our MSM is a reasonable map for villin’s underlying free

energy landscape, we showed that it is capable of accurate structure prediction and its

thermodynamics and kinetics are consistent with the raw simulation data. Thus, we

have laid a foundation for implementing an automated adaptive sampling scheme

capable of constructing models with the minimum possible computational cost. The

fact that our model captures both the mean behavior and heterogeneity of villin folding

will also allow for more accurate comparisons with experiments and predictions of

other experimental observables in a future work on the biophysics of villin folding. By

applying this methodology to multiple systems we hope to understand general

principles of protein folding. Of course, there is still room for improvement. Future

work on estimating reversible transition matrices from simulation data, clustering,


adaptive sampling, and exploring the connections between MSMs and Transition Path

Sampling (TPS)(33, 69) could extend the accuracy and applicability of MSMBuilder.


CHAPTER 3: MOLECULAR SIMULATION OF AB INITIO PROTEIN FOLDING

FOR A MILLISECOND FOLDER NTL9(1-39)

This chapter was taken from: Voelz VA, Bowman GR, Beauchamp KA, & Pande VS

(2010) Molecular simulation of ab initio protein folding for a millisecond folder

NTL9(1-39). J Am Chem Soc 132:1526-1528.

ABSTRACT

To date, the slowest-folding proteins folded ab initio by all-atom molecular dynamics

simulations have had folding times in the range of nanoseconds to microseconds. We

report simulations of several folding trajectories of NTL9(1-39), a protein which has a

folding time of ~1.5 milliseconds. Distributed molecular dynamics simulations in

implicit solvent on GPU processors were used to generate ensembles of trajectories

out to ~40 µs for several temperatures and starting states. At a temperature less than

the melting point of the forcefield, we observe a small number of productive folding

events, consistent with predictions from a model of parallel uncoupled two-state

simulations. The posterior distribution of the folding rate predicted from the data

agrees well with the experimental folding rate (~640/sec). Markov State Models

(MSMs) built from the data show a gap in the implied time scales indicative of two-

state folding, and heterogeneous pathways connecting diffuse mesoscopic substates.

Structural analysis of the 14 out of 2000 macrostates transited by the top ten folding

pathways reveals that native-like pairing between strands 1 and 2 only occurs for

macrostates with pfold > 0.5, suggesting β12 hairpin formation may be rate-limiting.

We believe that using simulation data such as these to seed adaptive resampling

simulations will be a promising new method for achieving statistically converged

descriptions of folding landscapes at longer time scales than ever before.


INTRODUCTION

A complete understanding of how proteins fold, i.e. self-assemble to their biologically

relevant “native state,” remains an unattained goal (70). Computer simulation,

validated by experiment, is a natural means to elucidate this. There is over a million-

fold range in folding rates, suggesting a possible diversity in mechanisms between

slow and fast folding proteins (71). Very fast (microsecond timescale) folding proteins

(56, 72) appear to fold via a large number of heterogeneous, parallel paths (58, 73,

74), potentially key for folding on such fast timescales. Does the folding of much

slower proteins change this picture?

To date, the slowest-folding proteins folded ab initio by all-atom molecular

dynamics simulations with fidelity to experimental kinetics have had folding times in

the range of nanoseconds to microseconds. These include the designed mini-protein

Trp-cage (~4.1 µs) (75), the villin headpiece domain (~10 µs) (76), a fast-folding

variant of villin (<1 µs) (58), and Fip35 WW domain (~13 µs) (77). In this

communication, we report simulations of several folding trajectories, each from fully

unfolded states, of the 39-residue protein NTL9(1-39), which experimentally has a

folding time of ~1.5 milliseconds (78).

MATERIALS & METHODS

Trajectories were simulated via the Folding@Home distributed computing platform

(79) at 300K, 330K, 370K and 450K from native, extended, and random-coil

configurations using an accelerated version of GROMACS written for GPU

processors (80), for an aggregate time of 1.52 ms. GPUs play a key role here, allowing

for dramatically longer trajectories than previously possible. The AMBER ff96

forcefield (60) with the GBSA solvation model (81) was used, a combination

previously shown to give good results folding Fip35 WW domain (77), and shown to

exhibit a good balance of native-like secondary structure for a set of small helical and

beta sheet peptides studied by replica exchange (82).



PREDICTION OF AB INITIO FOLDING AND FOLDING RATES

We find that the native state (taken from the N-terminal domain of the crystal structure

of ribosomal protein L9 (83)) is stable in this forcefield at 300K, exhibiting decreasing

stability with increasing temperature (Figure 11a). RMSD-Cα distributions after 10 µs

show well-defined native and collapsed unfolded basins near 3Å and 5Å, respectively.

Of the ~3000 trajectories started from unfolded (extended and coil) states at 370K

(Figure 11b), two reach an RMSD-Cα < 3.5Å and eight reach an RMSD-Cα < 4Å. No

productive folding trajectories were observed at lower temperatures, consistent with

the faster forward folding rate expected at 370K under Arrhenius kinetics. Higher temperature

trajectories (450K) exceed the melting temperature of NTL9 in the forcefield.

The observed number of folding events n is consistent with expectations from

a simple model of parallel uncoupled folding simulations (84) in which folding is

modeled as a two-state Poisson process: <n> = ∫M(t)k exp(-M(t)kt)dt, where M(t) is the

number of simulations that reach time t (Figure 11b) and k is the experimental folding

rate (~640/sec) (78). This theory predicts (on average) ~1.8 folding trajectories for the

amount of sampling performed, in agreement with the two folding trajectories found in

practice. Posterior distributions of folding rates given the amount of simulation time

and number of folding trajectories were computed using a Bayesian approach (85),

which yield expectation values within an order of magnitude of the experimental

folding rate.
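A simplified version of such a posterior can be written in closed form. The sketch below assumes that folding events follow a Poisson process with constant rate k over a total observation time, and that the prior on k is uniform, in which case the posterior is a Gamma distribution with shape n + 1 and rate equal to the total time. This is an illustrative simplification and not necessarily the procedure of (85); the total time used in the example is a hypothetical value, not a number reported here.

```python
import math

def rate_posterior_pdf(k, n_events, total_time):
    """Posterior density of the rate k: Gamma(shape=n_events + 1, rate=total_time).

    Follows from a Poisson likelihood, k**n * exp(-k * total_time), combined
    with a uniform prior on k (both are assumptions of this sketch).
    """
    shape = n_events + 1
    log_pdf = (shape * math.log(total_time)
               + (shape - 1) * math.log(k)
               - total_time * k
               - math.lgamma(shape))
    return math.exp(log_pdf)

def rate_posterior_mean(n_events, total_time):
    """Mean of the Gamma posterior, (n_events + 1) / total_time."""
    return (n_events + 1) / total_time

n_events = 2          # folding trajectories below the 3.5 Å threshold
total_time = 2.8e-3   # seconds of sampling; hypothetical value
mean_rate = rate_posterior_mean(n_events, total_time)
mode_rate = n_events / total_time  # most probable rate under this model
```

With so few events the posterior is broad, which is why agreement to within an order of magnitude is the appropriate standard of comparison.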


Figure 11. (a) Distributions of RMSD-Cα for native-state simulations of NTL9(1-39) after 10 µs. The

arrows indicate thresholds defined for the native basin at 3.5Å and 4Å. (b) The number of parallel

simulations M(t) started from unfolded states at 370K that reach time t. (c) Posterior predictions of

the folding rate given the amount of simulation time and observed folding events for 3.5Å (dashed)

and 4Å (solid) thresholds, using uniform (black) and Jeffreys (gray) priors, using methods from

(85). In red is a Gaussian distribution representing the experimental rate mean and standard

deviation.

In addition to native-like conformations, we see near-native configurations,

which show heterogeneity in hydrophobic packing, most notably in alternative side

chain arrangements in the beta-sheet structure (Figure 12). Most common of these is a

non-native hydrophobic core involving residues I4, I18 and I37 (which normally

contact the C-terminal helix in the full-length protein) with F5 solvent-exposed.

INSIGHT INTO FOLDING MECHANISMS

In order to describe the kinetics and mechanistic aspects of folding, we employ a new

paradigm for sampling the global free energy landscape of folding, using Markov

State Models (MSMs). MSM approaches, by automatically identifying a set of

kinetically metastable states (such as foldons (86)) and efficiently sampling transitions

between these states, can model long-timescale kinetics from much shorter trajectories

(3, 6, 37, 54).

Our strategy for simulating slow-folding proteins is first to generate an initial

series of kinetically connected states from both the folding and unfolding directions,

and then to use adaptive resampling techniques (12) to produce statistically converged

estimates of metastable basins and the transition rates between them. In the remainder

of this communication, we report progress toward the first goal, by constructing an

MSM from the entire set of 370K trajectory data (4, 10), which we will use to seed

future rounds of transition sampling. While additional rounds of adaptive sampling

could likely aid in increasing the quantitative power of this model, there are several

notable observations which can be made with the current data set.


Figure 12. (a) A snapshot from a folding trajectory (dark blue) achieves an RMSD-Cα of 3.1Å

compared to the native state (cyan). (b) Non-native (top) and native-like (bottom) hydrophobic core

arrangements observed in low-RMSD conformations of folding trajectories. Highlighted are

sidechains of residues F5 (magenta), V3,V9,V21 (tan), and L30,L35 (pink).

Key to accurately identifying metastable states is the clustering of trajectory

conformations into microstates fine-grained enough to be used for lumping into

groups of maximally metastable macrostates (10). 100,000 microstate clusters were

calculated using an approximate k-centers algorithm (42), each with an average radius

of 4.5Å RMSD-backbone. Lag times ranging from 1 to 32 ns were used to build a

series of MSMs. The implied time scales predicted by these models (obtained by

diagonalizing the rate matrix) show a clear spectral gap separating the slowest

relaxation time scale from the rest, indicative of single-exponential kinetics (see

Figure 52). The implied time scale of the model levels off beyond a lag time of ~10 ns

to an implied time scale of ~1 ms, close to the experimental folding time.
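For readers unfamiliar with implied time scales, the sketch below shows one common way to compute them. It assumes the standard relation t_i = -τ / ln(λ_i), where τ is the lag time and λ_i are the eigenvalues of a row-stochastic transition matrix estimated at that lag time; the two-state matrix is a made-up example rather than data from this work.

```python
import numpy as np

def implied_timescales(T, lag_time, n_timescales=3):
    """Implied time scales t_i = -lag_time / ln(lambda_i) of a transition matrix."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    # The largest eigenvalue is 1 (the stationary distribution); skip it.
    lam = eigvals[1:1 + n_timescales]
    return -lag_time / np.log(lam)

# Hypothetical two-state model estimated at a lag time of 10 ns.
T = np.array([[0.99, 0.01],
              [0.02, 0.98]])
timescales = implied_timescales(T, lag_time=10.0)  # roughly 328 ns
```

Repeating this calculation for models built at a range of lag times and checking where the slowest time scale stops changing is the leveling-off test described above.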

An important strength of MSMs is their ability to gain insight at coarser scales

by “lumping” the kinetic transitions into a simpler model with fewer states. To gain a

mesoscopic view of the folding free energy landscape, we lumped our 100,000-

microstate MSM into a 2000-macrostate model. In this view, we find that the

metastable states are diffuse collections of conformations over which multiple possible

folding pathways can occur, indicating a vast heterogeneity of folding substates that

need to be understood in greater detail. At the same time, we can identify highly


populated “native” (state n) and “unfolded” (state a) macrostates that dominate the

observed relaxation rates (Figure 13 and Figure 53).

The ten pathways with the highest folding flux from macrostate a to n were

calculated by a greedy backtracking algorithm (see Appendix C) from the macrostate

transition matrix using transition path theory (5, 87) (TPT). The diversity of pathways

demonstrates the power of the MSM approach: although we observe only a few

folding trajectories directly, a network of many possible pathways can be inferred

from the overlapping sampling of local transitions.

While NTL9(1-39) folds quickly for a two-state folder, it is similar in size to

many ultrafast (sub-millisecond) folders that appear to exhibit so-called “downhill”

folding. Hence, we would like to understand the structural features that limit the

overall folding rate. As in a macroscopic two-state model, the highest-flux pathways

in our mesoscopic model are a→m→n and a→l→n, direct routes from disordered to

structured macrostates, reminiscent of nucleation-condensation. These pathways by

themselves, however, account for only ~10% of the total flux, and the structural

diversity seen in all pathways is reminiscent of more hierarchical folding models such

as diffusion-collision. Thus, we sought to more fully study the 14 macrostates

transited by the top ten folding pathways.

Figure 13. A 2000-state Markov State Model (MSM) was built using a lag time of 12 ns. Shown is the

superposition of the top 10 folding fluxes, calculated by a greedy backtracking algorithm (see

Appendix C). These pathways account for only about 25% of the total flux, and transit only 14 of


the 2000 macrostates (shown labeled a-n, for convenient discussion). The visual size of each state

is proportional to its free energy, and arrow size is proportional to the inter-state flux.

To examine structural changes along the folding reaction, we considered three

main native structural elements: the central helix (α), the pairing of strands 1 and 2

(β12), and the pairing of strands 1 and 3 (β13). To quantify the extent of native-like

structuring for each of these elements we calculated Qα, Qβ12 and Qβ13, respectively

(see Appendix C for details). The Q-value is a number between 0 and 1 that quantifies

the extent of native-like contacts. We then examined, for each macrostate, the Q-

values in relation to the pfold value (committor), a kinetic reaction coordinate. The pfold

value is computed from the macrostate transition matrix (5, 37, 87).
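As a minimal sketch of how a committor can be obtained from a transition matrix, the code below solves the standard linear system q_i = Σ_j T_ij q_j for all intermediate states, with q fixed at 0 in the unfolded (source) states and 1 in the native (sink) states. The function name and the three-state matrix are illustrative assumptions, not the code or model used in this work.

```python
import numpy as np

def committor(T, source, sink):
    """pfold for each state: probability of reaching `sink` before `source`."""
    n = T.shape[0]
    A = np.eye(n) - T          # rows encode q_i - sum_j T_ij q_j = 0
    b = np.zeros(n)
    for s in source:           # boundary condition q = 0 in source states
        A[s] = 0.0
        A[s, s] = 1.0
    for s in sink:             # boundary condition q = 1 in sink states
        A[s] = 0.0
        A[s, s] = 1.0
        b[s] = 1.0
    return np.linalg.solve(A, b)

# Hypothetical symmetric three-state chain: unfolded (0), intermediate (1), native (2).
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
pfold = committor(T, source=[0], sink=[2])
```

In this symmetric toy chain the intermediate state is equally likely to fold or unfold first, so its pfold is 0.5, the value used above to separate pre- and post-transition macrostates.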

This analysis yields several key insights into the folding mechanism of

NTL9(1-39) on the mesoscale. We find the “unfolded” state a is compact, and

contains a baseline level of residual native-like structure, with Qα near 0.5, and Qβ12

and Qβ13 near 0.2. In general, across the 14 macrostates studied, Q-values increase as

pfold values increase, although the relative balance of Qα, Qβ12 and Qβ13 varies,

indicating pathway heterogeneity: i.e. native-like structures can form in different

orders (Figure 14, Figure 55, Figure 56). An exception to this, however, is observed

for β12 strand pairing. Only for macrostates with pfold > 0.5 (states g-n) does

appreciable β12 strand pairing occur (Figure 15). This suggests that the formation of a

local strand pair (β12), rather than a nonlocal strand pair (β13), is rate-limiting. This

effect is not predicted by strictly topological models of folding in which loop closure

entropy loss dominates (88), but instead may result from sequence-specific details.

Unlike the β13 strand pair, which has a small interaction surface stabilized by

hydrophobic contacts, the β12 hairpin contains seven of the protein’s eight lysine

residues, and three of its five glycine residues in a flexible loop region, features which

may imbue β12 with larger barriers to folding. This proposed role of β12 is also

consistent with the large changes in kinetics and stability seen experimentally for

mutations in the β12 hairpin (78).


Figure 14. The 14 macrostates involved in the top ten folding pathways, plotted along structural and

kinetic reaction coordinates. The balance between native-like helix and sheet structure is quantified

by Qα – (Qβ12 + Qβ13)/2 (vertical axis), and progress along the folding reaction is quantified by the

pfold (committor) value (horizontal axis). It can be seen that the “unfolded” state (a) contains

residual native-like helical propensity, and that pathways involving various ordering of native-like

helix and sheet formation are possible.

Figure 15. Q-values, which capture the extent of native-like structures, plotted versus pfold (committor)

values. The lines are to guide the eye.

It is natural to compare our results with previous unfolding simulations of

NTL9(1-39) K12M by Snow et al. (89). In that work, a detailed characterization of the

transition state ensemble required the definition of strand-pairing reaction coordinates

corresponding to β12 and β13 formation. In our MSM analysis, no such pre-definition

is required. Snow et al. also note the difficulty in resolving kinetic intermediates not

captured by the chosen order parameters. Indeed, our structural analysis can resolve

subtle kinetic intermediates within the native basin, corresponding to alternative

rearrangements of the β12 hairpin loop (Figure 57).

CONCLUSIONS

The above results suggest that existing forcefield models using implicit solvent are

indeed accurate enough to fold proteins ab initio at long time scales (milliseconds),

opening the door to simulating more structurally complex proteins. Moreover, our

work demonstrates that there need not be a single pathway or single, dominant

mechanism for the folding of a given protein: since the theories proposed for how

proteins fold are based on broadly relevant physical principles, it is natural to imagine

that multiple mechanisms could be simultaneously present, but that the sequence of

the protein, coupled with the chemical environment, would control the extent to

which each mechanistic pathway is seen.


CHAPTER 4: PROTEIN FOLDED STATES ARE KINETIC HUBS

This chapter was taken from: Bowman GR & Pande VS (2010) Protein folded states

are kinetic hubs. Proc Natl Acad Sci U S A 107:10890-10895.

ABSTRACT

Understanding molecular kinetics, and particularly protein folding, is a classic grand

challenge in molecular biophysics. Network models, such as Markov State Models

(MSMs), are one potential solution to this problem. MSMs have recently yielded

quantitative agreement with experimentally derived structures and folding rates for

specific systems, leaving them positioned to potentially provide a deeper

understanding of molecular kinetics that can lead to experimentally testable

hypotheses. Here we use existing MSMs for the villin headpiece and NTL9, which

were constructed from atomistic simulations, to accomplish this goal. In addition, we

provide simpler, humanly comprehensible networks that capture the essence of

molecular kinetics and reproduce qualitative phenomena like the apparent two-state

folding often seen in experiments. Together, these models show that protein dynamics

are dominated by stochastic jumps between numerous metastable states and that

proteins have heterogeneous unfolded states (many unfolded basins that interconvert

more rapidly with the native state than with one another) yet often still appear two-

state. Most importantly, we find that protein native states are hubs that can be reached

quickly from any other state. However, metastability and a web of non-native states

slow the average folding rate. Experimental tests for these findings and their

implications for other fields, like protein design, are also discussed.


INTRODUCTION

Molecular kinetics has fascinated biophysicists and biochemists for decades. From a

biophysical point of view, it remains a mystery how systems with so many possible

configurations can self-organize with such specificity and rapidity, carry out catalysis,

and trigger signaling cascades. From a biomedical standpoint, protein misfolding

causes many debilitating diseases, including Alzheimer’s, Huntington’s, and

Parkinson’s diseases (90). Understanding how proteins fold is a logical first step in

understanding how they misfold and, more importantly, how to prevent or recover

from misfolding; indeed, this approach is already proving valuable (40). Furthermore,

a better understanding of protein folding mechanisms could lead to more efficient

structure prediction (91, 92), for use in high throughput proteomics and studies of

systems that defy experimental characterization, and better models for molecular

kinetics could aid in computational drug and protein design.

What would the ultimate theory of molecular kinetics look like though? A

natural way of answering this question is by analogy to well established theories, such

as Schrödinger’s equation in the successful field of quantum mechanics. On the one

hand, computational solutions to Schrödinger’s equation have yielded quantitative

agreement with and prediction of experimental observables. However, equally

important is this theory’s ability to yield insight into simple systems, such as the

particle in a box, for the purposes of gaining an intuition for fundamental principles,

like the quantization of energy and the role of molecular orbitals. Likewise, the

ultimate theory of molecular kinetics should be capable of scaling from sophisticated

models capable of quantitatively predicting experiments to simple models which yield

mechanistic insight. At even the most fundamental levels of this hierarchy, such a

theory ought to be at least qualitatively consistent with experimental observations and

be capable of generating experimentally testable hypotheses. In particular, such a

theory ought to provide insight into protein folding as success in describing such

drastic conformational changes would be evidence for the theory’s ability to describe

less extreme ones.


We propose that networks of metastable, or long-lived, states (4, 9, 33, 55)

could fulfill this role because they are implicit in even the most simple protein folding

models; examples include U↔N and U↔I↔N where U is the unfolded state, I is an

intermediate, and N is the native state. Networks called Markov State Models (MSMs)

make these implicitly considered properties explicit and have the potential to provide

complete maps of a protein’s free energy landscape, with nodes corresponding to

metastable states (or free energy basins) and edges representing the probabilities of

transitioning between pairs of these states (3, 4, 6, 9, 33, 55).

A number of recent works have provided validation for these networks by

showing that they can yield quantitative agreement with experimentally derived

structures and folding rates (4, 5, 12, 93). In particular, the predicted native state from

our villin model (based on calculated free energies) had an RMSD to the crystal

structure of ~1.8 Å (4). The model also correctly predicted quantitative details of the

kinetics, such as the absolute folding rate (to logarithmic accuracy). This degree of

accuracy in predicted free energies, structures, and rates is crucial as all experimental

measurements are functions of these properties. In all, the agreement between theory

and experiment leads us to the conclusion that our models provide a sufficiently

accurate reflection of reality.

To further flesh out this potential theory of molecular kinetics, we have delved

into the nature of the free energy landscapes of the villin headpiece (HP-35 NleNle)

(56) and a 39 residue fragment of NTL9 (78). Furthermore, because complex networks

for real systems are difficult to comprehend, we construct simple, generic models that

capture qualitative phenomena like apparent two-state folding and provide an intuition

for molecular kinetics. Together, these models allow us to assess existing theories,

which describe folding as a two-state process characterized by cooperative transitions

across a dominant free energy barrier separating a rapidly mixing unfolded ensemble

from the native state (94, 95).


The remainder of this paper will be organized around three key results. First,

protein free energy landscapes can yield apparent two-state behavior even in the

absence of a single dominant barrier. Second, protein unfolded states are

heterogeneous, having multiple basins that interconvert more rapidly with the native

state than one another. Third, protein native states are kinetic hubs: it is possible to

reach them relatively quickly from anywhere in a network but it is also possible to get

stuck in a web of non-native states.


APPARENT TWO-STATE BEHAVIOR CAN OCCUR IN THE ABSENCE OF A KINETICALLY

RELEVANT TWO-STATE DECOMPOSITION.

Many proteins appear to fold via a single cooperative transition from a rapidly mixing

ensemble of unfolded conformations to a well defined native structure (94, 96).

However, based on chemical intuition, one would expect to find many more

metastable states, corresponding to the numerous favorable interactions that could

form in the absence of the full native structure as well as dynamics within the native

state. To reconcile these points, one typically assumes a single dominant free energy

barrier that serves as the rate limiting step for folding. Other barriers are often

assumed to be small relative to the thermal energy (or at least to the dominant barrier)

and the equilibrium probability of any intermediate is assumed to be too small to

detect.

However, in some cases modeling experimental data requires the use of at least

three states (97-99) and simple toy models have shown that even three-state systems

can yield apparent two-state behavior (100). Thus, it is natural to hypothesize that

many systems may have more complex arrangements of metastable states (9, 10, 101)

yet still exhibit apparent two-state behavior.


To test this hypothesis, we first turn to an MSM for the villin headpiece. This

MSM was recently built from atomistic simulations and, by assuming stochastic jumps

between its states, was shown to give quantitative agreement with experimental

structures and folding rates in addition to recapitulating the raw simulation data (4).

Thus, the presence of numerous metastable states in this model would be strong

evidence for their actual existence and the stochastic nature of transitions between

them. Indeed, with a lag time on the order of 10 ns, analysis of this MSM reveals the

existence of at least 500 metastable states. At least 2,000 are found for NTL9 (93).

The free energy barriers between our villin states have an average height of about 5.9

(+/- 2.5) kT (see Appendix D for details), indicating that they are non-trivial and

potentially detectable. Moreover, no single dominant barrier is apparent.

To better understand the system specific results from our all-atom models, we

now consider three simple models for dynamics capable of providing insight into

protein folding in general. Each of these networks has six metastable states and is

depicted in Figure 16. These models have a single folding pathway (S), parallel

folding pathways (P), and a heterogeneous unfolded state (H, with multiple unfolded

basins that each interconvert more rapidly with the native state than with one another)

as discussed in the Materials & Methods section.


Figure 16. Three representative networks each having unfolded state(s) (U and Ui), intermediates (Ii),

and a native state (N). S has a single pathway, P has parallel pathways, and H has a heterogeneous

unfolded state.

One may be tempted to associate the states in these models with folding nuclei

(102), pre-organized secondary structure (103), foldons (104), or the elements of some

other model of protein folding (53). However, we simply require that they all be

metastable. That is, a system within one state is more likely to stay there than to

transition to a different state. Moreover, we propose that the concept of metastability

unifies many of the previously proposed folding mechanisms, each of which describes

some systems better than others, as all consist of basic units that are stable on some

timescale.

We can now imagine monitoring stochastic transitions within each of these

representative systems (or ensembles thereof) with a device that can only detect the

native state. This hypothetical setup is equivalent to experiments wherein unfolded

molecules are allowed to relax to an observable folded state where they are trapped to

prevent unfolding and refolding. Figure 17 shows that such an experiment yields the

exponential behavior typical of an ideal two-state system. In fact, exponential fits to

the data after the initial lag phase only give slight underestimates of the true Mean

First Passage Times (MFPTs) between the unfolded and folded states (Table 1). Thus,

even these simple systems are qualitatively consistent with both stochastic jumps

between numerous metastable states and apparent two-state behavior. This is

particularly surprising for model H since it cannot be divided into a single, rapidly

mixing unfolded basin separated from the native state by one dominant barrier (i.e. it

is not two-state).
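The thought experiment of a detector that sees only the native state can be reproduced numerically by making the native state absorbing. The sketch below computes the distribution of first folding times and the mean first passage time for a six-state linear chain resembling network S. The hop probability of 0.05 per step is an arbitrary assumption, since the actual parameters of the models are given in the Materials & Methods section and are not reproduced here.

```python
import numpy as np

def mfpt_to_target(T, target):
    """Mean first passage time (in steps) from every state to `target`."""
    n = T.shape[0]
    others = [i for i in range(n) if i != target]
    Q = T[np.ix_(others, others)]  # transitions among non-target states
    m = np.linalg.solve(np.eye(len(others)) - Q, np.ones(len(others)))
    out = np.zeros(n)
    out[others] = m
    return out

def first_passage_distribution(T, start, target, n_steps):
    """Probability that `target` is first reached at step 1, 2, ..., n_steps."""
    T_abs = T.copy()
    T_abs[target] = 0.0
    T_abs[target, target] = 1.0    # trap molecules once they fold
    p = np.zeros(T.shape[0])
    p[start] = 1.0
    folded = []
    for _ in range(n_steps):
        p = p @ T_abs
        folded.append(p[target])   # cumulative fraction folded
    return np.diff(np.array(folded), prepend=0.0)

# Hypothetical single-pathway chain U, I1, I2, I3, I4, N with hop probability 0.05.
n, hop = 6, 0.05
T = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            T[i, j] = hop
    T[i, i] = 1.0 - T[i].sum()

mfpt = mfpt_to_target(T, target=n - 1)
fpt = first_passage_distribution(T, start=0, target=n - 1, n_steps=20000)
mean_from_distribution = (np.arange(1, fpt.size + 1) * fpt).sum()
```

The distribution `fpt` shows a short lag phase followed by a near-exponential tail, and its mean agrees with the MFPT from the linear solve.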


Figure 17. Distributions of the first folding times for the simple networks S, P, and H are shown in

panels A, B, and C respectively. The blue lines are exponential fits to the data after the initial lag

phase.

A kinetic perspective on our simple networks helps to explain why two-state

behavior is often observed even when there are many large barriers. As discussed

previously, when there is a single dominant rate then faster transitions will tend to be

lost in the noise. Multiple slow rates will also be lost in the noise if they are too

similar. Moreover, this same logic applies even when there are multiple folding routes

from different starting points (and thus no kinetically relevant two-state

decomposition). Thus, observing anything other than mainly single exponential

kinetics requires a delicate balance wherein the slowest rates differ sufficiently to

distinguish them but not so much that one dominates the rest, not to mention

extremely precise measurements.

Fortunately, there is ample evidence that achieving this balance and the

precision necessary to detect it are possible. Multi-exponential behavior is often

consistent with experimental data that are instead fit to stretched exponentials (105, 106).

Increasing the temporal resolution of single molecule pulling experiments has also

steadily revealed more metastable states, and kinetic measurements can be probe

dependent (107, 108). We propose that the ability to simultaneously monitor multiple

degrees of freedom (such as extension and FRET) in single molecule experiments

would reveal even more metastable states, particularly if MSMs were used to choose

the number of probes employed and their placement.

PROTEINS HAVE HETEROGENEOUS UNFOLDED STATES WITH MULTIPLE BASINS THAT

INTERCONVERT MORE RAPIDLY WITH THE NATIVE STATE THAN EACH OTHER.

We now investigate which of the simple network topologies is most representative of

real protein free energy landscapes. As a first step, we have calculated that every state

can reach the native basin of our villin model in one or two steps. This eliminates the

possibility of a single pathway since states with that topology could require up to 499

steps to reach the native basin.

Determining whether the parallel pathway model (95, 109, 110) or the

heterogeneous unfolded state model is more representative of villin requires a

definition of the unfolded state(s). Since every non-native state can reach the native

basin in one or two steps it is natural to label every state that is not directly connected

to the native state (332 in all) as unfolded and all other non-native states (167 in all) as

intermediates.
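
The labeling rule above can be sketched as a breadth-first search over the observed connections. This is only an illustration: it assumes connections are symmetric (a transition observed in either direction counts as an edge), and the four-state adjacency map is hypothetical rather than taken from the villin model.

```python
from collections import deque


def hops_to_target(adjacency, target):
    """Minimum number of jumps from each reachable state to `target`.

    adjacency[i] is the set of states directly connected to state i;
    connections are assumed to be symmetric.
    """
    dist = {target: 0}
    queue = deque([target])
    while queue:
        state = queue.popleft()
        for nbr in adjacency[state]:
            if nbr not in dist:
                dist[nbr] = dist[state] + 1
                queue.append(nbr)
    return dist


def classify_states(adjacency, native):
    """Label states directly connected to the native state as intermediates
    and every other non-native state as unfolded."""
    dist = hops_to_target(adjacency, native)
    labels = {}
    for state in adjacency:
        if state == native:
            labels[state] = "native"
        elif dist.get(state) == 1:
            labels[state] = "intermediate"
        else:
            labels[state] = "unfolded"
    return labels


# Hypothetical toy network; state 0 plays the role of the native state.
toy_adjacency = {0: {1, 2}, 1: {0, 3}, 2: {0}, 3: {1}}
toy_hops = hops_to_target(toy_adjacency, target=0)
toy_labels = classify_states(toy_adjacency, native=0)
```

Here state 3 needs two jumps to reach state 0 and is therefore labeled unfolded, while states 1 and 2 are intermediates.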

Taking this definition, we can now examine the distribution of MFPTs from

each unfolded state to the native state as well as the distribution of MFPTs between all

pairs of unfolded states. Doing so reveals that the average MFPT to the native state is

880 (+/-270) nanoseconds, in reasonable agreement with the experimentally predicted

folding time of 720 nanoseconds (56). Moreover, this value is much lower than the

average MFPT between pairs of unfolded states (~370 microseconds), as shown in

Figure 18A and 18B. Considering every non-native state as part of the unfolded

ensemble also gives similar distributions (Figure 59), implying that these results are

robust to the exact definition of the unfolded state. Similar results are found for NTL9

as well (Figure 60). Thus, we can conclude that the heterogeneous unfolded state

model is most representative of our villin and NTL9 models and possibly proteins in

general. This result is in contrast to existing theories of protein folding, which assume

rapid equilibration within the unfolded ensemble (95, 111, 112).

Figure 18. Relaxation of villin from 500 state model. Distributions of the MFPTs from (A) unfolded

states to the native state and (B) between unfolded states. (C) Relaxation kinetics with a 10:1

signal-noise ratio (black curve with Gaussian noise) and a single exponential fit (blue curve with

τ≈810 ns).

Examination of representative structures suggests that non-native interactions

(often in the context of relatively compact conformations) and the enormity of

conformational space are responsible for slow transitions between unfolded basins

(Figure 61). Non-native contacts can easily have free energies on the order of native

contacts, making non-native states reasonably metastable. Once a set of non-native

contacts is broken, the probability of forming a particular set of other non-native

contacts is quite small due to the large number of other possibilities. This small

probability is equivalent to a slow rate. In contrast, evolutionary pressure to fold

makes transitioning to the native state reasonably probable, which equates to fast

folding relative to slower transitions between unfolded basins.

The tight distribution of MFPTs to the native state is also consistent with our

explanation of apparent two-state behavior. Due to experimental noise, it is difficult to

justify using more than one or two exponentials to fit the relaxation of our coarse-

grained villin model with 500 states, as shown in Figure 18C. Only with an extremely

high signal to noise ratio can one accurately identify the deviations from single

exponential relaxation shown in Figure 62. We also note that more fine-grained

models for villin can capture the burst phase in its relaxation (Figure 63) but here we

emphasize the ability of our coarse-grained model to capture the apparent two-state

behavior that dominates this system’s relaxation (56).

Our ability to reconcile our model with existing experimental data on the

nature of the unfolded ensemble (specifically under native conditions, as opposed to

the more rapidly mixing denatured state) indicates that more experiments will be

required to definitively falsify or support our conclusions. For example, Nettels et al.

have reported a 50 ns global relaxation time within the unfolded ensemble (113). Our

model, however, would suggest that this may be due to relaxation within individual

unfolded basins, not between them. This hypothesis is consistent with recent

measurements of slow dynamics in the unfolded ensemble from the Lapidus lab (114,

115). Therefore, we suggest that this may be an interesting direction for future

experimental work. In addition to existing methodologies for probing the unfolded

ensemble, single molecule experiments monitoring multiple degrees of freedom could

help to falsify or support our conclusions.

If our heterogeneous unfolded state model is indeed generally true then protein

folding kinetics cannot be accurately described by two states separated by a single

barrier. Instead, folding must be understood in terms of multiple pathways starting

from a number of distinct states. Mixing between pathways adds another layer of

complexity to the folding process. Modeling the effects of mutations will thus require

considering changes in the relative free energies of numerous states and barrier

heights. Understanding the global effects of small changes on networks will likely also

be important for protein design.

A NATIVE HUB ALLOWS RAPID FOLDING BUT PROTEINS CAN STILL GET STUCK IN A WEB

OF NON-NATIVE STATES.

The accessibility of villin’s native state implies the hub-like connectivity characteristic

of small-world and scale-free networks (116, 117). We can test this hypothesis by

counting the number of connections observed between states because only those

transitions with probabilities above some threshold are observed with our finite

sampling (all transitions would be observed with infinite sampling). Examining

subsets of the states independently, one finds that the average degree (or number of

connections) increases as one moves from the unfolded states to the native basin. The

unfolded states have an average degree of 12 while the intermediate states have an

average degree of 25. The native state acts as a hub, connected to 167 other states.

Similar results are found for a small β-sheet peptide (17) and NTL9.
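
Counting connections in this way can be sketched from a matrix of observed transition counts. The count matrix below is hypothetical, and treating a transition seen in either direction as one connection is an assumption of this sketch rather than a statement of the exact procedure used for villin.

```python
def state_degrees(counts):
    """Number of distinct other states each state exchanges with.

    counts[i][j] is the number of observed i -> j transitions; an edge is
    assumed to exist if a transition was seen in either direction.
    """
    n = len(counts)
    return [
        sum(1 for j in range(n)
            if j != i and (counts[i][j] > 0 or counts[j][i] > 0))
        for i in range(n)
    ]


def mean_degree(degrees, subset):
    """Average degree over a subset of states (e.g. the unfolded states)."""
    subset = list(subset)
    return sum(degrees[s] for s in subset) / len(subset)


# Hypothetical counts; state 0 plays the role of the native hub.
toy_counts = [
    [50, 3, 2, 1],
    [4, 40, 1, 0],
    [2, 0, 30, 0],
    [1, 0, 0, 20],
]
toy_degrees = state_degrees(toy_counts)
non_native_mean = mean_degree(toy_degrees, [1, 2, 3])
```

In this toy example the hub has degree 3 while the remaining states average fewer than two connections each, mirroring the qualitative trend described above.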

Reduced connectivity between non-native states results in slow dynamics

within the unfolded ensemble. This connectivity contradicts other models, which

predict bottlenecks close to the native state and high connectivity in non-native

regions (95, 110, 112, 118), as depicted in Figure 19A. A more thorough discussion of

the similarities and differences between our model and those proposed previously is

given in the next section.

Figure 19. Schematic diagrams of funnel and native hub models having unfolded states (U),

intermediates (I), and native states (N). (A) A network description of a folding funnel with nodes

corresponding to individual conformations and a bottleneck near the native state. (B) A native hub

model with metastable nodes. The size of each node in (B) is correlated with its equilibrium

probability and the connectivity falls off as one moves away from the native state.

The native hub explains how villin folds so quickly. Just as there are only

about six degrees of separation between people in the US (119), it is possible to reach

villin’s native state in one or two jumps (each 15 ns). Therefore, it is possible to fold

from anywhere in the landscape in 30 ns or less. This result is consistent with recent

experimental work showing that the transition path time between the unfolded and

native ensembles can be as much as four orders of magnitude faster than the average

folding time (120) and likely results from evolutionary pressure to fold quickly.

Due to the kinetic proximity of the native state with a 15 ns lag time, we see

that villin can fold in just 30 ns; however, such trajectories are rare because the

metastability and connectivity of non-native states makes taking a direct route to the

native state improbable. Instead, villin will often spend considerable time in a web of

non-native states before finally folding, resulting in an average folding time on the

microsecond timescale. In the future, it will be interesting to test whether slower

folding proteins have unfolded states further from the native one or just more strongly

metastable states, which equates to higher barriers and slower transitions between

states. Preliminary analysis of NTL9 suggests every basin can reach the native state in

5 steps (~100 nanoseconds) or less.

We have also found a rough correlation between the connectivity of states and

their equilibrium probabilities. The average probabilities of unfolded and intermediate

states are ~0.0005 and ~0.004, respectively. The native state has an equilibrium

probability of ~0.2. Figure 19B shows a schematic of a protein folding network that

attempts to capture all of these observations in a humanly comprehensible manner. All

of these observations are in qualitative agreement regardless of the degree of lumping;

that is, whether one uses smaller and more numerous states to capture more local

minima in the landscape or fewer and more voluminous states to obtain an even more

coarse-grained model. While one may be tempted to consider Figure 19B merely an

alternative depiction of a funnel, we emphasize that the kinetic connectivity of the

native state and lack of connectivity within the unfolded ensemble are important

qualitative deviations from traditional funnel theory (95).

An important methodological consequence of the network topology found here

is that many short, parallel simulations (or experiments) started from arbitrary initial

points are an excellent way of exploring the entire free energy landscape. In the

extreme case of using a single starting point, one could still reach every free energy

basin despite the presence of numerous metastable states so long as each simulation

was longer than the diameter of the network (the minimal time that allows one to reach

any state from an arbitrary starting point). However, reaching every state would be

impossible with simulations that were shorter than the diameter of the network. Thus,

our network theory provides an alternate explanation for the previously noted need to

have simulations longer than some minimal lag phase, which was then attributed to the

need to equilibrate within the unfolded state before folding in two-state systems (121).
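
The diameter argument can be made concrete with a small sketch that converts the longest shortest path in a network into a time by multiplying by the lag time. Both toy networks and the helper names are hypothetical; the 15 ns lag time is the value quoted for the villin model.

```python
from collections import deque


def eccentricity(adjacency, start):
    """Largest number of jumps needed to reach any state from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nbr in adjacency[state]:
            if nbr not in dist:
                dist[nbr] = dist[state] + 1
                queue.append(nbr)
    if len(dist) != len(adjacency):
        raise ValueError("network is not connected")
    return max(dist.values())


def diameter_time(adjacency, lag_time):
    """Network diameter expressed as a time: (maximum hops) x (lag time)."""
    return max(eccentricity(adjacency, s) for s in adjacency) * lag_time


# Hypothetical hub network: state 0 is connected to every other state.
hub = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
# Hypothetical single-pathway chain with the same number of states.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

hub_diameter = diameter_time(hub, lag_time=15.0)      # 2 jumps x 15 ns
chain_diameter = diameter_time(chain, lag_time=15.0)  # 4 jumps x 15 ns
```

The diameter is a lower bound on the simulation length needed to reach every state from a single starting point; it says nothing about how probable such direct routes are.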

Another simple but more efficient strategy would be to start simulations from

multiple conformations dispersed throughout phase space and run them long enough to

ensure mixing between them and coverage of the entire space. In fact, Figure 20 and

Figure 64 show that such a scheme is actually more valuable than a few long

trajectories, using a relative entropy metric for MSMs from Ref (18) to measure the

information content of different datasets relative to our validated villin model.

However, this trend can be seen to break down for simulations that are insufficiently

long or too few as they are unlikely to reach every state or traverse every possible

pathway between pairs of states. The simulation length at which this breakdown

occurs does, however, decrease as the number of simulations increases. Even better

performance can be obtained using adaptive sampling algorithms (18, 19), which

direct sampling to where it is needed most to improve a model.
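
As a rough illustration of how a relative entropy can compare two MSMs, the sketch below weights the Kullback-Leibler divergence between corresponding rows of two transition matrices by the reference model's equilibrium populations. This functional form is an assumption made for illustration; the precise metric is the one defined in Ref (18) and Appendix D, and the matrices below are invented.

```python
import math


def relative_entropy(P, Q, pi):
    """sum_i pi[i] * sum_j P[i][j] * ln(P[i][j] / Q[i][j]).

    P is the reference transition matrix, Q the test matrix, and pi the
    equilibrium populations of the reference model. Returns infinity if Q
    assigns zero probability to a transition that P allows.
    """
    total = 0.0
    for i, weight in enumerate(pi):
        for p, q in zip(P[i], Q[i]):
            if p > 0.0:
                if q <= 0.0:
                    return math.inf
                total += weight * p * math.log(p / q)
    return total


# Hypothetical two-state reference model and a model built from less data.
P_ref = [[0.9, 0.1], [0.2, 0.8]]
pi_ref = [2.0 / 3.0, 1.0 / 3.0]  # stationary distribution of P_ref
Q_sub = [[0.8, 0.2], [0.2, 0.8]]

d_self = relative_entropy(P_ref, P_ref, pi_ref)
d_sub = relative_entropy(P_ref, Q_sub, pi_ref)
```

The distance is zero when the two models agree and grows as the test model departs from the reference. In practice, pseudocounts are often added so that unobserved transitions do not give an infinite distance.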

Figure 20. Distance between the final villin MSM and MSMs constructed from subsets of the data

(varying trajectory length and number of trajectories). Distance is measured by a relative entropy

metric (see Appendix D for details). Black lines are contours of equal amounts of data. No data was

available for the upper-right portion of the graph.

COMPARISON TO PREVIOUS THEORIES FOR PROTEIN FOLDING.

There is a long history of theoretical models for protein folding (53) so it is important

to put our work in the context of these previous theoretical approaches. In particular,

folding funnel models (95, 112, 118) have dominated much of how the field currently

conceptualizes protein folding and hence it is natural to compare our model to such

theories. One of the most similar funnel categories is type 0B, which is characterized

by overall downhill folding interrupted by a glass transition along the reaction

coordinate (95). While this regime does include slow dynamics between compact

states, it also results in a small number of folding pathways relative to higher

connectivity in the unfolded ensemble. In addition, this and other previous funnel-

based models have explicitly described rapidly interconverting unfolded states, as

reflected in the “bottleneck” discussed in previous works (110, 111), as well as the

choice of structurally-based reaction coordinates like the number of native contacts

(Q) (95, 111), which directly requires that dynamics along orthogonal degrees of

freedom, such as interconversion between unfolded conformations, is rapid compared

to folding. In contrast, we find a large number of folding pathways, slow dynamics

between unfolded states relative to folding, and no glass transition. Our folding rates

are also quite similar, rather than the different rates characteristic of the folding

pathways in type 0B folding.

Other funnel models have recognized the possibility of a large number of

folding pathways (95, 109, 118), but still in the context of fast dynamics within the

unfolded basin relative to slower transitions to the folded state. Some have even gone

so far as to assume global connectivity (122, 123); however, even these emphasize that

local connectivity would dominate in the full dimensional conformational space and

global connectivity only arises when projecting onto a few order parameters.

Furthermore, they argue global connectivity will not give an activation barrier and,

therefore, these models are primarily intended for studies of downhill folding or the

early activationless stages of folding. Our model, on the other hand, has a native hub

and slow dynamics in the unfolded state relative to faster folding regardless of the

degree of coarse-graining one employs. We also demonstrate that this can result in

apparent two-state folding (i.e. activated kinetics) and that this occurs in non-downhill

folding proteins, such as the millisecond folding NTL9.

CONCLUSIONS

Many biological systems, ranging from signaling pathways to social networks, can be

most naturally described as networks. As a field, we have now established a new level

to this hierarchy: a network theory for molecular kinetics that is able to map out the

free energy landscapes of proteins and other macromolecules in their entirety.

Previous work has demonstrated that this network theory is capable of

quantitative agreement with experiments (4, 5, 12, 93) and we have now shown that it

can also scale down to simple, generic models. Using this theory at both the

quantitative and qualitative levels, we have provided an intuition for conformational

changes as drastic as protein folding and this intuition has led to experimentally

testable insights into the nature of protein free energy landscapes.

We have focused on three new insights from these network models, which

appear to hold regardless of the degree of coarse-graining one employs and can be

reconciled with current experiments. First, even models that defy a kinetic

decomposition into two states often give rise to apparent two-state behavior. Second,

proteins have heterogeneous unfolded states (multiple basins that each interconvert

more rapidly with the native state than with one another, preventing a kinetic

decomposition into two states). Third, proteins have a native hub. Thus, it is possible

to fold quickly from anywhere in the landscape but proteins often get stuck in a web of

non-native states before finally folding, greatly increasing the average folding time.

These properties are a natural result of reasonably strong non-native

interactions and the enormous number of non-native conformations a protein can

adopt, in combination with evolutionary pressure to fold quickly (for example, to

avoid aggregation). Therefore, we suggest that these conclusions are likely true of

proteins in general. Our approach also unifies other models for protein folding by

recognizing that each of them builds upon elements, whether they are called folding

nuclei (102) or foldons (104), which correspond to different types of metastable states.

We look forward to a fruitful future of drawing on network theory to better

understand molecular kinetics and guide experiments probing both general properties

and system specific details. In particular, can one reinterpret the many experiments

that have been analyzed under a two-state assumption? If so, that could shed light on

the chemistry of the underlying structures that leads to the network topology and

dynamics described here. Moreover, can further experiments be designed to directly

probe the unfolded state under native conditions (rather than with denaturant or high

temperature, where mixing is more rapid) to directly test the predictions made here?

We also hope to explore how the methodologies developed for building and

understanding biomolecular networks may be applicable to other types of networks,

especially as network theorists attempt to develop a general framework for

understanding network dynamics.

MATERIALS & METHODS

ATOMIC RESOLUTION PROTEIN FOLDING SIMULATIONS AND NETWORKS.

Ref (4) describes the use of the MSMBuilder package

(https://simtk.org/home/msmbuilder/) (10) to construct an MSM with 10,000

microstates for the villin headpiece (HP-35 NleNle). This model was based on ~450

all-atom, explicit solvent simulations, each up to 2 μs in length, for a total simulation

time of 354 μs (58). While the longest timescale transitions in the model from Ref (4)

were found to be Markovian, implying memory-less transitions between metastable

states, not every state was metastable. We used MSMBuilder to lump kinetically

related microstates into 500 metastable macrostates to ensure a direct correlation

between states in the MSM and free energy basins, as described in the SI. This is

equivalent to common experimental analyses in which the potential is smoothed and

the friction is rescaled. We note, however, that the free energy landscape for this

system is actually a hierarchy of basins so it is possible to build many valid MSMs

with different numbers of states. As a result, one would not expect there to be exactly

500 experimentally detectable states. Regardless of the resolution at which one

examines this hierarchy, however, requiring that each state is metastable ensures that

they are directly related to a free energy basin. Thus, our networks of metastable states

are an important step beyond previously described networks, which often used simpler

approximations to define state boundaries and the transition rates between states (17,

95, 110, 124, 125). An additional 40,000 simulations, each up to 400 ns in length (for

a total simulation time of 14 milliseconds), were also assigned to this MSM to explore

the effect of using more simulations.
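
The basic step of turning state-assigned trajectories into an MSM can be sketched as counting transitions at a chosen lag time and normalizing each row. This is a generic illustration with hypothetical function names and toy trajectories; it is not the MSMBuilder interface, and production estimators typically also enforce properties such as detailed balance.

```python
def count_transitions(trajectories, n_states, lag):
    """Count i -> j transitions separated by `lag` frames (sliding window)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for a, b in zip(traj, traj[lag:]):
            counts[a][b] += 1
    return counts


def row_normalize(counts):
    """Convert counts to transition probabilities.

    Rows with no observed counts are left as zeros in this sketch.
    """
    T = []
    for row in counts:
        total = sum(row)
        T.append([c / total for c in row] if total else [0.0] * len(row))
    return T


# Two short, hypothetical trajectories of state indices.
toy_trajs = [[0, 0, 1, 1, 0], [1, 1, 1, 0]]
toy_counts = count_transitions(toy_trajs, n_states=2, lag=1)
toy_T = row_normalize(toy_counts)
```

Because the counts from all trajectories are pooled, many short simulations contribute to a single model even though no individual trajectory visits every state.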


Preliminary results for a 39 residue fragment of NTL9 are based on an MSM

built from ~1.5 milliseconds of simulation in implicit solvent with a different force

field (93). Similarities between these two systems thus suggest our results are not a

force field artifact.

SIMPLE MODELS.

We have designed three simple networks, depicted in Figure 16, that capture the

essence of various protein folding mechanisms. Each of these models has six

metastable states with approximately the same equilibrium and transition probabilities

so that differences between their behaviors may be attributed to differences in their

topologies (see Appendix D for details).

The first model (S) has a single folding pathway. This model is a natural

extension of the common U↔I↔N model (97, 126) and is often used to justify the

expense of running long simulations as shorter ones could fail to reach every state.

The second model (P) has parallel folding pathways. Parallel folding pathways

have been proposed for a number of systems (58, 98, 99, 109). In addition, this model

emphasizes the need to observe numerous folding and unfolding transitions to obtain

sufficient statistics on the entire process. The increased connectivity relative to S also

results in faster timescales.

The third model (H) has a heterogeneous unfolded state—multiple unfolded

basins that each interconvert more rapidly with the native state than with one another.

Thus, there is no kinetic decomposition of this model into two states, one folded and

one unfolded. This model was inspired by a growing body of work on the presence of

deep minima and gutters in unfolded regions of conformational space (114, 115, 127-

129).

CHAPTER 5: ATOMISTIC FOLDING SIMULATIONS OF THE FIVE HELIX

BUNDLE PROTEIN LAMBDA6-85

This chapter is in preparation as: Bowman GR, Voelz VA, Ensign DL, & Pande VS

(2010) Atomistic folding simulations of the five helix bundle protein λ6-85.

ABSTRACT

Understanding protein folding is a long-standing problem with important medical

applications, such as elucidating the role of protein misfolding in diseases like

Alzheimer’s. Solving the folding problem will ultimately require a combination of

theory and experiment, with theoretical models providing an atomically-detailed

picture of both the thermodynamics and kinetics of folding and experimental tests

grounding these models in reality. However, modeling long timescale dynamics (e.g.

microseconds, milliseconds, and beyond) with sufficient statistical accuracy and

chemical detail to make a quantitative connection with experiments is extremely

challenging. Here we report significant progress in this direction: an atomistic model

of the folding of an 80-residue fragment of the λ repressor protein with explicit solvent

that captures dynamics on 10 millisecond timescales. This advance greatly increases

the common ground accessible to both theory and experiment (both in terms of system

size and long timescales) and leads to a number of predictions that warrant further

experimental tests. For example, our model’s native state is a kinetic hub and

biexponential kinetics arise from the presence of many free energy basins separated by

barriers of different heights rather than a lack of barriers (the previously proposed

downhill scenario).

INTRODUCTION

Understanding protein folding is a long-standing problem with important medical

applications, such as elucidating the role of protein misfolding in diseases like

Alzheimer’s. Solving the folding problem will ultimately require a combination of

theory and experiment, with theoretical models providing an atomically-detailed

picture of both the thermodynamics and kinetics of folding and experimental tests

grounding these models in reality. However, modeling long timescale dynamics (e.g.

microseconds, milliseconds, and beyond) with sufficient statistical accuracy and

chemical detail to make a quantitative connection with experiments is extremely

challenging. Much progress has been made with small, fast-folding proteins but can

the methods used scale to larger, slower systems? Here we report significant progress

in this direction: an atomistic model of the folding of an 80-residue fragment of the λ

repressor protein with explicit solvent that captures dynamics on a 10 millisecond

timescale.

This advance builds on a growing body of work on describing molecular

kinetics with Markov State Models (MSMs). MSMs are essentially maps of a

molecule’s conformational space (1-3, 6). However, instead of having towns

connected by roads labeled with speed limits, MSMs have metastable states (sets of

rapidly interconverting conformations) connected by edges giving the probability of

going from one state to another. One can exploit the kinetic definition of states in an

MSM to scale from high-resolution models capable of quantitative agreement with

experiments to low-resolution models that provide an intuition for the system. In

addition, one can break up slow processes like protein folding into many small steps

that can be studied with short, parallel simulations.

The proteins studied with MSMs to date have generally been small and fast

folding (see Refs (3) and (2) for reviews). For example, we have built a model for a

35-residue mutant of the villin headpiece (4) that folds on the μs timescale. The native

state of this model (i.e. lowest free energy state) was within 1.8 Å of the crystal

structure, an important achievement given that all the simulations used to build the

model started from unfolded conformations. Noé et al. have built an MSM for a Pin

WW domain (5) (34 residues, μs folding time) and Voelz et al. have built an MSM for

a 39-residue fragment of NTL9 (93) (the first millisecond folder to be modeled with

MSMs). The ability of these models to predict structures, thermodynamics, and rates

indicates they should be capable of predicting any experimental observable, since all

are functions of these properties.

To test whether the MSM approach can scale to larger systems, we have

applied it to the D14A mutant of an 80-residue fragment of the λ repressor protein

(72). Full length λ repressor is a 236-residue protein capable of dimerizing and

binding to DNA, maintaining the λ phage in the lysogenic state and regulating its own

expression. Figure 21A shows the crystal structure of a 92-residue fragment that can

still dimerize and bind to DNA (130, 131). Based on this structure, Huang and Oas

selected an 80-residue fragment (λ6-85) that favors the monomeric state (Figure 21B),

making it appropriate for folding studies (132). This fragment was one of the first sub-

millisecond timescale folders to be discovered. Subsequently, a number of mutants of

λ6-85 have been found to fold on faster timescales (72, 133-135). The D14A mutant is

one of the fastest folders, having an approximately 2 μs molecular phase and an

approximately 10 μs activated phase (72). These timescales have been attributed to

downhill (or barrierless) and two-state folding, respectively.

Figure 21. (A) The crystal structure of the λ1-92 dimer bound to DNA (PDB code 1LMB). (B) A model

of λ6-85 with the Trp22-Tyr33 pair monitored in T-jump experiments space-filled.

The fast timescales reported for D14A make it a prime candidate for atomistic

molecular dynamics simulations combined with MSMs, which can now capture

millisecond timescales (93). We have run 3,265 trajectories with explicit solvent at

370 K. Each one is up to 1 μs in length, for an aggregate of 1.3 milliseconds of

simulation. These simulations were started from six initial configurations drawn from

replica exchange simulations in implicit solvent (136). One is native-like, three are

partially unfolded, and two have β-sheets. A more detailed description of our

simulations is given in Appendix E. We then constructed a high-resolution MSM with

30,000 microstates that is appropriate for making quantitative connections with

experiments. A low-resolution model with 5,000 macrostates was created from the

high-resolution MSM to facilitate interpretation of the model. More details on our use

of the MSMBuilder package (10) to construct these models are given in Appendix E.

While no single trajectory visits every state, these MSMs are able to capture long

timescale dynamics by exploiting overlap between our simulations to stitch them

together in a physically and statistically meaningful way. Examination of the implied

timescales of the microstate MSM shows that a 5 ns lag time yields Markovian

behavior (Figure 65).
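
The implied timescale test can be illustrated with the standard relation t = -τ / ln(λ), where λ is an eigenvalue of the transition matrix estimated at lag time τ. The sketch below uses a hypothetical two-state matrix, for which the non-unit eigenvalue is simply the trace minus one; a real model would require a numerical eigendecomposition.

```python
import math


def implied_timescale(eigenvalue, lag_time):
    """Implied timescale t = -lag_time / ln(eigenvalue), for 0 < eigenvalue < 1."""
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must lie strictly between 0 and 1")
    return -lag_time / math.log(eigenvalue)


def slow_eigenvalue_2x2(T):
    """A 2x2 row-stochastic matrix has eigenvalues 1 and trace(T) - 1."""
    return T[0][0] + T[1][1] - 1.0


def matmul_2x2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Hypothetical two-state model at a 5 ns lag time.
T_5ns = [[0.9, 0.1], [0.2, 0.8]]
# If the dynamics are Markovian, the model at 10 ns is the square of T_5ns.
T_10ns = matmul_2x2(T_5ns, T_5ns)

t_5 = implied_timescale(slow_eigenvalue_2x2(T_5ns), lag_time=5.0)
t_10 = implied_timescale(slow_eigenvalue_2x2(T_10ns), lag_time=10.0)
```

For exactly Markovian dynamics the two estimates coincide (about 14 ns here). With simulation data, the lag time at which the implied timescales level off is taken as evidence of Markovian behavior.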


Analysis of our high-resolution MSM reveals the presence of 10 millisecond

timescales. These timescales are preserved in an independent dataset run at 300 K and

subsamples of the 370 K dataset (Figure 66 and Figure 67), indicating that they are a

robust feature of the simulated system. Do these slow timescales reflect inadequacies

in the simulation parameters (the force field)? For example, λ repressor’s folding time

is known to be sensitive to solvent viscosity (137), so small errors in our

parameterization could easily affect our predicted rates. Or could the experimental

probes and techniques used to date be insensitive to these long timescales? One might

expect D14A, with its sizeable hydrophobic core, to fold on slower timescales given

that the wild-type villin headpiece (which is less than half the size of D14A and barely

has a hydrophobic core) is also reported to fold in just under ten μs (73).

To explore these possibilities we mapped out the 10 millisecond

timescale conformational rearrangement. Analysis of our coarse-grained MSM reveals

that this slow timescale corresponds to exchange between a compact β-sheet structure

and the crystal structure through multiple parallel pathways (Figure 68 and Figure 69).

Figure 22 shows a representative pathway between these states from our high-

resolution MSM. First, the compact β-sheet structure expands, breaking apart the β-

sheets. Then helices 1 and 4 begin to form, followed by collapse into a native-like

topology. Finally, the remaining helices form. As in a previous study (138), more

conventional projections of the free energy landscape were less informative (Figure 70

and Figure 71).

Figure 22. One of the 10 millisecond timescale pathways labeled with pfold values (the probability of

reaching state H before state A).
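
The pfold values in such a figure are committor probabilities, which satisfy q_i = Σ_j T_ij q_j with q fixed at 0 in the starting state and 1 in the target state. The sketch below solves this by simple iteration for a hypothetical four-state chain; the function name and the matrix are invented for illustration and are not the λ6-85 model.

```python
def committor(T, source, sink, tol=1e-12, max_iter=1_000_000):
    """Probability of reaching `sink` before `source` from each state.

    Iterates q_i = sum_j T[i][j] * q_j with the boundary conditions
    q[source] = 0 and q[sink] = 1 held fixed.
    """
    n = len(T)
    q = [1.0 if i == sink else 0.0 for i in range(n)]
    for _ in range(max_iter):
        new = [
            q[i] if i in (source, sink)
            else sum(T[i][j] * q[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, q)) < tol:
            return new
        q = new
    raise RuntimeError("committor iteration did not converge")


# Hypothetical linear chain A(0) <-> 1 <-> 2 <-> H(3) with symmetric hopping.
T_chain = [
    [0.75, 0.25, 0.00, 0.00],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.00, 0.00, 0.25, 0.75],
]
pfold = committor(T_chain, source=0, sink=3)
```

For this symmetric chain pfold rises linearly along the path (0, 1/3, 2/3, 1), which is the behavior expected for unbiased hopping.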

The prediction of β-sheet states in the unfolded ensemble under folding

conditions is somewhat surprising for a helical protein, especially since they are well

populated (Figure 72). However, experiments have shown that the unfolded and

denatured states of many systems can have significant populations of compact, β-sheet

structures yet still display the random coil statistics characteristic of expanded

conformations (139, 140). Thus, our prediction of compact, β-sheet structures is not

unreasonable.

As a further test we used our MSM to model the relaxation of a surrogate for

the Trp22-Tyr33 quenching interaction measured in T-jump experiments and a more

global metric, the Cα RMSD to the crystal structure (Figure 73). Both have

biexponential relaxation—a characteristic of D14A that has been used to argue that it

is a downhill folder—but the molecular phase is about two orders of magnitude slower

than in experiment (1 millisecond versus 10 μs). However, ignoring simulations

started from β-sheet structures yields better agreement (Figure 74). First, the Trp22-

Tyr33 surrogate has a 1 μs molecular phase and a 4.3 μs activated phase, in reasonable

agreement with the experimental values of 2 and 10 μs. Secondly, the RMSD now

relaxes on different timescales, consistent with observed probe dependent kinetics

(141, 142). Projections of the free energy onto a kinetically meaningful reaction

coordinate (pfold (51)) are not purely downhill, but could be consistent with incipient

downhill folding along parallel pathways (Figure 75). Incipient downhill folding is a

scenario in which a barrier is present but is sufficiently low that its peak is well

populated; therefore, one observes downhill folding (a molecular phase) from the

barrier top and two-state folding (an activated phase) across the barrier.

Based on these results, we cannot conclusively determine whether the stability

of the β-sheet states is a force field artifact or a feature of D14A not yet detected by

experiments. It is possible that short T-jumps simply cannot reach the β-sheet states.

Fully resolving this issue will likely require more experiments and more points of

comparison between theory and experiment. Regardless of the outcome, it is exciting

that MSMs built from atomistic simulations can now capture 10 millisecond

timescales.

The crystallographic state (Figure 22H, probability ~0.09) is not the native

(most stable) state in our model. The native state in our model (Figure 22G,

probability ~0.44) differs from the crystallographic state in that helix five is unraveled

and packed against the side of the protein. This observation is consistent with both the

negligible helical propensity in helix five reported by Agadir (143) (Figure 76) and the

context of this helix in the original crystal structure (Figure 21A), where it is extended

by seven residues. These extra residues form important contacts between the two

members of the dimer that could stabilize helix five. Truncating the sequence to

favor the monomer could lead to a lack of structure in the remaining residues of helix

five, resulting in a strong propensity to fill in the hydrophobic cavity normally

occupied by the corresponding helix in the other member of the dimer or adopt one

of a number of other well-populated, unstructured conformations (Figure 72). Further

support for this observation comes from the fact that a crystal structure of λ6-85 has

high B-factors in helix five (135) and the stability of this system seems to be

insensitive to mutations in this helix (136). Similar results were also found in a Gō

model study, where helix five tended to un-dock from the rest of the protein (138).

However, Gō models do not include non-native interactions, so helix five was not

found to unravel or pack against the protein. The behavior of a variational model (144)

and a diffusion-collision model (145) also differs from that found here due to the lack

of non-native interactions. However, the diffusion-collision model is similar in nature

to our MSM approach in its use of states and rates. Helix five was also found to be

unstable in replica exchange simulations with implicit solvent (136).

Our MSM for D14A is also consistent with previous reports of native hubs (16,

146). A first hint of this comes from the large number of connections to our native

state (Figure 23). The native state in our model makes direct connections to 98% of

the non-native states while non-native states only connect to 0.1% of the other states

on average. Moreover, the MFPTs to the native state are typically ~10 times faster

than the MFPTs between non-native states, as shown in Figure 24. Therefore,

molecules in non-native states can generally fold faster than they can transition to

other non-native states. The fastest way to transition between two randomly selected

non-native states is then to fold and unfold.

Figure 23. The 500 most populated macrostates with sizes proportional to their free energies and

connections between states if transitions between them occurred in our simulations. The native

state (green state with green connections) is a hub. The crystallographic state from Figure 22H is

blue, the compact β-sheet state from Figure 22A is red, and the remaining states are yellow. All of

these states have smaller equilibrium populations and fewer connections than the native state.

Figure 24. Distributions of mean first passage times (MFPTs) between sets of microstates (A) without

weighting the distribution and (B) weighting each MFPT by the equilibrium probability of the

starting state. The solid line is the distribution of MFPTs from non-native to native microstates and

the dashed line is the distribution of MFPTs between non-native states. The average MFPT from

non-native states to native ones is about 10 times faster than that between non-native states in (A)

and the difference is even greater in (B). Native microstates were defined as those in the most

populated macrostate. All other microstates were considered non-native.

This hub model presents an alternative to the two-state and downhill models

often used to describe protein folding and interpret experiments. Rather than having a

single dominant barrier or no barrier at all, the hub model has many metastable states

separated by barriers of different heights and numerous unfolded basins that

interconvert more rapidly with the native state than one another. Therefore, there are

many parallel folding pathways. We have already shown that MSMs with native hubs

can predict the dominant two-state behavior and burst phase kinetics of other systems

(16). Here we show that MSMs with native hubs can also predict the biexponential

relaxation of D14A that has previously been attributed to downhill (or barrierless)

folding (72, 147). Our previous work proposed that the native hub results from non-

negligible non-native contacts, which must be broken in order to fold (16, 146). Figure

22 demonstrates this behavior in our model of D14A. Testing the hub model will

require more experiments on the unfolded state under native conditions (rather than at

high temperature or in the presence of denaturant, where the unfolded ensemble is

likely more diffuse).

CONCLUSIONS

The combination of simulations and MSMs can now access ~10 millisecond

timescales for moderately large (~80 residue) systems, greatly increasing the common

ground between theory and experiment. The ability of our MSMs to capture

biexponential kinetics also indicates that proteins previously designated as downhill

folders may actually have many barriers of differing heights. In addition, our model

leads to a number of predictions for D14A: 1) current experiments may be failing to

detect processes on 10 millisecond timescales, 2) there may be significant β-sheet

structure in the unfolded ensemble under native conditions, 3) helix five may unfold

and fill a hydrophobic pocket in the native state and lack structure in other well

populated states, and 4) the native state may act as a kinetic hub. Our ability to

reconcile these observations with existing experiments suggests that more

experimental data will be necessary to provide a detailed description of how D14A

folds. We suggest that MSMs could be used to help design such experiments and lead

to important new insights into folding or, at the very least, provide more data for

refining existing force fields and improving the agreement between theory and

experiment.

CHAPTER 6: ENHANCED MODELING VIA NETWORK THEORY: ADAPTIVE

SAMPLING OF MARKOV STATE MODELS

This chapter was taken from: Bowman GR, Ensign DL, & Pande VS (2010) Enhanced

modeling via network theory: adaptive sampling of Markov state models. J Chem

Theory Comput 6:787-794.

ABSTRACT

Computer simulations can complement experiments by providing insight into

molecular kinetics with atomic resolution. Unfortunately, even the most powerful

supercomputers can only simulate small systems for short timescales, leaving

modeling of most biologically relevant systems and timescales intractable. In this

work, however, we show that molecular simulations driven by adaptive sampling of

networks called Markov State Models (MSMs) can yield tremendous time and

resource savings, allowing previously intractable calculations to be performed on a

routine basis on existing hardware. We also introduce a distance metric (based on the

relative entropy) for comparing MSMs. We primarily employ this metric to judge the

convergence of various sampling schemes but it could also be employed to assess the

effects of perturbations to a system (e.g. determining how changing the temperature or

making a mutation changes a system’s dynamics).

INTRODUCTION

Molecular dynamics simulations are a powerful means of understanding both the

thermodynamics and kinetics of molecular processes like protein folding and

conformational changes. Unfortunately, such processes are highly sensitive to the

underlying chemical details. For example, point mutations in the amino acid sequence

of a protein may have significant effects on its kinetics (147) and a small number of

point mutations can even drastically change the native structure (148). Thus, atomistic

simulations are required to make quantitative connections with experiments (149,

150).

Advances in computing have made it possible to rapidly generate huge data

sets even at this level of chemical detail (79, 151); however, these data sets are still

insufficient. A typical computer can only simulate ~5 nanoseconds/day of protein

folding and would thus take over 500 years to simulate one millisecond, a typical

protein folding time. Whether one is interested in dynamics or merely

equilibrium probabilities, a kinetic perspective on this problem that explicitly

considers the rate of equilibration reveals that metastability, or the presence of long-

lived states that act as “traps”, is a common source of inefficiency.
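The 500-year estimate quoted above can be checked with a short worked calculation, assuming 365 days per year:

```latex
\[
\frac{1\ \text{ms}}{5\ \text{ns/day}}
= \frac{10^{6}\ \text{ns}}{5\ \text{ns/day}}
= 2 \times 10^{5}\ \text{days}
\approx 548\ \text{years}
\]
```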

One approach to dealing with this issue is to make tremendous investments in

specialized software and hardware for generating long simulations (152). While

theoretically sound (153), this serial approach often only results in simulations that are

long relative to standard trajectories. However, a truly long simulation must be orders

of magnitude longer than the slowest relaxation time so that the probabilities of all

states and pathways can be estimated accurately. Even if such a simulation were

possible, the task of analyzing the data would still remain (152, 154). Moreover, serial

approaches are inherently inefficient, both due to parallelization overhead and, more

importantly, the fact that they waste hundreds of years of computing time waiting for

rare events.

A statistical approach provides a fundamentally different perspective on model

construction. Rather than attempting to generate one realization of an entire process,

one instead aims to generate an ensemble of events in parallel. For example, a number

of methods have been developed for exploiting statistical mechanics to simulate

protein folding more efficiently (69, 84, 155, 156). Most of these approaches rely on

the fact that in two-state protein folding, the waiting time for observing a transition is

exponentially distributed but the actual transition times are quite rapid (120). Thus,

proteins often fold much faster or slower than the average folding time. Such

approaches are amenable to commodity hardware and take far less wall-clock time

than a serial approach with an equivalent amount of sampling, particularly when

combined with grid computing (79). Unfortunately, these methods are generally only

applicable to two-state systems and may require simulations of an unknown minimum

length (121). Some multi-state generalizations exist (157) but quickly become

computationally intractable.

Markov State Models (MSMs) extend this work by providing a tractable,

multi-state scheme that allows efficient modeling of any system exhibiting

metastability (9). An MSM is a network with nodes corresponding to metastable states

and edges describing the rates of transitioning between pairs of states, akin to a map

with cities connected by roads labeled with speed-limits. Rather than attempting to

generate one realization of an entire process, one can exploit the decomposition of

conformational space into multiple metastable states to gather statistics on each step of

the process independently, allowing a problem to be broken up into more manageable

and trivially parallelizable pieces.

Mathematically, MSMs are represented as transition probability matrices, with

the entry in row i and column j giving the probability of transitioning from state i to

state j within a time interval called the lag time of the model. Building MSMs is a

challenging task but significant progress has been made over the past few years (3, 4,

6, 10), leading to freely available software for automatically constructing these models

(10). While MSMs could be used to analyze truly long simulations, their ultimate

value lies in their ability to facilitate efficient model construction by allowing precise,

parallel determination of the transition rates between states by running many short

simulations from each of them.

Adaptive sampling algorithms for MSM construction take this statistical

approach a step further (12, 19, 20). In adaptive sampling, one first obtains an initial

model of the entire process of interest by any means possible. One then iteratively

calculates the contribution of each step of the process to uncertainties in some

observable of interest via Bayesian statistics and runs numerous parallel simulations of

the steps that can lead to the greatest increases in precision until the desired level of

statistical certainty is achieved. Such an approach was recently shown to lead to

dramatic reductions in the statistical uncertainty in the observable of interest relative

to other refinement schemes (19).

However, a number of important questions remain to be answered. First, does

adaptive sampling improve the global model quality or just local components that are

important for the observable of interest? Exactly how much more efficient is adaptive

sampling? And finally, is adaptive sampling capable of discovering previously

unknown components of a model, or is it only able to refine the initial model it is

given?

In this work, we address these questions using an MSM for the villin headpiece

(HP-35 NleNle) that was recently constructed from atomistic simulations with explicit

solvent (4). We then move on to simple models, where the role of the network is clear,

to gain an intuition for our results and test whether such methods could be more

broadly applicable to a wide class of systems. These analyses rely on

a new distance metric for MSMs developed in Section 2.2, which should prove

generally useful for evaluating various sampling schemes and even assessing the

effects of perturbations to a system (like changes in temperature or even mutations).

THEORETICAL UNDERPINNINGS

ADAPTIVE SAMPLING.

In adaptive sampling approaches to MSM construction, simulations are run iteratively

to minimize uncertainties in some property of a model (12, 19, 20). In this work,

adaptive sampling is performed as follows:

1. perform N simulations of L steps from one or more chosen starting states

2. build an MSM only including those states identified so far

3. calculate the contribution of each state to uncertainty in the slowest kinetic rate

following Ref (19)

4. start N new simulations of L steps distributed amongst the states in proportion to

their contribution to uncertainty in the slowest rate

5. repeat steps 2-4 for some number of iterations
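Step 4 of the procedure above can be sketched as follows. This is a minimal illustration, not the MSMBuilder implementation: the function name `allocate_simulations`, the largest-remainder rounding rule, and the example contributions are all assumptions introduced here. The per-state uncertainty contributions themselves would come from the analysis of Ref (19), which is not reproduced.

```python
def allocate_simulations(contributions, n_sims):
    """Split n_sims among states in proportion to their uncertainty contributions.

    Largest-remainder rounding is an illustrative choice (not taken from the
    text) that guarantees the integer counts sum to n_sims.
    """
    total = sum(contributions)
    if total <= 0 or min(contributions) < 0:
        raise ValueError("contributions must be non-negative with a positive sum")
    quotas = [n_sims * c / total for c in contributions]
    counts = [int(q) for q in quotas]  # floor, since quotas are non-negative
    leftover = n_sims - sum(counts)
    # Hand the remaining simulations to the states with the largest fractional parts.
    by_remainder = sorted(range(len(quotas)),
                          key=lambda i: quotas[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical contributions of four states to uncertainty in the slowest rate.
new_starts = allocate_simulations([2, 1, 1, 0], n_sims=5)
print(new_starts)
```

A state with zero contribution receives no new simulations, while the remaining simulations are spread over several states rather than all being started from the single most uncertain one.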

All the MSMs in this work were constructed and analyzed with the

MSMBuilder package (which is freely available at https://simtk.org/home/msmbuilder/)

(10) modified such that transition count matrices were not symmetrized by counting

the transitions that would have been observed if one watched each simulation

backwards.

We note that in the past, simulations in each round of adaptive sampling were

all started from the same initial state (the one contributing most to uncertainty in the

quantity of interest) (19). The intuition behind our alteration was that as the number of

simulations (N) becomes large, starting all the simulations from one state would be

excessive as fewer would be sufficient to drastically reduce the uncertainty. Instead, it

would be preferable to allocate some of these excess simulations to reduce

uncertainties in other states’ transition probabilities. Indeed, we have found that our

modified procedure yields better results for sufficiently large N on reasonably

complex networks and gives equivalent results for simple networks and small N.

To demonstrate the utility of this algorithm, we carried out adaptive sampling

with synthetic trajectories generated from transition count matrices. To generate

synthetic simulations from a transition count matrix we first normalize each row to

obtain a transition probability matrix. At each time step (or each lag time), the next

state is chosen according to the distribution of transition probabilities for the current


state. The prior described below is not used for these calculations, so the matrices used

to generate trajectories tend to be sparse.
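The synthetic trajectory generation described above can be sketched in a few lines. The 3-state count matrix, the function names, and the fixed random seed below are hypothetical and serve only to illustrate row normalization followed by state-by-state sampling; no prior is applied, so zero entries stay zero.

```python
import random


def row_normalize(counts):
    """Convert a transition count matrix into a row-stochastic probability matrix."""
    probs = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("state with no observed transitions")
        probs.append([c / total for c in row])
    return probs


def synthetic_trajectory(probs, start, n_steps, rng):
    """Sample n_steps transitions (one per lag time) starting from state `start`."""
    states = list(range(len(probs)))
    traj = [start]
    for _ in range(n_steps):
        current = traj[-1]
        # Draw the next state from the current state's row of probabilities.
        traj.append(rng.choices(states, weights=probs[current], k=1)[0])
    return traj


# Hypothetical sparse count matrix: states 0 and 2 are never directly connected.
toy_counts = [[90, 10, 0],
              [5, 80, 15],
              [0, 20, 80]]
probs = row_normalize(toy_counts)
traj = synthetic_trajectory(probs, start=0, n_steps=20, rng=random.Random(0))
print(traj)
```

Because the matrix is used without pseudo-counts, a synthetic trajectory can never make a transition that was absent from the count matrix.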

QUANTIFYING THE SIMILARITY BETWEEN MSMS.

In order to monitor the convergence of any sampling scheme, it is important to first

develop a similarity metric that is capable of measuring the global quality of a test

model relative to some reference model. Such a metric would also have broad

usefulness, as there are several reasons for comparing MSMs quantitatively. For

example, this metric could be used to compare MSMs generated by two different

simulation methods allowing one to directly compare the resulting dynamics.

Alternatively, one could compare MSMs generated for two somewhat different but

related systems, such as two point mutants of a given protein.

We have developed such a distance metric for MSMs that is based on the

relative entropy, which is a common measure of the distance between two probability

distributions in information theory (158) with important physical implications (159).

The relative entropy between two normalized distributions P and Q, over a common

set of outcomes, is

$$
D(P \| Q) = \sum_{i} P_i \log \frac{P_i}{Q_i}
$$

where Pi is the probability of outcome i, P is a reference distribution, and Q is some

test distribution.

An MSM consists of one normalized distribution per state, which gives the

probability of transitioning to each other state within one lag time. We define the

relative entropy between a reference and test MSM, with transition matrices P and Q

respectively, as

$$
D(P \| Q) = \sum_{i,j}^{N} P_i P_{ij} \log \frac{P_{ij}}{Q_{ij}} \tag{6.1}
$$

where Pi is the equilibrium probability of state i, Pij is the probability of transitioning

from state i to state j during one lag time, and N is the number of states. Intuitively,

our relative entropy metric is the sum of the relative entropies between the transition

probability distributions for each state weighted by their stationary probabilities.
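As a concrete illustration of Eq. 6.1, the sketch below evaluates the metric for two small transition matrices. The two-state matrices, the function names, the use of power iteration for the stationary distribution, and the choice of the natural logarithm are assumptions made for this example; the text does not specify a log base.

```python
import math


def stationary_distribution(P, n_iter=100000, tol=1e-14):
    """Approximate the stationary distribution of row-stochastic P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(new, pi)) < tol
        pi = new
        if converged:
            break
    return pi


def msm_relative_entropy(P, Q):
    """Eq. 6.1: sum_ij pi_i P_ij log(P_ij / Q_ij), pi being the stationary distribution of P."""
    pi = stationary_distribution(P)
    total = 0.0
    for i in range(len(P)):
        for p_ij, q_ij in zip(P[i], Q[i]):
            if p_ij == 0.0:
                continue  # a term 0 * log(0 / q) is taken to be 0
            if q_ij == 0.0:
                return math.inf  # reference allows a transition the test model forbids
            total += pi[i] * p_ij * math.log(p_ij / q_ij)
    return total


# Hypothetical reference (P_ref) and test (Q_test) models.
P_ref = [[0.9, 0.1], [0.4, 0.6]]
Q_test = [[0.8, 0.2], [0.3, 0.7]]
print(msm_relative_entropy(P_ref, Q_test))
print(msm_relative_entropy(P_ref, P_ref))
```

The metric is zero when the test model equals the reference and becomes infinite if the test model assigns zero probability to a transition that the reference allows.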

One may derive our relative entropy metric for MSMs more formally by

considering that the entropy (H) of a sample path of a stochastic process, normalized

by its length, is also called the entropy rate. An important theorem in information

theory is the following:

Theorem. For an ergodic stochastic process X1, …, Xn

$$
\lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n) = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1)
$$

For a Markov Chain, the right hand side takes a very simple form, because the

conditional entropy only depends on the previous step, which converges to the

stationary distribution.

In the following, we prove a similar statement for the relative entropy between

the paths of two Markov chains as n goes to infinity. For two Markov chains p and q

with state space Ω, we would like to compute:

$$
\lim_{n \to \infty} \frac{1}{n} D\big(p(X_1, \ldots, X_n) \,\|\, q(X_1, \ldots, X_n)\big)
$$

For simplicity, let us define lowercase xn = X1, …, Xn. Then, by the

chain rule for the relative entropy, we get:

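$$
\lim_{n \to \infty} \frac{1}{n} \Big[ D\big(p(x_{n-1}) \,\|\, q(x_{n-1})\big) + D\big(p(X_n \mid x_{n-1}) \,\|\, q(X_n \mid x_{n-1})\big) \Big] \tag{6.2}
$$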

Eq. 2.65 in Cover & Thomas (160) defines the conditional relative entropy

above as the expectation of the relative entropy between the conditional distributions

of Xn given xn-1, with respect to the distribution of xn-1. This means that:

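$$
\begin{aligned}
D\big(p(X_n \mid x_{n-1}) \,\|\, q(X_n \mid x_{n-1})\big)
&= \sum_{y} p(x_{n-1} = y)\, D\big(p(X_n \mid x_{n-1} = y) \,\|\, q(X_n \mid x_{n-1} = y)\big) \\
&= \sum_{Y} p(X_{n-1} = Y)\, D\big(p(X_n \mid X_{n-1} = Y) \,\|\, q(X_n \mid X_{n-1} = Y)\big)
\end{aligned}
$$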

where we have grouped terms with the same final state in the "history" y, which have

the same relative entropy factor, and summed their probabilities to obtain the marginal

probability over Xn-1.

Repeating the step that led to Eq. 6.2 many times yields:

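$$
\lim_{n \to \infty} \frac{1}{n} \Big[ \sum_{m=2}^{n} D\big(p(X_m \mid x_{m-1}) \,\|\, q(X_m \mid x_{m-1})\big) + D\big(p(X_1) \,\|\, q(X_1)\big) \Big]
$$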

If the initial state is deterministic, the last term is just zero. As for the first

term, as n goes to infinity, the distribution of Xm-1 goes to the stationary distribution of

p, which we call μ. Then, using the equation for the conditional relative entropy,

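$$
\lim_{n \to \infty} D\big(p(X_n \mid x_{n-1}) \,\|\, q(X_n \mid x_{n-1})\big) = \sum_{Z} \mu(Z) \sum_{Y} p(Y \mid Z) \log\left[\frac{p(Y \mid Z)}{q(Y \mid Z)}\right]
$$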

Since the terms in the series converge to a limit, their Cesaro means

converge to the same limit, so:

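$$
\lim_{n \to \infty} \frac{1}{n} D\big(p(X_1, \ldots, X_n) \,\|\, q(X_1, \ldots, X_n)\big) = \sum_{Z} \mu(Z) \sum_{Y} p(Y \mid Z) \log\left[\frac{p(Y \mid Z)}{q(Y \mid Z)}\right]
$$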

The terms p(Y|Z) and q(Y|Z) are just the elements of the transition matrices of

p and q respectively, so this is equivalent to Eq. 6.1.
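The limit derived above can be checked numerically on a small example. The sketch below is an illustration under stated assumptions: the two-state chains `P` and `Q`, the deterministic start in state 0, the natural logarithm, and the path length of 12 are all choices made here. It enumerates every path explicitly, so it does not rely on the chain rule used in the derivation.

```python
import itertools
import math

# Hypothetical two-state chains; MU is the stationary distribution of P (MU P = MU).
P = [[0.9, 0.1], [0.4, 0.6]]
Q = [[0.8, 0.2], [0.3, 0.7]]
MU = [0.8, 0.2]


def path_probability(T, path):
    """Probability of a path under transition matrix T, given its first state."""
    prob = 1.0
    for a, b in zip(path, path[1:]):
        prob *= T[a][b]
    return prob


def path_relative_entropy(n, start=0):
    """D(p(X_1..X_n) || q(X_1..X_n)) by brute-force enumeration, with X_1 = start."""
    total = 0.0
    for tail in itertools.product(range(len(P)), repeat=n - 1):
        path = (start,) + tail
        p = path_probability(P, path)
        q = path_probability(Q, path)
        total += p * math.log(p / q)
    return total


def rate_from_eq_6_1():
    """Right-hand side of Eq. 6.1 for the chains above."""
    return sum(MU[i] * P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(2) for j in range(2))


# The per-step increment D_n - D_(n-1) is the n-th term of the series.
increment = path_relative_entropy(12) - path_relative_entropy(11)
rate = rate_from_eq_6_1()
print(increment, rate)
```

The per-step increment approaches the Eq. 6.1 value quickly, whereas the Cesàro mean (the path relative entropy divided by n) approaches the same limit more slowly because it still averages over the early, non-stationary steps.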

PRIOR FOR RELATIVE ENTROPY AND ADAPTIVE SAMPLING.

There is always some probability of transitioning between every pair of states, though

these probabilities may be low enough that no actual transitions are observed. To

account for this, as well as to reflect our lack of prior knowledge about the transition

probabilities, we add a pseudo-count of 1/N to every element of the transition count

matrix, where N is the number of states, before normalizing each row to find the

transition probability matrix, as in Refs (19, 161). The intuition behind this choice is

that for a state to exist we must observe at least one count in that state but before

observing any real data the probability of this count leading to any other state is equal.

From a Bayesian perspective, these pseudo-counts equate to a uniform prior. These

pseudo-counts also prevent the relative entropy metric from becoming infinite

whenever a zero is encountered in an MSM’s transition probability matrix. It is often

the case that certain transitions are not observed, so this correction is of great practical

importance.
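The pseudo-count prior described above can be sketched as follows. The 3-state count matrix and the function name are hypothetical; the only ingredient taken from the text is the addition of 1/N to every element before row normalization.

```python
def transition_matrix_with_prior(counts):
    """Add a pseudo-count of 1/N to every element, then normalize each row."""
    n = len(counts)
    pseudo = 1.0 / n  # N pseudo-counts of 1/N add exactly one count to each row
    probs = []
    for row in counts:
        padded = [c + pseudo for c in row]
        total = sum(padded)
        probs.append([c / total for c in padded])
    return probs


# Hypothetical counts: state 2 was discovered but has no observed outgoing transitions.
toy_counts = [[8, 2, 0],
              [1, 5, 0],
              [0, 0, 0]]
probs_with_prior = transition_matrix_with_prior(toy_counts)
for row in probs_with_prior:
    print(row)
```

Every entry is now strictly positive, so the relative entropy of Eq. 6.1 stays finite, and a state with no observed transitions falls back to a uniform row.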

VILLIN SIMULATIONS AND MSM.

The original ~450 villin simulations are described in detail in

Ref (58). In short, ~450 constant temperature molecular dynamics simulations with

explicit solvent and up to 2 μs in length were run from nine initial configurations

drawn from high temperature unfolding simulations at 373 K. Ref (4) describes the

construction of a 10,000 microstate MSM that faithfully reproduces the raw simulation

data. For the purposes of this work, we lumped these 10,000 microstates into 500

macrostates exhibiting metastability and having an equivalent Markov time (15 ns).

This lumping was done with the MSMBuilder package (10). The macrostates

containing the nine initial configurations used during the real simulations were used as

the starting points for adaptive sampling. Simulations of just 30 ns were used for

adaptive sampling.

SIMPLE MODELS.

The transition count matrices for simple models S and P (CS and CP respectively) are

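$$
C_S = \begin{pmatrix}
6{,}000 & 3 & 0 & 0 & 0 & 0 \\
3 & 1{,}000 & 3 & 0 & 0 & 0 \\
0 & 3 & 1{,}000 & 3 & 0 & 0 \\
0 & 0 & 3 & 1{,}000 & 3 & 0 \\
0 & 0 & 0 & 3 & 1{,}000 & 3 \\
0 & 0 & 0 & 0 & 3 & 90{,}000
\end{pmatrix}
$$

and

$$
C_P = \begin{pmatrix}
6{,}000 & 2 & 2 & 0 & 0 & 0 \\
2 & 1{,}000 & 0 & 2 & 2 & 0 \\
2 & 0 & 1{,}000 & 2 & 2 & 0 \\
0 & 2 & 2 & 1{,}000 & 0 & 2 \\
0 & 2 & 2 & 0 & 1{,}000 & 2 \\
0 & 0 & 0 & 2 & 2 & 90{,}000
\end{pmatrix}
$$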

where the entry in row i and column j gives the number of transitions observed from

state i to state j.

Mean first passage times were calculated following Ref (161). The mean first

passage times for S and P are ~13,000 and ~5,000 steps respectively. Other

equilibrium properties can be obtained by normalizing each row to obtain a transition

probability matrix and then solving for the eigenvalues and eigenvectors of this

matrix. For example, normalizing the first eigenvector (i.e., the one corresponding to

an eigenvalue of 1) gives the equilibrium probabilities of each state. Subsequent

eigenvalue/eigenvector pairs give kinetic rates and the states involved in these

transitions respectively (9). Once again, the MSMBuilder package (10) was used for

analysis of these models.
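The quantities quoted above can be reproduced with a short, self-contained sketch. It is not the MSMBuilder code or the procedure of Ref (161): the count matrices are entered as we read them from this section, the equilibrium populations are obtained by solving the stationarity equations directly rather than by eigendecomposition, and the mean first passage time (MFPT) is found from the standard linear system m_i = 1 + sum_j P_ij m_j with m_target = 0. All function names are hypothetical.

```python
C_S = [[6000, 3, 0, 0, 0, 0], [3, 1000, 3, 0, 0, 0], [0, 3, 1000, 3, 0, 0],
       [0, 0, 3, 1000, 3, 0], [0, 0, 0, 3, 1000, 3], [0, 0, 0, 0, 3, 90000]]
C_P = [[6000, 2, 2, 0, 0, 0], [2, 1000, 0, 2, 2, 0], [2, 0, 1000, 2, 2, 0],
       [0, 2, 2, 1000, 0, 2], [0, 2, 2, 0, 1000, 2], [0, 0, 0, 2, 2, 90000]]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def row_normalize(counts):
    """Convert counts to a row-stochastic transition probability matrix."""
    return [[c / sum(row) for c in row] for row in counts]


def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 (one balance equation replaced by normalization)."""
    n = len(P)
    A = [[P[j][i] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    A[-1] = [1.0] * n
    return solve(A, [0.0] * (n - 1) + [1.0])


def mfpt(P, source, target):
    """Mean number of steps to first reach `target` when starting from `source`."""
    keep = [i for i in range(len(P)) if i != target]
    A = [[(1.0 if i == j else 0.0) - P[i][j] for j in keep] for i in keep]
    times = solve(A, [1.0] * len(keep))
    return times[keep.index(source)]


P_S, P_P = row_normalize(C_S), row_normalize(C_P)
pi_S = stationary_distribution(P_S)
mfpt_S = mfpt(P_S, source=0, target=5)  # state 1 -> state 6 (zero-based indices)
mfpt_P = mfpt(P_P, source=0, target=5)
print([round(p, 3) for p in pi_S], round(mfpt_S), round(mfpt_P))
```

With the matrices as entered here, the MFPTs from state 1 to state 6 come out near 13,358 steps for S and 5,010 steps for P, in line with the ~13,000 and ~5,000 steps quoted above.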

Plots of the average relative entropy as a function of simulation number and

length were generated by running 600 simulations of 5,000 steps for each model.

Average relative entropies over 10 random samples of N trajectories from this pool

were then calculated and plotted. Similar plots for our adaptive sampling scheme were

also generated by averaging over 10 independent runs.


APPLICATION TO VILLIN MSM.

With these tools in place, we are now in a position to assess the efficacy of adaptive

sampling using a previously calculated MSM for the villin headpiece (4) as a model

system. In particular, we would like to assess two types of efficiency. First, given our

desire to push the envelope of what is possible in a reasonable amount of time, can

adaptive sampling reduce the wall-clock time necessary to achieve a given model

quality? Second, given our desire to mitigate negative impacts on the environment,

can adaptive sampling reduce the amount of resources (in this case computer time)

necessary to achieve a given model quality?

To address these questions we have performed adaptive sampling with a

variable number of simulations per iteration generated from our villin MSM. We then

assume each simulation progresses at a rate of 5 ns/day, a typical value for modern

personal computers, and compare the convergence of our adaptive simulations to the

gold-standard model from Ref (4) (that was validated by comparison to both the raw

simulation data and experiments) with the convergence of a single long reference

simulation to the same gold-standard. Convergence to the gold-standard model is

measured with our relative entropy metric for MSMs (described in Section 2.2).

Figure 25A shows that the wall-clock time efficiency of adaptive sampling

scales linearly up to 5,000 simulations per iteration. That is, adaptive sampling with N

simulations per iteration can reduce the wall-clock time necessary to achieve a given

model quality by a factor of N for N as high as 5,000. Using more simulations will

help but will only reduce the wall-clock time by a factor of αN, where α<1. The

crucial result, however, is that one can reduce a calculation that would take decades to

run with traditional methods to a calculation that can be run in a matter of days with

adaptive sampling.

Figure 25. Scaling for adaptive sampling of villin as the number of parallel simulations (N) used during

each round is varied. (A) Wall-clock time scaling as N is varied. The black line is a best fit to the

linear portion of the data (circles), which extends up to 5,000 simulations per iteration. (B)

Computer time required to achieve a given model quality (relative entropy) for various sampling

schemes. L refers to one long trajectory and the numbers refer to the number of parallel simulations

used in each iteration of adaptive sampling. All results come from averaging over ten independent

runs. Each step equates to 15 ns.

Adaptive sampling can also greatly reduce the resource requirements for

achieving a given model quality. For example, Figure 25B shows the computer time

necessary to achieve a given model quality for one long simulation and adaptive

sampling with a varying number of simulations per iteration. This figure shows that

adaptive sampling requires about half as much computer time to achieve the same

model quality as one long simulation. Once again, the relative efficiency of adaptive

sampling begins to fall off beyond some optimal number of simulations per iteration.

APPLICATION TO SIMPLE MODELS.

To gain an intuition for the applicability of adaptive sampling to other systems, we

have also applied it to two classic network topologies, shown in Figure 26A and

defined more thoroughly in Section 2.5. These models are representative of problems

with metastability, their equilibrium properties can be derived analytically and used as

an unambiguous reference, and truly long simulations are feasible.

Figure 26. (A) The two models, S and P. (B) Distance from the true model (measured via the relative

entropy) as a function of wall-clock time for adaptive sampling versus one long simulation of S

(assuming 5 steps/day to mimic 5 nanoseconds/day in protein folding simulations). The lines are

one long simulation (dashed line) and adaptive sampling with 10 simulations of 20 steps (solid

line), 10 simulations of 200 steps (dotted line), 100 simulations of 20 steps (dash-dot line), and

1000 simulations of 20 steps (black squares) per iteration.

Both models have states with approximately the same equilibrium and

transition probabilities, such that differences between their behaviors can be attributed

to differences between their topologies. More specifically, states 1-6 have equilibrium

populations of 6%, 1%, 1%, 1%, 1%, and 90% respectively. Drawing an analogy to

protein folding, state 1 is the unfolded state, state 6 is the folded state, and the

remaining states are intermediates. Thus, S has a single folding pathway and P has

parallel folding pathways.

The reduced connectivity in S results in longer timescale transitions relative to

P. In fact, the mean first passage time (MFPT) between states 1 and 6 is about three

times longer in S than in P, making S considerably harder to sample. In addition, such

linear models are often cited as a case where the holistic, long-trajectory approach is

absolutely necessary; nevertheless, adaptive sampling is able to learn the network

more efficiently than traditional approaches, as shown in Figure 26B. This figure

shows how closely various schemes can approach the true model for S given a set

amount of wall-clock time and starting from state 1 to mimic the practice of starting

protein folding simulations from an arbitrary conformation in the unfolded state.

To provide some intuition for our distance metric, Figure 27 shows the

evolution of the relative entropy and the estimated free energy of each state in S

during adaptive sampling. Adaptive sampling was carried out by running 10

simulations from state 1 and then repeatedly building an MSM and starting 10 new

simulations from the state contributing most to uncertainty in the slowest process.

Small jumps in the relative entropy are found each time a state with a low population

is discovered (or, equivalently, when a new path is discovered for this model) and a

very large jump is evident when the most populated state, state 6, is discovered. Slow

decay occurs between these jumps. Thus, our metric is most sensitive to state and path

discovery but still captures improvements in estimates of the transition probabilities

along known paths. Such behavior is desirable as models that miss important states or

paths should be penalized more than ones with imperfect transition probabilities.

Figure 27. Relative entropy (top) and free energy of each state in kcal/mol (bottom) as a function of the

adaptive sampling iteration on model S.

Figure 28 shows a more thorough comparison of adaptive sampling and

reference simulations with an equal amount of sampling for various numbers and

lengths of simulations. Evaluation of the reference simulations for both S and P

demonstrates that achieving a reasonable model quality by naively starting simulations

from state 1 requires simulations of some minimal length, though this minimal length

is shorter for P than S in terms of the absolute number of steps. Moreover, adaptive

sampling is able to gain valuable information from much shorter and fewer

simulations regardless of the topology of the network; that is, whether there is a single

folding pathway or multiple pathways. This figure also shows that adaptive sampling

generally benefits from using more parallel simulations but not longer ones. An

important point is that each data point in Figure 28B and Figure 28D depends on the

data points to its left. For example, to fill in the row corresponding to simulations of

length 100, ten independent adaptive sampling runs of 50 iterations were performed.

The first round of each adaptive sampling run was used to compute average relative

entropies for 1-10 simulations, the first and second round of each run (which depends

on the first round) for 11-20 simulations, and so forth. As a result, there is some

horizontal streakiness in these figures. We also note that adaptive sampling results in

smaller uncertainties in the relative entropies shown in Figure 28 (see Figure 77 and

Figure 78).

Figure 28. Distance from the true model (measured via the relative entropy) as a function of the number

and length of simulations averaged over 10 independent samples. (A) Reference distribution for S,

(B) adaptive sampling of S, (C) reference distribution for P, and (D) adaptive sampling of P. All

simulations for the reference distributions started from state 1. The first 10 simulations for adaptive

sampling started from state 1 and subsequent batches of simulations started from the state

contributing most to uncertainty in the slowest process. Black lines are contours of equal amounts

of data.

Finally, we find that the scaling of adaptive sampling of our simple networks is

similar to that found for villin, as shown in Figure 29. One noteworthy difference is

that our simple models saturate (i.e. fall short of linear scaling as additional parallel

simulations are run) earlier than villin. Comparison of the two simple models also

shows that S saturates before P. For S, adaptive sampling scales linearly up to 150

parallel simulations. For P, adaptive sampling scales linearly up to 500 simulations.

The improved scaling for P is the result of the increased complexity of the network

topology of P compared to S. Each node in P has more connections to learn and the

algorithm benefits from doing this in parallel. Indeed, the complexity of our villin

model is much greater than either of these simple networks and, as discussed

previously, villin scales linearly up to 5,000 simulations per iteration. Thus, we expect

that we can achieve linear scaling well beyond 5,000 simulations per iteration for

systems that are more complex than the villin MSM that we sampled from.

Figure 29. Scaling for adaptive sampling of our simple models as the number of parallel simulations (N)

used during each round is varied. (A) and (B) Wall-clock time scaling as N is varied for simple

models S and P respectively. The black line is a best fit to the linear portion of the data (circles).

(C) and (D) Computer time required to achieve a given model quality (relative entropy) for various

sampling schemes applied to S and P respectively. L refers to one long trajectory and the numbers

refer to the number of parallel simulations used in each iteration of adaptive sampling. All results

come from averaging over ten independent runs.

APPLICABILITY.

The adaptive sampling algorithm employed here was developed for application to

MSMs with metastable states. That is, it assumes that every state has a self-transition

probability greater than 0.5 such that a simulation in one state is more likely to stay

there than to transition to a new state. This property helps to ensure a separation of

timescales (fast intrastate transitions, slow interstate transitions) and, therefore, that

the model is Markovian because a simulation can lose memory of its previous state

before transitioning to a new one. Thus, the procedure for ab initio adaptive sampling

is: 1) run some initial simulations, 2) cluster all the simulation data into microstates, 3)

lump these microstates into metastable macrostates, 4) calculate the contribution of

each macrostate to uncertainties in the slowest rate (or some other observable), 5) start

new simulations from each state in proportion to its contribution to the overall

uncertainty, and 6) repeat steps 2-5 until the desired level of statistical certainty is

achieved. In the future it will be interesting to explore whether this adaptive sampling

algorithm is equally applicable to more fine grained divisions of conformational space

(e.g. at the microstate level) as the lumping stage would no longer be necessary. In

addition, recent work has shown that more fine grained MSMs are better for obtaining

quantitative predictions of experimental observables (4, 5, 15), so it could be

advantageous to do refinement at this level.
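Step 5 of the procedure above can be sketched as follows. This is a minimal Python illustration that assumes the per-state contributions to the uncertainty have already been computed elsewhere; the function name and the largest-remainder rounding rule are our own choices and are not prescribed by the method.

```python
def allocate_simulations(contributions, n_sims):
    """Split n_sims new simulations across states in proportion to each
    state's contribution to the overall uncertainty.

    Uses largest-remainder rounding so that the counts sum to n_sims.
    """
    total = sum(contributions)
    if total <= 0:
        raise ValueError("contributions must sum to a positive value")
    quotas = [n_sims * c / total for c in contributions]
    counts = [int(q) for q in quotas]  # round each quota down first
    leftover = n_sims - sum(counts)
    # Hand out the remaining simulations to the largest fractional parts.
    order = sorted(range(len(quotas)),
                   key=lambda i: quotas[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical uncertainty contributions for three macrostates.
    print(allocate_simulations([6, 3, 1], n_sims=10))
```

Starting all new simulations from the single state with the largest contribution, as was done for the simple models above, corresponds to the limiting case in which one entry dominates.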

The relative entropy metric assumes that the two models being compared have

the same state-space. Comparing two simulation data sets therefore requires the

following steps: 1) define a state space common to both data sets (i.e. by using both

data sets for clustering to define microstates and, optionally, lumping to define

macrostates), 2) compute transition probability matrices for each data set

independently, and 3) compute the relative entropy between these matrices.
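Steps 2 and 3 can be sketched in Python as follows. The sketch assumes that the relative entropy takes the population-weighted form D = sum_i w_i sum_j P_ij ln(P_ij / Q_ij), where the weights w_i are the state populations of the reference model; the precise definition used in this work is given where the metric is introduced and may differ in detail. The pseudocount, which keeps every matrix entry nonzero, and all function names are likewise assumptions made for illustration.

```python
import numpy as np

def estimate_transition_matrix(trajs, n_states, pseudocount=1.0):
    """Row-normalized transition matrix from discrete state trajectories.

    The pseudocount is an assumption of this sketch; it avoids zero entries.
    """
    counts = np.full((n_states, n_states), float(pseudocount))
    for traj in trajs:
        for a, b in zip(traj[:-1], traj[1:]):
            counts[a, b] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def relative_entropy(P, Q, weights):
    """Assumed form: D = sum_i w_i sum_j P_ij ln(P_ij / Q_ij).

    P is the reference matrix, Q the test matrix, and weights the
    reference state populations. All entries must be strictly positive.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w[:, None] * P * np.log(P / Q)))

if __name__ == "__main__":
    # Two hypothetical data sets already assigned to a common 2-state space.
    set_a = [[0, 0, 1, 1, 0, 1]]
    set_b = [[0, 1, 1, 1, 0, 0]]
    P = estimate_transition_matrix(set_a, n_states=2)
    Q = estimate_transition_matrix(set_b, n_states=2)
    print(relative_entropy(P, Q, weights=[0.5, 0.5]))
```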

CONCLUSIONS

Together, our results with villin and fundamental model systems demonstrate the

tremendous value of adaptive sampling. Since model quality has been assessed with a

global metric and shows strong agreement between adaptive sampling results and the

true model, we can conclude that adaptive sampling to minimize uncertainties in the

slowest kinetic rate improves the global quality of a model. Moreover, adaptive

sampling is significantly more efficient than a single long simulation, both in terms of

the wall-clock time and resources required to achieve a given model quality, up to

some saturation point. In fact, adaptive sampling with N parallel simulations requires

about a factor of two less computer-time and a factor of N less wall-clock time.

Considering that N can easily be as large as 10,000 (or more) (79), this can be a truly

dramatic advantage in wall-clock time, turning calculations normally requiring

decades into routine calculations on the timescale of days. Finally, since our

simulations started from just a couple of states, we can conclude that adaptive

sampling is capable of discovering new model components given no prior knowledge

of the system, and is thus useful for model construction in addition to model

refinement.

The adaptive sampling method described here may be directly applied to learn

models from simulations of metastable phenomena, leading to significant resource and

time savings in fields like molecular and quantum mechanics, but is not limited to

these applications. Given a means to prepare samples within a given state, it could be

applied equally well to experimental techniques, such as single molecule FRET and

force extension experiments. More broadly, minimizing uncertainties in a model is

likely to prove valuable even when metastability is not present. Similar methods may

also be useful for understanding other complex network dynamics, as in signaling

pathways.

CHAPTER 7: SIMULATED TEMPERING YIELDS INSIGHT INTO THE LOW-

RESOLUTION ROSETTA SCORING FUNCTIONS

This chapter was taken from: Bowman GR & Pande VS (2009) Simulated tempering

yields insight into the low-resolution Rosetta scoring functions. Proteins 74:777-788.

ABSTRACT

Rosetta is a structure prediction package that has been employed successfully in

numerous protein design and other applications (162). Previous reports have

attributed the current limitations of the Rosetta de novo structure prediction

algorithm to inadequate sampling, particularly during the low-resolution phase (150,

151, 163, 164). Here, we implement the Simulated Tempering (ST) sampling

algorithm (24, 25) in Rosetta to address this issue. ST is intended to yield canonical

sampling by inducing a random walk in temperature space such that broad sampling

is achieved at high temperatures and detailed exploration of local free energy minima

is achieved at low temperatures. ST should therefore visit basins in accordance with

their free energies rather than their energies and achieve more global sampling than

the localized scheme currently implemented in Rosetta. However, we find that ST

does not improve structure prediction with Rosetta. To understand why, we carried out

a detailed analysis of the low-resolution scoring functions and find that they do not

provide a strong bias towards the native state. In addition, we find that both ST and

standard Rosetta runs started from the native state are biased away from the native

state. Although the low-resolution scoring functions could be improved, we propose

that working entirely at full-atom resolution is now possible and may be a better

option due to superior native-state discrimination at full-atom resolution. Such an

approach will require more attention to the kinetics of convergence, however, as

functions capable of native state discrimination are not necessarily capable of rapidly

guiding non-native conformations to the native state.

INTRODUCTION

Since the discovery that a protein’s structure is determined by its sequence (46), a

great deal of effort has been poured into trying to predict structure from sequence.

Thus far, knowledge-based approaches have proved promising, though more purely

physics-based structure prediction has potential (92). The Rosetta suite is one of the

most successful approaches, and employs a combination of knowledge-based

strategies and physical insight. Some of the more prominent achievements of this

software package are the design of a protein with novel topology (165), the redesign of

protein-protein interfaces (166), the redesign of protein-nucleic acid interfaces (167),

the redesign of a folding pathway (168, 169), aid in solving the crystallographic phase

problem (170), and, most recently, the design of new enzymes for reactions without

known biological catalysts (171).

It has been suggested that the success of Rosetta is in large part due to its

accurate scoring functions (151, 172). In the sense that many of the terms are based on

energetic principles derived from physical chemistry, one can think of the scoring

functions as energy functions. On the other hand, many of the terms are based on

statistics from the PDB databank. Because they are based on native protein structures,

which are assumed to represent the lowest free energy structures for a given sequence,

these terms implicitly consider entropic contributions. Thus, the scoring functions can

be thought of as free energy functions. In addition, the practice of clustering the lowest

scoring structures is, in a sense, taking into account entropy by considering the relative

populations of various states (173). To avoid confusion we will use the term ‘‘scoring

function.’’ This is probably the most precise term as the scoring functions are

primarily designed to discriminate native structures from non-native ones rather than

to reproduce physical behavior. Furthermore, it allows us to more clearly discuss the

conformational free energy under a given scoring function.

Rosetta uses a number of scoring functions in two distinct phases: low-

resolution and full-atom. This ‘‘hierarchical’’ approach (174) was incorporated into

Rosetta for CASP6 (175). The low-resolution phase assumes that the conformational

search of a protein is biased by local structural preferences and that the free energy

minimum is selected by nonlocal interactions (162, 176). This is captured by building

the protein structure from fragments drawn from native protein crystal structures.

Thus, local interactions may be assumed to be at free energy minima and a coarse-

grained sampling of the nonlocal free energy landscape may be carried out (176).

During this phase, sidechains are represented by single atoms called centroids, thus

sacrificing atomic resolution for rapid sampling. All of the scoring functions employed

in this phase are dominated by the hydrophobic effect (162, 164) and are intended to

give the correct topology (162, 176). Full-atom refinement employing a single scoring

function is then carried out on each low-resolution model (151). This phase is intended

to give atomic resolution with correct packing (162). However, the full-atom scoring

function only tends to give accurate results when the starting low-resolution model is

within 3 Å of the native state, the ‘‘radius of convergence’’ (163, 174). Thus, the full-

atom phase is highly dependent on the success of the low-resolution phase. Together,

these two phases represent the belief that the native state lies at the bottom of a deep

minimum at the center of a broader basin (162, 173).

A number of recent works have claimed that the main challenge preventing

better structure prediction with Rosetta is sampling, particularly in the low-resolution

phase (150, 151, 163, 164). They suggest that improved sampling at low-resolution

would give more structures within the radius of convergence, and thus better full-atom

structures.

To address this issue, we have implemented the Simulated Tempering (ST)

sampling algorithm in the low-resolution phase of Rosetta. This algorithm is intended

to allow rapid barrier crossing by performing a random walk in temperature space. At

high temperatures broad sampling may be achieved, while at low temperature various

free energy minima may be explored. ST is a serial algorithm so it is amenable to an

automated distributed computing effort like Rosetta@home (151), whereas related

parallel algorithms like the Replica Exchange Method (REM) are not (27, 177).

METHODS

OVERVIEW OF ROSETTA

The standard Rosetta de novo structure prediction protocol (RSP) is designed to

predict the structure of a protein given its sequence. The algorithm begins with a fully

extended chain. First, a low-resolution phase is carried out in which side chains are

represented with centroids, single atoms which recapitulate the properties of the

sidechain. The centroids are located at the center of mass of the sidechain obtained

from averaging over all the conformations found in the PDB databank. A Monte Carlo

approach is used to substitute in segments from fragment libraries provided by the

user.

The fragment libraries consist of possible three- and nine-residue segments

from the PDB databank that match portions of the sequence. By default, 200 three-

residue and 200 nine-residue fragments are included for each overlapping segment of

the protein (176). Secondary structure predictions from PSIPRED (178), JUFO (179),

SAM (180), and PROF (181) are used to guide the selection of these segments (151).

These fragments are chosen such that the proportion of possible helix, strand, and

other configurations is equal to the average prediction of all the secondary structure

prediction programs used (176). One may install the software for generating these

segments locally or, as in the case of this work, use the Robetta server

(http://robetta.bakerlab.org/). Three- and nine-residue sequences are used as they have

the most significant correlations in local structure (182).

Only the torsion angles are modified when a fragment is inserted. Bond lengths

and angles are held constant. The values for the bond lengths and angles are taken

from CHARMM19 (183, 184). When generating the fragment libraries, the torsion

angles are modified from those in the PDB databank to maintain consistency with

these ideal bond lengths and angles (176).

Three major factors are intended to guide the algorithm to the native structure:

1) a series of scoring functions based on distributions from the PDB databank and

Bayesian inference (48, 172), 2) returning to the lowest scoring structure found thus

far at regular intervals, and 3) a temperature schedule called quenching that is

designed to detect and escape local minima.

The possible components of each scoring function are described in detail

elsewhere (176). Since each bit of local structure comes directly from native proteins,

it is assumed to be at a free energy minimum. Thus, the low-resolution scoring

functions focus on giving a rapid coarse-grained approximation of the free-energy

landscape for nonlocal interactions and are meant to find the global topology (162).

One of the major driving forces is hydrophobic burial (151, 164).

Figure 30 shows the order in which the scoring functions are used. The final

low-resolution de novo structure prediction scoring function, score4, is supposed to be

able to distinguish native structures. The other scoring functions are mainly variants of

score4 meant to help bias the structure towards the native state as quickly as possible.

Rosetta begins with one or two cycles of 2000 Monte Carlo steps with score0, which

only has a Van der Waals term. This scoring function serves to insert a fragment in

each position of the extended chain in order to provide a more or less random starting

point for the subsequent scoring functions. Next, a single cycle of 2000 steps is carried

out with score1, which is meant to accumulate secondary structure (176). Rosetta

then performs five repetitions of a 2000 step cycle with score2 followed by a 2000

step cycle with score5. Score2 includes terms to favor collapse and beta strand pairing

while score5 is similar but lacks these two terms to allow some relaxation. Three

cycles of 4000 steps each are then carried out using score3. Score3 includes all of the

possible low-resolution (or centroid) terms, except for hydrogen bonding. The first

cycle of score3 uses the normal fragment insertion scheme. The remaining cycles use

smoothing steps as described by Rohl et al. (176) to make small perturbations that

relax the structure. Finally, score4, which does not have any compaction or beta-strand

pairing terms, is used to rank the lowest scoring structure seen so far.
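As a worked check of the cycle counts quoted above, the default low-resolution schedule can be written as a small Python table and summed. The sketch assumes a single score0 cycle (the text allows one or two), and the `multiplier` parameter is a hypothetical stand-in for running a multiple of the default number of steps per cycle.

```python
# (scoring function, number of cycles, Monte Carlo steps per cycle)
LOW_RES_SCHEDULE = [
    ("score0", 1, 2000),  # the text says one or two cycles; one is assumed here
    ("score1", 1, 2000),
    ("score2", 5, 2000),  # alternates with score5 for five repetitions
    ("score5", 5, 2000),
    ("score3", 3, 4000),  # first cycle uses fragment insertion, the rest smoothing
]

def total_steps(schedule, multiplier=1):
    """Total Monte Carlo steps in the schedule, scaled by `multiplier`."""
    return multiplier * sum(cycles * steps for _, cycles, steps in schedule)

if __name__ == "__main__":
    print(total_steps(LOW_RES_SCHEDULE))                 # default run length
    print(total_steps(LOW_RES_SCHEDULE, multiplier=10))  # 10 times the default
```

Under these assumptions a default run comprises 36,000 Monte Carlo steps before the final ranking with score4.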

Figure 30. Flow chart showing the order the scoring functions are used in and giving brief descriptions

of each. After score5, Rosetta returns to score2 five times before progressing to score3. The first six

scoring functions constitute the low-resolution de novo structure prediction phase.

Beginning with score2 Rosetta returns to the lowest scoring structure seen so

far at the end of each cycle (approximately every 2000 steps). The temperature is

implicitly in units of kT. By default the temperature is initially set to 2 kT and is

updated using a quenching scheme. If 150 steps are performed without any being

accepted then it is assumed that a local minimum has been reached and the

temperature is increased by 1 kT, thus increasing the probability of accepting

subsequent moves (176). As soon as a step is accepted, the temperature is quenched:

that is, the temperature is immediately reset to 2 kT.
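The quenching scheme described above can be sketched in Python as follows. The class and parameter names are hypothetical, and the sketch assumes that the rejection counter restarts after each temperature increase, a detail the description above leaves open.

```python
class QuenchingSchedule:
    """Minimal sketch of the quenching temperature schedule (units of kT)."""

    def __init__(self, base_temp=2.0, increment=1.0, max_rejects=150):
        self.base_temp = base_temp
        self.increment = increment
        self.max_rejects = max_rejects
        self.temperature = base_temp
        self.rejects = 0  # consecutive rejected steps

    def update(self, accepted):
        """Update the temperature after one Monte Carlo step and return it."""
        if accepted:
            # Quench: reset immediately to the base temperature.
            self.temperature = self.base_temp
            self.rejects = 0
        else:
            self.rejects += 1
            if self.rejects >= self.max_rejects:
                # Assume a local minimum has been reached and heat up.
                self.temperature += self.increment
                self.rejects = 0  # assumption: counter restarts after heating
        return self.temperature

if __name__ == "__main__":
    schedule = QuenchingSchedule()
    for _ in range(300):
        schedule.update(accepted=False)
    print(schedule.temperature)            # two increases above the base value
    print(schedule.update(accepted=True))  # back to the base temperature
```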

At present, it is standard practice to perform full-atom refinement on each of

the low-resolution models (151). An example command-line for generating a low-

resolution model of protein G and then refining it is as follows:

rosetta.gcc64 aa 1igd A -verbose -silent

-increase_cycles 10 -new_centroid_packing

-abrelax -output_chi_silent -stringent_relax

-vary_omega -omega_weight 0.5 -farlx

-ex1 -ex2 -termini -short_range_hb_weight 0.50

-long_range_hb_weight 1.0 -no_filters

-rg_reweight 0.5 -rsd_wt_helix 0.5

-rsd_wt_loop 0.5 -output_all -accept_all

-do_farlx_checkpointing -relax_score_filter

-record_irms_before_relax -acceptance_rate 1.0

-filter1a 10000 -filter1b 10000 -nstruct 1

-constant_seed -jran 1918492

The purpose of refinement is to get atomic level accuracy with correct packing

of sidechains (162). Exploring the full-atom free energy landscape is considerably

more expensive than exploring the low-resolution one because of the atomic resolution

and the inclusion of local interaction terms. To minimize the computational expense, it

is assumed that the low-resolution starting model has the correct topology and only

conservative backbone moves are made (176). It is hoped that these conservative

moves will also help to get adequate exploration in the context of a compact chain

where large moves are likely to cause clashes. The backbone moves include small

random perturbations to single torsion angles, alterations of a series of torsion angles

such that the global structure is preserved, and gradient descent. The torsion angle

potential is based on distributions from the PDB databank (151). Correct packing is

achieved by rotamer optimization (162, 163, 185). Solvation effects are captured by

employing the EEF1 implicit solvent (186). Even with the assumption that the starting

model has the correct topology, refining every model is still very expensive and is

only made possible through the use of distributed computing on Rosetta@home (151).

An important addition to the full-atom scoring function is a direction-

dependent hydrogen bonding term (187). This potential term is based on distributions

from the PDB and has been shown to provide better native state discrimination than

Coulomb-based hydrogen bonding terms like those found in standard molecular

dynamics packages (187) and to agree with quantum calculations (188). Both

backbone-backbone and sidechain hydrogen bonds are included but the backbone-

backbone terms have been found to provide the best native state discrimination (187).

Hydrogen bonds are short range interactions, so it is not surprising that this potential

gives the best discrimination for decoys within 1-3 Å of the native state.

A similar hydrogen bonding term is also included in a brief low-resolution

relaxation performed before beginning the full-atom refinement. This low-resolution

scoring function is called score6. Relaxation in this scoring function uses a

conservative move set similar to that in full-atom relaxation.

The final prediction is made by performing an RMSD clustering (RMSD over Cα atoms)

of the 100 to 1000 lowest scoring full-atom models (151) and selecting

those with the greatest number of neighbors within a cutoff that depends on the size of

the protein (173).
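The neighbor-counting idea behind this selection can be sketched in Python as follows. This is only an illustration that assumes a precomputed pairwise RMSD matrix; it is not the clustering code used by Rosetta, and the function name and example values are hypothetical.

```python
import numpy as np

def rank_by_neighbors(rmsd_matrix, cutoff):
    """Rank models by how many other models lie within `cutoff` RMSD.

    Returns (order, neighbor_counts), where order lists model indices from
    most to fewest neighbors (ties keep their original order).
    """
    D = np.asarray(rmsd_matrix, dtype=float)
    # Count entries within the cutoff, excluding each model's zero self-distance.
    neighbor_counts = (D <= cutoff).sum(axis=1) - 1
    order = np.argsort(-neighbor_counts, kind="stable")
    return order, neighbor_counts

if __name__ == "__main__":
    # Hypothetical pairwise RMSDs (in angstroms) for four models:
    # models 0-2 are mutually similar and model 3 is an outlier.
    D = np.array([
        [0.0, 1.0, 1.0, 10.0],
        [1.0, 0.0, 1.0, 10.0],
        [1.0, 1.0, 0.0, 10.0],
        [10.0, 10.0, 10.0, 0.0],
    ])
    order, counts = rank_by_neighbors(D, cutoff=2.0)
    print(order, counts)
```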

MODIFICATIONS TO ROSETTA

SIMULATED TEMPERING

In this work, the Simulated Tempering (ST) sampling algorithm (24, 25) was

implemented in place of the default quenching temperature schedule. ST allows the

system to perform a random walk in temperature space with canonical sampling at

each temperature. At high temperatures the free energy landscape is flattened,

allowing broad sampling of conformation space. At low temperatures, barriers are

present and tend to confine the system to exploring a single free energy minimum. By

performing a random walk in temperature space a single run is able to explore

multiple minima, thus speeding convergence. For a detailed derivation of ST refer to

Huang et al. (177) and the original works (24, 25).

ST requires an initial temperature, a list of possible temperatures, and a list of

weights for each temperature as inputs. For this work the possible temperatures are

0.1, 0.25, 0.5, 0.75, 1, 2, 3, 5, 10, and 20 kT. At regular intervals an attempt is made to

change temperatures. For each attempt, the algorithm randomly decides to go either up

or down in temperature. The probability of accepting the attempt is

$$P(i \to j) = \min\left\{1,\; e^{-(\beta_j - \beta_i)\,U(X) + (g_j - g_i)}\right\} \qquad (7.1)$$

where P(i→j) is the probability of transitioning from temperature i to temperature j,

$\beta_i = 1/(k_B T_i)$, U(X) is the potential energy of conformation X (or in this case the

score), and $g_i$ is the weight of temperature i. Assuming the weights are properly

selected, this probabilistic temperature changing ensures that the detailed balance

condition for equilibrium is satisfied. That is,

$$P_i(X)\,P(i \to j) = P_j(X)\,P(j \to i) \qquad (7.2)$$

where $P_i(X)$ is the probability of conformation X at temperature i and P(i→j) is the

probability of transitioning from temperature i to temperature j. In addition, it can also

be shown that for a correct set of weights

$$P(i \to j) = P(j \to i) \qquad (7.3)$$

where, P(i→j) is the probability of transitioning from temperature i to temperature j.

Furthermore, ensuring that Eq. (7.3) is satisfied is sufficient to yield correct weights

(189).
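A single temperature-change attempt following Eq. (7.1) can be sketched in Python as below. Temperatures are taken to be in units of kT, so that beta = 1/T; the function names, the zero weights in the usage example, and the rule that proposals beyond either end of the temperature list are rejected are assumptions made for illustration only.

```python
import math
import random

# Temperature list used in this work (units of kT).
TEMPS = [0.1, 0.25, 0.5, 0.75, 1, 2, 3, 5, 10, 20]

def st_accept_probability(score, temp_i, temp_j, g_i, g_j):
    """Acceptance probability of Eq. (7.1) for a move from temperature i to j."""
    beta_i, beta_j = 1.0 / temp_i, 1.0 / temp_j
    exponent = -(beta_j - beta_i) * score + (g_j - g_i)
    # min(1, e^exponent), written to avoid overflow for large exponents.
    return 1.0 if exponent >= 0 else math.exp(exponent)

def attempt_temperature_move(index, score, temps, weights, rng=random):
    """Propose a neighboring temperature at random and accept per Eq. (7.1).

    Returns the new temperature index (unchanged if the move is rejected).
    """
    proposal = index + rng.choice((-1, 1))
    if proposal < 0 or proposal >= len(temps):
        return index  # assumption: moves off the ends of the list are rejected
    p = st_accept_probability(score, temps[index], temps[proposal],
                              weights[index], weights[proposal])
    return proposal if rng.random() < p else index

if __name__ == "__main__":
    weights = [0.0] * len(TEMPS)  # placeholder weights, not converged values
    print(attempt_temperature_move(4, score=-25.0, temps=TEMPS, weights=weights))
```

With all weights equal, a low-scoring conformation rarely moves to a higher temperature, which is why the weights must be tuned until the forward and reverse transition probabilities balance.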

From Eq. (7.1) it is evident that the probability of making a temperature

change is controlled by the energy distribution (or in this case score distribution),

temperature spacing, and the difference between the weights of a pair of neighboring

temperatures. To choose the temperature list and an initial set of weights, constant

temperature runs were carried out at a variety of temperatures ranging from 0.1 to 20

kT. Temperatures were selected such that weights could be found yielding

$P(i \to j) = P(j \to i) \approx 0.5$. Twenty iterations of 100 runs at 10 times the

default length were then carried out, updating the weights after each iteration to satisfy

Eq. (7.3). This protocol yielded converged weights for all of the systems studied and is

thus a plausible candidate for a fully automated system compatible with a distributed

computing environment like Rosetta@home.

OTHER OPTIONS

The user may change the frequency of temperature change attempts, the frequency of

outputting structures throughout a given run, and whether the program recovers the

lowest scoring structure seen thus far between cycles. Another option allows the

exploration of the final scoring function alone, thus removing any bias from the other

scoring functions or the regular returns to the lowest scoring structure.

STRUCTURE PREDICTION PROTOCOLS

Structure prediction is carried out employing the same procedure used by the Baker

lab in CASP7 (151), though full-atom refinement was only used in a subset of cases.

For each structure 10,000 independent runs are carried out using 10 times the default

number of steps for each cycle. Each low-resolution run took about 2 min on an Intel

E5345 quad-core 2.33 GHz processor. Full-atom refinement of a model took an

additional 4 min. The lowest scoring structure from each run is stored. All of these

structures are clustered by RMSD and the top five cluster centers are selected as the

best predictions.

A similar procedure was used for comparing the various scoring functions. In

this case, 1000 independent runs each with 100 times the default number of steps were

performed to ensure adequate exploration of the whole space. Each RSP run using the

full sequence of scoring functions took about 15 min on an Intel E5345 quad-core 2.33

GHz processor while using just score4 took 30 min. The ST variant took 30 min and 1

hr when using all the scoring functions and just score4 respectively. For ST, an

independent set of weights is used for each scoring function to ensure canonical

sampling of each of them.

To characterize the native state, the crystal structure was idealized and relaxed

50 times. Idealization consists of setting the bond lengths and angles to ideal values.

Relaxation is carried out to compensate for any deleterious effects resulting from the

idealization process. Relaxation may be carried out in either the low-resolution score6

space or the full-atom scoring function using the conservative move set described in

the ‘‘Overview of Rosetta’’ Section. The native structure used as the starting point for

many of the runs described below was the one with the lowest RMSD out of the 50

idealization/relaxation runs carried out in the low-resolution score6 space.

For protein G, this structure has an RMSD of 0.88 Å. Projections of the free

energy landscapes from ST runs were generated using the Multistate Bennett

Acceptance Ratio (MBAR) estimator (190), a variant of the Weighted Histogram

Analysis Method (WHAM) (191), to make use of data from all of the temperatures.

Projections from RSP were generated using all the data. These plots are analogous to

free energy landscapes but we note that RSP does not guarantee a canonical

distribution.

RESULTS

COMPARISON OF ST AND STANDARD ROSETTA

The structure prediction protocol was carried out for four systems of varying size: an

SH3 domain (58 residues, PDB code 1shf (192)), protein G (61 residues, PDB code

1igd (193)), ubiquitin (76 residues, PDB code 1d3z (194)), and a zinc finger (136

residues, PDB code 2j6a (195)). All are X-ray structures with a resolution less than 2

Å with the exception of ubiquitin, which is the sole NMR structure evaluated in this

work.

SH3 DOMAIN

In a previous work Bradley et al. found poor results for the SH3 domain studied here

(150). They attributed these poor results to inadequate sampling; thus, this target

seemed like a good test for enhanced sampling. The ST results proved to be

qualitatively similar to those from RSP, as shown in Figure 31. Both algorithms

identified a pronounced score minimum at high RMSD, which may explain the poor

results from the previous study. Despite this similarity, ST did yield slightly broader

sampling of the low score space. Figure 31 also shows black plus-signs corresponding

to the idealized native structure relaxed with the score6 low-resolution scoring

function, which includes a direction dependent hydrogen bonding term, and then

scored with the final scoring function, score4. This data indicates that the native

structure is stable in score6 but is not recognized as native by the final scoring

function.

Figure 31. Score versus RMSD (Å) for an SH3 domain (PDB code 1shf). Each diamond represents the

lowest scoring structure for a single run. Data for ST is shown in blue while data for standard

Rosetta is shown in red. The black ‘‘+’’ symbols represent models obtained by idealizing and

relaxing the crystal structure in low-resolution mode.

PROTEIN G

Protein G was chosen as a small and tractable target. Figure 32(A) shows the results of

low-resolution de novo structure prediction while panel (B) shows the results for full-

atom refinement of the low-resolution models. Both RSP and the ST variant perform

well on this system, finding low scoring structures with RMSD values as low as 1.5 Å.

The lowest scoring de novo structures had RMSD values of about 2 Å but there is a

clear correlation between low score and low RMSD. The ST and RSP results were

qualitatively identical for both low-resolution runs and full-atom refinement. Full-

atom refinement does not appear to greatly change the RMSD. On average the RMSD

changed by -0.03 Å with a standard deviation of 0.3 Å between the low-resolution and

full-atom phases. The black plus-signs in Figure 32(A) demonstrate that the native

structure is stable in score6 but not recognized by score4, as was the case for 1shf.

Figure 32(B), however, shows that the full-atom scoring function assigns low scores to

the idealized native structure when relaxed in the full-atom scoring function (yellow

circles). Furthermore, the low-resolution relaxed native structures that were not

recognized by score4 are assigned low scores after relaxation with the full-atom

scoring function (black asterisks). In fact, these structures are closer to the full-atom

idealized/relaxed structures than any of the de novo structures.

Figure 32. Score versus RMSD (Å) for protein G (PDB code 1igd). Each diamond represents the

lowest scoring structure for a single run. Data for ST is shown in blue while data for standard

Rosetta is shown in red. Panel (A) shows results from the low-resolution phase. The black ‘‘+’’

symbols represent models obtained by idealizing and relaxing the crystal structure in low-

resolution mode. Panel (B) shows results from the full-atom phase. The yellow circles represent

models obtained by idealizing and relaxing the crystal structure in full-atom mode. The black ‘‘*’’

symbols are full-atom models obtained by relaxing the low-resolution structures depicted by ‘‘+’’

symbols in (A) using the full-atom scoring functions.

UBIQUITIN

Ubiquitin was selected due to its larger size and to evaluate the accuracy on NMR

structures. Both Rosetta variants gave equivalently good results on this system (data

not shown). As with protein G, there was a general correlation between score and RMSD,

and structures with RMSD values as low as 2.5 Å were reached. Once again, the idealized

and relaxed native structure was not assigned a low score by score4.

ZINC FINGER

Finally, the 136-residue zinc finger was a CASP7 target selected to push the limits of

the algorithms. Rosetta tends to perform well on proteins with fewer than 100 residues


(151). If ST were indeed giving enhanced sampling, one would expect it to

outperform RSP on larger systems. However, once again both Rosetta variants gave

equivalent and poor results, with RMSD values no lower than 10 Å, and the idealized

and relaxed native structures were not recognized by score4 (data not shown).

The close agreement between the ST results and those of RSP in all cases

indicates that ST is not giving enhanced sampling. The most probable explanations are

that ST is not capable of escaping free energy minima in the Rosetta scoring functions

or that RSP is correctly identifying all the accessible minima of the scoring functions.

To explore these alternatives a more extensive analysis of the protein G results is

conducted as this small system allows many trials to be run. Results presented by

Alena Shmygelska and Michael Levitt at our Structural Biology Retreat on Nov. 15,

2007 showed that both Temperature replica exchange and Hamiltonian replica

exchange do sample significantly better than the original Rosetta Monte Carlo method.

VALIDATION OF ST

Figure 33 shows an example of the evolution of the weights throughout the weight

determination protocol, demonstrating that it yields converged weights. The difference

in weights is plotted as it is this difference, and not the absolute value of the weights,

that determines the acceptance probability. The weight differences at high

temperatures converge very quickly, consistent with a more or less flat free energy

surface allowing broad sampling. The weight differences at low temperatures converge

more slowly, consistent with a more rugged landscape and restricted sampling.

Independent runs of the weight determination procedure also produced more or less

equivalent results (data not shown). The convergence of the weights is good evidence

that the algorithm is working properly. These converged weights yield equal sampling

of temperature space and multiple visits to both high and low temperatures in each

run.
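
The role of the weight differences can be illustrated with a minimal sketch of a simulated tempering temperature move. It assumes the standard ST convention in which the sampled distribution at temperature index m is proportional to exp(-beta_m * score + g_m); the function name and the numbers below are hypothetical and are not taken from the Rosetta implementation.

```python
import math

def st_accept_prob(score, beta_i, beta_j, g_i, g_j):
    """Metropolis acceptance probability for a temperature move i -> j.

    Assumed convention: the generalized ensemble weight at temperature
    index m is exp(-beta_m * score + g_m). Only g_j - g_i enters.
    """
    log_ratio = -(beta_j - beta_i) * score + (g_j - g_i)
    # exp(min(0, x)) equals min(1, exp(x)) and cannot overflow
    return math.exp(min(0.0, log_ratio))

# Hypothetical values: shifting both weights by the same constant
# leaves the acceptance probability unchanged.
p_base = st_accept_prob(score=-50.0, beta_i=1.0, beta_j=0.5, g_i=0.0, g_j=-20.0)
p_shifted = st_accept_prob(score=-50.0, beta_i=1.0, beta_j=0.5, g_i=100.0, g_j=80.0)
```

Because only the difference g_j - g_i appears in the exponent, p_base and p_shifted agree, which is why the weight differences rather than the absolute weights are plotted.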


Figure 33. Evolution of the score4 weights for protein G. The dashed line is the difference between the

weights of the highest two temperatures: 10 and 20 kT. The solid line is the difference between the

weights of the lowest two temperatures: 0.1 and 0.25 kT. The first points come from constant

temperature runs and subsequent points represent each iteration of refining the weights. Δg = g_j - g_i,

where j > i.

Figure 34(D)–(F) show projections of the free energy surface onto score versus

RMSD for protein G runs started from the native structure at a range of temperatures

(0.1, 2, and 20 kT, respectively). Figure 34(D) shows that at low temperature, the

system spends considerably more time at low score and low RMSD. The higher scores

and RMSDs at high temperature show that high enough temperatures are being

reached for the system to escape local minima and achieve broad sampling.
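
A projection of this kind can be sketched as a two-dimensional histogram of (RMSD, score) samples converted to a free energy through F = -ln P, in units of kT. The sketch below is an illustration under assumed bin widths; the function name, bin widths, and sample values are hypothetical and are not the binning used to produce the figures.

```python
import math
from collections import Counter

def free_energy_projection(rmsd, score, rmsd_width=1.0, score_width=5.0):
    """Map each visited (RMSD bin, score bin) to F = -ln(P) in units of kT.

    The most populated bin is shifted to F = 0. Bins that were never
    visited are simply absent from the returned dictionary.
    """
    counts = Counter(
        (math.floor(r / rmsd_width), math.floor(s / score_width))
        for r, s in zip(rmsd, score)
    )
    max_count = max(counts.values())
    # -ln(c / total) + ln(max / total) simplifies to -ln(c / max)
    return {bin_id: -math.log(c / max_count) for bin_id, c in counts.items()}

# Hypothetical samples: three structures near 5 Å and one near 12 Å.
landscape = free_energy_projection(
    rmsd=[5.2, 5.7, 5.9, 12.1],
    score=[-40.0, -41.0, -42.0, -10.0],
)
```

In this toy input the bin holding two samples defines the zero of free energy, and each singly occupied bin sits ln 2 higher.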


Figure 34. Projections of the free energy landscape onto score versus RMSD (Å) for protein G in

score4 using: (A) standard Rosetta runs starting from an extended chain, (B) standard Rosetta runs

starting from the native state, (C) ST runs at 0.1 kT starting from an extended chain, (D) ST runs at

0.1 kT starting from the native state, (E) ST runs at 2 kT starting from the native state, (F) ST runs

at 20 kT starting from the native state. Each white plus-sign corresponds to the lowest scoring

structure for a single run. The lowest scoring structures from each run were sorted by RMSD and

only every twentieth point is shown so as to give the entire range without obscuring the underlying

plot.

THE FINAL SCORING FUNCTION

In theory, the sequence of scoring functions employed by Rosetta was chosen to bias

the system to low score and low RMSD as quickly as possible. The final low-

resolution scoring function (score4) is only applied to the lowest scoring structure


found while exploring the previous scoring function (score3). Thus, it is difficult to

judge whether or not the system is truly being biased towards the global free energy

minimum of score4. To test this, both ST and RSP were applied to the final scoring

function in isolation.

Figure 34 shows that this analysis yields qualitatively similar results for both

ST and RSP, particularly at low score and low RMSD. The agreement between the

results generated with the same algorithm but different starting states demonstrates

that the landscapes are converged and, therefore, represent the entire accessible space.

The agreement between the ST and RSP results suggests that both algorithms are

identifying the free energy minimum. Furthermore, the ST results show that running at

lower temperatures does not significantly shift the free energy minimum towards the

native state. Limiting the temperature range used by ST to 0.1–3 kT to more closely

parallel the temperatures explored by RSP gave similar results (data not shown).

A rough time course of the evolution of the RSP landscapes was found by

plotting projections of the free energy landscape for the first third of each run, the

second third, and the final third. All three plots were identical to Figure 34(A),(B)

(data not shown), further supporting the conclusion that the landscape is converged

and indicating that the algorithm is capable of crossing any barriers present on short

timescales. Analyzing the temperature throughout the RSP runs shows that 90% of

the runs increased their temperature to 3 kT at some point but that only about 10%

increased their temperature to 4 kT and none reached temperatures greater than 6 kT.

Moreover, less than 10% of the time was spent at temperatures greater than 2 kT.

The global minimum in this projection covers a range of about 5–20 Å. The

differences present between the RSP and ST results are due to the greater temperature

range and approximately equal sampling at each temperature in ST. Visits to lower

RMSD values rapidly become less likely below about 5 Å. Conformational clustering

was carried out to confirm that the projection onto score versus RMSD is not hiding a

highly populated minimum closer to the native state. The centers of the top 10 most


populated clusters were found to fall within the minimum of the landscape, confirming

the validity of this projection. Thus, the final scoring function provides only a small

bias towards the native state and has only small barriers.

Figure 34 also includes white plus-signs for the lowest scoring structure visited

for each run. Once again, there is strong agreement between RSP and ST started from

both native and extended structures. These points fall in a range of about 2.5–15 Å.

The spread is slightly larger for ST runs, which is to be expected given that more time

is spent at higher temperatures.

The large variation of the score with RMSD is also of interest. It seems that

while low scores are correlated with low RMSD values, high scores are not necessarily

indicative of high RMSD values.

THE OTHER SCORING FUNCTIONS

Given this analysis of the final scoring function, it is interesting to ask where the bias

towards the native state seen in the RSP results comes from. Two possible answers

are: (1) the sequence of other scoring functions and (2) the frequent returns to the

lowest scoring structure found so far. To explore these possibilities 1000 runs at 100

times the default length were carried out with both ST and RSP with and without the

regular returns to the lowest scoring structure.

Figure 35 shows projections of the free energy landscape onto score versus

RMSD for each of the Rosetta scoring functions. The first two columns show general

agreement between RSP runs with and without returns to the lowest scoring structure.

Strong agreement between ST runs with and without returns was also found (data not

shown). General agreement is found between the RSP and ST results as well. Though

ST has a stronger bias towards lower scoring structures due to the lower temperatures

reached, there is no apparent bias towards lower RMSD structures. Figure 35 also

includes white plus-signs corresponding to the lowest scoring structure for a single run

in the given scoring function. Although these points are slightly more localized to low


RMSD when the low scoring structure is regularly recovered, lower RMSD values are

not reached. Thus, it would seem any bias comes from the sequence of scoring

functions employed, though returning to the lowest scoring structure between cycles

may speed up the process slightly.

Figure 35. Projections of the free energy landscape onto score versus RMSD (Å) for protein G. Each

white plus-sign corresponds to the lowest scoring structure for a single run. The lowest scoring

structures from each run were sorted by RMSD and only every twentieth point is shown so as to

give the entire range without obscuring the underlying plot. (A), (D), (G), and (J) show data from

standard Rosetta runs with frequent recovery of the lowest scoring structure in score1, score2,

score5, and score3 respectively. (B), (E), (H), and (K) show data from standard Rosetta runs


without frequent recovery of the lowest scoring structure in score1, score2, score5, and score3

respectively. (C), (F), (I), and (L) show data from ST runs at 0.1 kT without frequent recovery of

the lowest scoring structure in score1, score2, score5, and score3, respectively.

In fact, Figure 35 indicates that this is so. Score1 shows the broadest sampling

and the least bias towards low score and RMSD. Score2, on the other hand, has a

global free energy minimum at about 5 Å. Score5 once again allows slightly broader

sampling. Relatively broad sampling is achieved by score3 as well; however, this

scoring function reaches the lowest RMSD values and is the only one to have a global

minimum extending well below 5 Å. Based on these results it appears that score2 and

score3 provide the greatest biasing force towards the native state. The main

distinguishing feature of these scoring functions is the inclusion of compaction and

beta-strand pairing terms.

DISCUSSION

In previous works, many of the limitations of Rosetta have been attributed to

inadequate sampling, particularly at low-resolution (150, 151, 163, 164). However, the

failure of the Simulated Tempering (ST) sampling algorithm to give any improvement

on a range of targets indicates that this may not be the case. The fact that ST does not

yield any improvement on a larger target where any increase in sampling would be

most beneficial is particularly noteworthy. Plots of the free energy landscape at

varying temperatures demonstrate that ST is working and that these results are not just

due to a failure to reach sufficiently high temperatures.

The results presented in this work indicate that the low-resolution scoring

functions are the main limitations to Rosetta’s performance, and not the sampling

methods employed. This conclusion is supported by the fact that the final low-

resolution scoring function (score4) fails to recognize the native structure for any of

the targets examined. Furthermore, the free energy landscape for the score4 function

has a broad minimum ranging from 5–20 Å RMSD. This landscape is judged to be


converged based on agreement between data generated from two different starting

conformations and a rough time course of the landscape. If the low-resolution scoring

functions were capable of recognizing the native state but it were just very difficult to

get there one might expect ST runs starting from the native state to give worse results

than standard Rosetta runs started from the native state because increasing the

temperature would cause unfolding without subsequent refolding. The fact that

standard Rosetta runs in score4 started from the native state are no more likely to

identify near-native states highlights the bias of the low-resolution scoring functions

away from the native state. Two of the other low-resolution scoring functions are

found to have minima closer to the native state, presumably due to the inclusion of

compaction and beta-strand pairing terms. However, this minimum at around 5 Å is

still insufficient to satisfy the 3 Å radius of convergence required by full-atom refinement

(163, 174). In principle, lower temperatures may promote more sampling at

lower scores but the ST results at 0.1 kT show that in practice this doesn’t provide

much, if any, improvement over the standard 2 kT quenching runs. Finally, plotting

the free energy landscape of structures throughout numerous runs rather than just a

scatter plot of the lowest scoring structure from each run demonstrates that there is

only weak correlation between low score and low RMSD. Although low scores may

be indicative of low RMSD values, high scores are not necessarily correlated with

high RMSD values. Together, these results indicate that the low-resolution scoring

functions do indeed allow rapid and complete exploration of a coarse grained

landscape as intended but this landscape does not have the desired near-native free

energy minimum.

The conclusion that the low-resolution scoring functions are the main limiting

factor in Rosetta is also supported by a number of recent works. For example, Misura

et al. note that generating more low-resolution models and then refining them did not

improve the accuracy of de novo structure prediction while performing more

independent refinement runs of each low-resolution model did (163). This observation

seems to indicate that sampling is a problem at high-resolution but not low-resolution.


Furthermore, the presence of false minima in the low-resolution scoring function has

been acknowledged (150, 151).

One approach to addressing these issues would be to improve the low-

resolution scoring function. This could be accomplished through improving either the

thermodynamics, kinetics, or both. On the thermodynamic front, the ideal case would

be to have a scoring function free energy minimum near the native state. This would

ensure that each run would be likely to find a near-native state and preferably spend a

significant amount of time sampling near-native regions. One way to achieve this goal

would be to make native and near-native states have even lower scores to compensate for

the apparent entropic benefits of higher RMSD structures.

An alternative approach would be to think more about kinetics. At present

most runs seem to find a lower scoring structure than those in the free energy

minimum but do not appear to spend a great deal of time there. Assigning even lower

scores to these structures could bias the sampling towards these regions. However, it

may be the case that it is just too difficult to get to these structures. To illustrate, one

can imagine that a map showing the locations of cities but no roads would give an

accurate indication of the distance between two points but no indication of the fastest

route between them. Likewise, assigning a structure a low score may accurately

identify it as a native-like structure but not solve the problem of getting to that

structure from an extended conformation.

One final way of improving the low-resolution scoring functions would be to

improve the move set. In one recent work, it is acknowledged that the ‘‘single-

fragment insertion approach makes many global conformers dynamically

inaccessible’’ (176). Incorporating more conservative moves could improve results.

However, making smaller moves would slow down exploration of the space, defeating

the purpose of having a low-resolution phase in the first place.


Alternatively, one could forgo the low-resolution phase altogether in favor of

increased sampling at full-atom resolution. The hierarchical approach was developed

in recognition of the fact that using more detail from the beginning was too expensive

but that low-resolution models in isolation make prohibitive simplifications (174).

However, creating an adequate and generalizable low-resolution model may not be

worth the cost and effort. If the low-resolution scoring functions are not accurate

enough then they will tend to bias the structure away from the native-state. At present,

the low-resolution phase gives structures in the range of 3–6 Å starting from an

extended chain (176) and this work shows that starting from the native-state gives

equivalent results.

Furthermore, the full-atom refinement carried out in this work gave negligible

changes in RMSD, though other works have claimed 1.5–4 Å changes (163). In either

case, the low-resolution phase is unlikely to give many structures close enough to the

native backbone structure. And, without the correct backbone structure it is nearly

impossible to get the correct packing (162). In addition, the problems found in the

low-resolution phase are compounded by inaccurate secondary structure prediction

(151). This dependency could be removed by working solely at full-atom resolution.

The sampling required for such an endeavor is daunting, but recent distributed

computing efforts such as Folding@home (79) and Rosetta@home (151) may make it

feasible. Furthermore, the fact that the Rosetta full-atom scoring function is starting to

yield improvements in homology modeling shows that its accuracy is promising

(151). Finally, some recent work in purely full-atom structure prediction without the

sampling power of a distributed computing platform shows that this approach may be

viable (196, 197). One can even imagine a new hierarchical approach in which a large

move set full-atom phase is followed by a second phase with more conservative

moves. Of course, developing such a method would still require careful attention to

the difference between native state discrimination and a function capable of guiding a

non-native state to the native state.


CONCLUSIONS

We have implemented the Simulated Tempering (ST) sampling algorithm in Rosetta

to test whether improved sampling in the low-resolution phase can improve Rosetta

structure prediction. The low-resolution Rosetta scoring functions are shown to be

adequately sampled by both standard Rosetta and a Simulated Tempering variant.

Agreement between data generated from an extended and a native conformation

supports the conclusion that the entire space is being sampled. Similar agreement

between the results from both algorithms indicates that the scoring functions do not

have near-native free energy minima. Thus, the low-resolution scoring functions, and

not sampling at low-resolution, are the main limitation to accurate Rosetta structure

prediction.

Structure prediction with Rosetta may be improved by correcting the low-

resolution scoring functions. However, given current computational resources, such as

Folding@home and Rosetta@home, it may be time to work at full-atom resolution

from the beginning. Such an endeavor would require careful consideration of kinetics.

Although functions designed for native-state discrimination may be able to correctly

distinguish between native and nonnative conformations, that does not necessarily

indicate that they are well suited for guiding nonnative conformations to the native

state. Even if one does not care about predicting physical kinetics (e.g. the rate of

folding in simulation compared with experiment), rapid kinetics of reaching the native

state is crucial for the convergence of the simulation results and the general efficiency

of the method.

CHAPTER 8: THE ROLES OF ENTROPY AND KINETICS IN STRUCTURE

PREDICTION

This chapter was taken from: Bowman GR & Pande VS (2009) The roles of entropy

and kinetics in structure prediction. PLoS One 4:e5840.

ABSTRACT

Here we continue our efforts to use methods developed in the folding mechanism

community to both better understand and improve structure prediction. Our previous

work demonstrated that Rosetta’s coarse-grained potentials may actually impede

accurate structure prediction at full-atom resolution. Based on this work we postulated

that it may be time to work completely at full-atom resolution but that doing so may

require more careful attention to the kinetics of convergence. To explore the

possibility of working entirely at full-atom resolution, we apply enhanced sampling

algorithms and the free energy theory developed in the folding mechanism community

to full-atom protein structure prediction with the prominent Rosetta package. We find

that Rosetta’s full-atom scoring function is indeed able to recognize diverse protein

native states and that there is a strong correlation between score and Cα RMSD to the

native state. However, we also show that there is a huge entropic barrier to folding

under this potential and the kinetics of folding are extremely slow. We then exploit

this new understanding to suggest ways to improve structure prediction. Based on this

work we hypothesize that structure prediction may be improved by taking a more

physical approach, i.e. considering the nature of the model thermodynamics and

kinetics which result from structure prediction simulations.


INTRODUCTION

In 1961 Anfinsen demonstrated that the native state of a protein is encoded in its

amino acid sequence and hypothesized that the native state is the lowest free energy

state (46). Since then, many researchers have dedicated their careers to understanding

the driving forces underlying protein folding in order to 1) predict the native states of

proteins from their amino acid sequences and 2) understand the mechanisms and

pathways by which proteins fold. Collectively, these components constitute the protein

folding problem (53, 70).

The protein structure prediction community has generally focused on finding a

protein’s native state based on its sequence. A typical approach is to develop a

knowledge-based scoring function to discriminate native structures from non-native

ones and to sample this potential in search of the global minimum (198). For example,

the Rosetta structure prediction package uses a Monte Carlo (MC) scheme to sample a

series of scoring functions with increasing levels of chemical detail in order to identify

protein native states (48, 49, 150). In Rosetta and many other structure prediction

schemes, the problem of finding the free energy minimum is simplified by focusing on

the energetic (or score) term (199). We note that Rosetta includes a simple implicit

solvent and some implicit accounting for entropy by using information from known

structures but stress that it does not explicitly account for conformational entropy. This

simplification is justified by arguing that the conformational entropy of the native state

is negligible and, therefore, the energetic term must be the dominant factor favoring

the native state and the energy minimum should be equivalent to the free energy

minimum. This approach has proved remarkably successful and has resulted in the

design of a protein with a novel fold (165), accurate high-resolution structure

predictions for small globular proteins (151), and the design of novel enzymes (171).

However, ignoring conformational entropy will have increasingly deleterious effects

on the landscape as one moves away from the native state and this may ultimately

prevent accurate structure prediction for more complex systems.


In contrast, researchers studying folding mechanisms have placed less

emphasis on predicting native states and focused on understanding how proteins fold.

This work is also based on potentials, or force fields. However, these potentials have

been designed to reproduce our physical reality rather than to simply discriminate

native and non-native protein structures. Furthermore, much emphasis has been placed

on understanding the entire free energy landscape and the kinetics of traversing this

landscape (53). To accomplish these objectives numerous advanced sampling

algorithms have been developed (21), as well as methods to visualize free energy

landscapes (52) and determine whether or not they represent the true equilibrium

distribution of the system under the given potential (177).

Here we continue our efforts to use methods developed in the folding

mechanism community to both better understand and improve structure prediction.

Our previous work demonstrated that Rosetta’s coarse-grained potentials may actually

impede accurate structure prediction at full-atom resolution (49) and this result has

been confirmed by other researchers (200). Based on this work we postulated that it

may be time to work completely at full-atom resolution but that doing so may require

more careful attention to the kinetics of convergence. To explore this possibility, we

have used Generalized Ensemble (GE) algorithms (21) to generate projections of the

landscape defined by Rosetta’s full-atom scoring function. We find that these scoring

functions are capable of recognizing the native states of both protein G and engrailed

homeodomain, an α/β protein and an all-α-helical protein, respectively. Furthermore, the score has

the desired correlation with Cα RMSD to the native state. However, there is a huge

entropic barrier to folding and the hydrogen bonding potential does not provide any

significant bias towards the native state, slowing the kinetics of convergence. Based

on these insights, we believe that further advances in structure prediction may be made

by taking advantage of methods and ideas developed in the folding mechanism

community.



GENERAL APPROACH

In order to gain a deeper understanding of Rosetta’s full-atom resolution scoring

function we have implemented a variant of the Simulated Tempering (ST) algorithm

(24, 25) in Rosetta. ST was originally intended to induce the system of interest to

perform a random walk in temperature space so that broad sampling at high

temperatures would improve mixing at lower temperatures. However, ST may be

generalized to other spaces (24). Here we define an RMSD space consisting of a

number of umbrellas constraining the system to a given Cα RMSD from the native

state. ST is then used to induce the system to perform a random walk in RMSD space

without making any alterations to the temperature (201). Furthermore, we only use

MC moves rather than the combination of MC and minimization moves used in the

standard Rosetta protocol. Thus, the system can move back and forth between the

folded and unfolded states while remaining at equilibrium. Exchanging between

umbrellas also allows the system to access all the possible conformations in a given

RMSD range (202). By performing many simulations in parallel we hope to explore

all the relevant folding pathways. Figure 36 shows that this procedure results in

reversible folding (i.e. multiple folding and unfolding events), confirming that our

simulations have reached convergence (203). The Multistate Bennett Acceptance

Ratio (MBAR) method (190), a statistically optimal variant of the Weighted

Histogram Analysis Method (WHAM) (191), is used to determine the unbiased

average values of thermodynamic properties such as energies and conformational

entropies as a function of the RMSD. All the thermodynamic measurements in this

work are dimensionless. That is, energies and free energies are given in units of the

thermal energy kT and entropies are given in units of the Boltzmann constant k.
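
The umbrella-switching step can be sketched in the same way as a temperature move. The sketch below assumes harmonic umbrellas in Cα RMSD and neighbor-only proposals; the text does not specify the functional form of the restraint, the spring constant, or the proposal scheme, so all of these, along with the function names, are hypothetical.

```python
import math
import random

def umbrella_bias(rmsd, center, k_spring):
    """Harmonic restraint (assumed form) holding the RMSD near `center`."""
    return 0.5 * k_spring * (rmsd - center) ** 2

def umbrella_move_prob(rmsd, center_i, center_j, g_i, g_j, k_spring=1.0, kT=1.0):
    """Acceptance probability for switching umbrella i -> j at a fixed structure."""
    delta_bias = (umbrella_bias(rmsd, center_j, k_spring)
                  - umbrella_bias(rmsd, center_i, k_spring))
    log_ratio = -delta_bias / kT + (g_j - g_i)
    return math.exp(min(0.0, log_ratio))  # equals min(1, exp(log_ratio))

def try_umbrella_move(rmsd, index, centers, weights, rng=random):
    """Propose a neighboring umbrella and return the resulting umbrella index."""
    proposal = index + rng.choice((-1, 1))
    if not 0 <= proposal < len(centers):
        return index  # proposals past either end of the ladder are rejected
    p = umbrella_move_prob(rmsd, centers[index], centers[proposal],
                           weights[index], weights[proposal])
    return proposal if rng.random() < p else index
```

The temperature never changes in this scheme; only the restraint center does, so a sequence of accepted moves carries the system between folded and unfolded RMSD ranges. The biased samples would then need to be reweighted (for example with MBAR, as described above) to recover unbiased averages.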


Figure 36. Time evolution of the Cα RMSD of the current umbrella center for five representative

simulations demonstrating the presence of reversible folding.

We have applied this method to two systems: protein G (PDB code 1igd) (193)

and engrailed homeodomain (PDB code 1enh) (204). Protein G has an α/β fold while

engrailed homeodomain (EH) is a 3-helix bundle. Because these systems contain both

major protein secondary structure motifs our conclusions should be applicable to most

protein systems.

A THERMODYNAMIC PERSPECTIVE

The average energy (or score), conformational entropy, and free energy as a function

of the RMSD for both protein G and EH are shown in Figure 37. The average score

has a clear correlation with the RMSD and the native state is at the scoring function’s

global minimum for both systems. Thus, Rosetta’s full-atom scoring function is

indeed able to recognize diverse protein native states. However, the conformational

entropy of the native state is extremely low for both proteins. In fact, at the

temperature used for full-atom Rosetta structure prediction in the CASP

competitions (0.8 in arbitrary units, internal to the Rosetta code) the entropy


dominates the free energy. As a result, the native state is the free energy maximum

instead of the desired minimum.

Figure 37. Average energy (<∆E>), conformational entropy (<∆S>), and free energy (<∆F>) as a

function of Cα RMSD for protein G and engrailed homeodomain (EH).

This observation gives some insight into the limitations currently observed

with Rosetta structure prediction. Rosetta uses a hierarchical approach in which

coarse-grained structure predictions are made and then used as starting points for full-

atom refinement (49). A number of recent works have noted that for full-atom

refinement to be successful, i.e. reach RMSD values less than 2 Å, the initial

configuration must be within a “radius of convergence” of about 3 Å from the native

state (150, 199). Our results show that the free energy difference between 3 Å and 2 Å

is about 5 kT and, therefore, sampling a 2 Å structure when starting from a 3 Å


structure is extremely unlikely. The improbability of moving to lower RMSD

structures is consistent with the fact that one to ten thousand independent runs must be

performed in order to find a few accurate full-atom structures with Rosetta’s ab initio

structure prediction protocol (151).
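
To put the 5 kT figure in perspective, the following sketch converts a free energy gap into a relative equilibrium population via the Boltzmann factor. Treating the 2 Å and 3 Å regions as two states at equilibrium is a simplifying assumption made here for illustration, and the function name is hypothetical.

```python
import math

def relative_population(delta_f_kt):
    """Equilibrium population ratio for a free energy gap given in units of kT."""
    return math.exp(-delta_f_kt)

ratio = relative_population(5.0)  # population at 2 Å relative to 3 Å
one_in = 1.0 / ratio              # equivalent "one in N" odds
```

For a gap of 5 kT the ratio is below 1%, corresponding to odds on the order of one in 150.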

TEMPERATURE DEPENDENCE OF THE FREE ENERGY

The relative importance of the energetic and entropic contributions to the free energy

may be tuned by adjusting the temperature ($\Delta F = \Delta E - T \Delta S$). Namely, the energetic

term will dominate at sufficiently low temperatures while the entropic term will

dominate at higher temperatures. By assuming that the average energy and

conformational entropy are independent of temperature we are able to predict the

temperature dependence of the free energy. We can then predict what temperature one

would have to use in Rosetta structure prediction in order for the free energy

landscape to have the desired correlation with the RMSD.
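
The extrapolation described above can be sketched as follows, assuming the usual relation F = E - TS with the energy and entropy profiles held fixed. The profiles below are invented toy numbers chosen only to show the qualitative behavior; they are not the values measured for protein G or EH, and the function name is hypothetical.

```python
def free_energy_profile(energy, entropy, temperature):
    """F(r) = E(r) - T * S(r) per RMSD bin, with E and S assumed
    independent of temperature."""
    return [e - temperature * s for e, s in zip(energy, entropy)]

# Toy profiles (hypothetical), ordered from the native state outward.
# Both energy and entropy increase with RMSD.
energy = [0.0, 5.0, 10.0]
entropy = [0.0, 12.0, 20.0]

f_high = free_energy_profile(energy, entropy, temperature=0.8)
f_low = free_energy_profile(energy, entropy, temperature=0.1)
```

With these toy numbers the native bin is the free energy maximum at a temperature of 0.8 and the minimum at 0.1, mirroring the qualitative crossover discussed in the text.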

We find that the free energy landscape has the desired shape (i.e. stable native

state, unstable unfolded state) at temperatures below 0.5, as shown in Figure 38. At

temperatures above 0.5 the free energy landscape still has a maximum at the native

state. At a temperature of about 0.5 there are still non-trivial barriers between the

native and unfolded state but the free energy landscape is essentially flat relative to

other temperatures.

Figure 38. Average free energies (<∆F>) as a function of Cα RMSD for temperatures of 0.5 and 0.1 for

protein G and engrailed homeodomain (EH). The black lines are the hypothesized free energy at

the given temperature and the dash-dot lines are the free energy at temperature 0.8 shown for

reference.

EXPLOITING THE TEMPERATURE DEPENDENCE

While the projections of the thermodynamic landscapes shown in Figure 37 and

Figure 38 appear to be smooth, the true landscapes are actually quite rugged due to

energetic terms like hydrogen bonding and van der Waals interactions. In order to

explore this space the standard Rosetta full-atom refinement protocol uses a

combination of MC and minimization moves (49). The minimization moves are

intended to guide the protein towards the native state at the energy minimum while the

MC moves are intended to help the protein overcome small barriers. For the MC

moves to perform this function they must use a sufficiently high temperature to

overcome small barriers but a low enough temperature to avoid mitigating the

effectiveness of the minimization moves. Simply running the standard protocol at a

lower temperature is likely to destroy this balance and prevent the system from

overcoming even trivially small barriers, thus drastically slowing the dynamics.

However, using our insights into the temperature dependence of the free energy

landscape it may be possible to devise a temperature ST protocol that could overcome

this roughness and reach the native state.

To test this hypothesis we have implemented a temperature ST version of the

full-atom Rosetta refinement protocol, as well as a variant of the standard protocol that

runs at a temperature of 0.1. For the ST variant we used a temperature range of 0.1 to

0.5 and a purely MC move set in order to obey detailed balance. Broad sampling

should be possible at a temperature of 0.5 because of the relative flatness of the

landscape, while at lower temperatures the native state should be favored.

Temperatures above 0.5 are not used because they would favor unfolding. The low

temperature variant allows us to ensure that any improvements seen with the ST

variant over the standard protocol are not simply the result of running at lower

temperatures. Both the standard and low temperature variants use the full set of MC

and minimization moves available in Rosetta.

Our ST variant is found to outperform both standard Rosetta and the low

temperature variant. For each of these three protocols we performed 100 runs starting

from a 5.7 Å structure, well beyond the radius of convergence, drawn from our

umbrella sampling simulations. Figure 39 shows our 5.7 Å starting structure alongside

protein G’s native state as a reference. Figure 40 shows histograms of the lowest

RMSD found in each run. One ST run reached an RMSD value of 4.8 Å and 37% of

the ST runs found structures with RMSD values lower than the initial configuration.

However, neither the standard protocol nor the low temperature variant was able to

find any structures with RMSD values less than that of the initial configuration. The

increased ability of our ST protocol to move towards the native state demonstrates that

utilizing explicit knowledge of the entropic contribution to the free energy may

improve structure prediction, even when the physical conformational entropy is not of

interest.

Figure 39. (A) The native structure of protein G and (B) the 5.7 Å starting structure used for comparing

the ST and Standard Rosetta variants.

Figure 40. Distribution of the minimum Cα RMSD values reached by 100 Simulated Tempering (ST)

and 100 standard Rosetta runs started from a 5.7 Å structure. Results for both the low temperature

and standard Rosetta variants were identical so only a single plot is shown.

PHYSICAL PERSPECTIVE ON ENERGETIC TERMS

A physical perspective may also be taken in order to evaluate and improve individual

energetic terms. For example, Rosetta’s hydrogen bonding term (187) is seen as a

critical component of the full-atom scoring function (199). While this term agrees with

quantum calculations (188), it has been found empirically that the hydrogen bonding

potential only helps discriminate between models within about 3 Å of the native state

(187).

We find that the hydrogen bonding term actually impedes the kinetics of

convergence while providing only a minor energetic advantage to near-native states

and, therefore, ultimately impedes rapid and accurate structure prediction. Figure 41

shows that the average hydrogen bonding energy is somewhat lower within about 3 Å

of the native state for protein G but not for EH. For both systems, however, the

average hydrogen bonding energy is basically flat relative to the total energy. Because

the average hydrogen bonding energy is flat, it does not necessarily provide any

guiding force to bias the system towards the native state.

Figure 41. Relative magnitude of the average hydrogen bonding energy (solid line) versus the total

average energy (dash-dot line) as a function of Cα RMSD for protein G and engrailed

homeodomain (EH).

Shmygelska and Levitt have reported that Rosetta’s hydrogen bonding

potential is better able to discriminate native from non-native states than the low-

resolution potentials (200). The most likely explanation for this apparent discrepancy

is that they weighted the hydrogen bonding term more heavily. During our simulations

the long-range hydrogen bonding term was weighted by a factor of one while the

short-range term was weighted by a factor of 0.5, following the protocol used by the

Baker group in CASP 7. If these terms were weighted more heavily relative to the rest

of the potential a stronger bias towards the native state could arise. For example, the

small dip we observe in the hydrogen bonding term for protein G could become quite

substantial. Comparing our results with those of Shmygelska and Levitt is also

complicated by the fact that they sampled the hydrogen bonding term in the context of

Rosetta’s less accurate low-resolution potentials while we have sampled it in the

context of the more accurate full-atom potential. A more extensive comparison of our

methods in the context of the full-atom potential is an interesting future direction.

We suggest that structure prediction potentials could possibly be improved by

avoiding such flat terms or reweighting them such that they provide a substantial

biasing force towards the native state. We note that proteins can have surprisingly fast

kinetics, with some small proteins folding on the microsecond time scale (57). One

outstanding question is whether it is even feasible to design a knowledge-based

potential that can accurately identify protein native states and have kinetics that are

faster than physical kinetics. If not, physics-based methods may actually be the fastest

algorithms for complex systems, as they may be able to take advantage of the

evolutionary optimization and physical processes that underlie the natural

kinetics of protein folding. Even if this is not the case, our results show that structure

prediction may benefit by taking advantage of ideas developed to better understand

folding mechanisms. Informatics approaches that incorporate more physical insights

into protein folding mechanisms are thus an interesting direction (205-207).

CONCLUSIONS

Our results demonstrate that explicitly accounting for conformational entropy and

considering the kinetics of convergence may improve structure prediction even if

physical conformational entropies and kinetics are not of interest. For example, by

understanding the interplay between energy and conformational entropy one can

choose an optimal temperature or set of temperatures to use for exploring

conformational space. By considering the kinetics of convergence one can ensure that

this space can be explored rapidly, resulting in computationally efficient structure

prediction protocols. An outstanding question is whether it is possible to design

knowledge-based potentials with better entropic and kinetic properties than our

physical reality. If not, physics based structure prediction may ultimately be necessary

for more complex systems. Whether or not this is the case, our results show that

structure prediction may benefit by taking advantage of ideas developed to better

understand folding mechanisms.

MATERIALS & METHODS

All structural representations were generated using VMD (67).

TEMPERATURE ST

Temperature ST (24, 25) simulations perform a random walk within a pre-determined

temperature set (T1, …, Tn). This is accomplished using an expanded Hamiltonian

$$H(X) = \beta_i E(X) - g_i$$

where $\beta_i = 1/(kT_i)$, E(X) is the energy (or score) of the current configuration (X), and

$g_i$ is the weight corresponding to $T_i$. At regular intervals the simulation attempts to

move either up or down in temperature space with equal probability. The probability

of accepting a given move is

$$P(i \to j) = \min\left(1, e^{-(\beta_j - \beta_i) E(X) + (g_j - g_i)}\right)$$

where P (i→j) is the probability of moving from Ti to Tj.
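
A temperature move of this kind might be sketched as follows. This is a minimal illustration, not the Rosetta implementation: the acceptance rule is written in the standard simulated-tempering form, min(1, exp(−(β_j − β_i)E + (g_j − g_i))), the function names are hypothetical, the weights are placeholders rather than STEAR output, and rejecting proposals that step off either end of the temperature list is an assumption.

```python
import math
import random


def st_accept_probability(energy, beta_i, beta_j, g_i, g_j):
    """P(i->j) = min(1, exp(-(beta_j - beta_i) * E + (g_j - g_i)))."""
    exponent = -(beta_j - beta_i) * energy + (g_j - g_i)
    return 1.0 if exponent >= 0 else math.exp(exponent)


def attempt_temperature_move(i, energy, betas, weights, rng=random):
    """Propose a neighbouring temperature index and accept or reject it."""
    j = i + rng.choice((-1, 1))   # up or down with equal probability
    if j < 0 or j >= len(betas):
        return i                  # assumption: off-ladder proposals are rejected
    p = st_accept_probability(energy, betas[i], betas[j], weights[i], weights[j])
    return j if rng.random() < p else i


temperatures = [0.1, 0.15, 0.2, 0.3, 0.4, 0.5]   # temperature list from the text, with k = 1
betas = [1.0 / t for t in temperatures]
weights = [0.0] * len(temperatures)               # placeholder weights, not STEAR output
```

In a full simulation this function would be called every 50 MC steps with the current score, and the weights would be refined so that all temperatures are visited with comparable frequency.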

Our temperature ST simulations used a temperature list of 0.1, 0.15, 0.2, 0.3,

0.4, and 0.5 in arbitrary units internal to the Rosetta code and temperature exchanges

were attempted every 50 steps. All weights were determined using the Simulated

Tempering Equal Acceptance Ratio (STEAR) method (49). This method obtains an

initial estimate of the weights from short constant temperature simulations at each

temperature and then refines these weights in subsequent ST simulations before

holding them constant in the final data collection phase. Two iterations of weight

refinement consisting of 100 runs of 600,000 steps were performed for temperature ST

simulations, followed by 100 runs of 600,000 steps for data collection. In order to

maintain detailed balance the ST simulations only used MC moves in torsion space.

RMSD ST

RMSD ST simulations perform a random walk amongst a predetermined set of

umbrellas constraining the system to a given RMSD from the native state without

changing the system’s temperature. In this case the expanded Hamiltonian and

probability of accepting a move are

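$$H(X) = \beta\left[E(X) + a\,(\mathrm{RMSD}_{current} - \mathrm{RMSD}_i)^2\right] - g_i$$

$$P(i \to j) = \min\left(1, e^{-\beta a\left[(\mathrm{RMSD}_{current} - \mathrm{RMSD}_j)^2 - (\mathrm{RMSD}_{current} - \mathrm{RMSD}_i)^2\right] + (g_j - g_i)}\right)$$

where $\beta = 1/(kT)$, E(X) is the energy of the current configuration (X), RMSDcurrent is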

the current RMSD from the native state, RMSDi is the center of umbrella i, and “a”

determines the strength of the spring constraining the system to a given umbrella.
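
The umbrella jump can be sketched in the same way. This is a minimal illustration under stated assumptions: the acceptance rule is the standard form min(1, exp(−βa[(RMSD_current − RMSD_j)² − (RMSD_current − RMSD_i)²] + (g_j − g_i))), the function name is hypothetical, and β = 1 is an arbitrary default. Because the configuration does not change during a jump, the energy E(X) cancels and only the harmonic bias terms and weights appear.

```python
import math


def umbrella_accept_probability(rmsd_current, center_i, center_j, g_i, g_j,
                                a=3.0, beta=1.0):
    """Probability of jumping from umbrella i to umbrella j at fixed temperature."""
    bias_i = a * (rmsd_current - center_i) ** 2
    bias_j = a * (rmsd_current - center_j) ** 2
    exponent = -beta * (bias_j - bias_i) + (g_j - g_i)
    return 1.0 if exponent >= 0 else math.exp(exponent)


# Umbrella centers from 0.5 to 10 Å at 0.5 Å intervals.
centers = [0.5 * n for n in range(1, 21)]
```

A jump towards the umbrella whose center is closer to the current RMSD is always accepted when the weights are equal, while a jump away from it is penalized by the difference in bias energy.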

Our RMSD ST simulations used umbrellas centered at RMSD values from 0.5

to 10 Å at 0.5 Å intervals and jumps between neighboring umbrellas were attempted

every 50 steps. The “a” parameter was set to three. All weights were determined using

the Simulated Tempering Equal Acceptance Ratio (STEAR) method (49). This

method obtains an initial estimate of the weights from short umbrella simulations at

each umbrella (without any jumps between umbrellas) and then refines these weights

in subsequent RMSD ST simulations before holding them constant in the final data

collection phase. Three iterations of weight refinement consisting of 100 runs of

1,700,000 steps were performed for RMSD ST simulations, followed by 100 runs of

900,000,000 steps for data collection. In order to maintain detailed balance the RMSD

ST simulations only used MC moves in torsion space.

ROSETTA

For an overview of the Rosetta structure prediction algorithm and the command-line

options used in this study see reference (49). The full Rosetta move set was used for

standard Rosetta runs. The same number of moves was used when comparing standard

Rosetta runs with ST.

CHAPTER 9: STRUCTURAL INSIGHT INTO RNA HAIRPIN FOLDING

INTERMEDIATES

This chapter was taken from: Bowman GR, et al. (2008) Structural insight into RNA

hairpin folding intermediates. J Am Chem Soc 130:9676-9678.

ABSTRACT

Hairpins are a ubiquitous secondary structure motif in structured RNA molecules.

Despite their simple structure, there is some debate over whether they fold in a two-

state or multi-state manner. We have studied the folding of a small tetraloop hairpin

using a serial version of the replica exchange method on a distributed computing

environment. Based on these simulations we have identified a number of intermediates

that are consistent with experimental results. We also find that folding is not simply

the reverse of unfolding and suggest that this may be a general feature of biomolecular

folding.

INTRODUCTION

RNA hairpins are one of the most common secondary structure motifs, appearing in

almost every large RNA structure (208-210). In addition to serving as nucleation sites

for RNA folding (211), they may also guide RNA folding by forming tertiary contacts

(212, 213) and serve as recognition sites for RNA binding proteins (214). They are

potential drug targets (215), terminate transcription (211), and influence translation

through their role as aptamer domains in riboswitches (216). Despite the great variety

of functions they may serve, hairpins are one of the simplest RNA motifs, requiring

only monovalent ions to fold. Thus, understanding the folding of small RNA hairpins

is both a critical first step in understanding the folding of larger RNA molecules (215)

and amenable to computer simulation (217-219).

RNA hairpins consist of a primarily Watson-Crick base-paired stem capped

with a loop of unpaired or non-Watson-Crick base-paired nucleotides. Tetraloops,

such as the GCAA tetraloop (5’-GGGCGCAAGCCU-3’) examined in this work and

shown in Figure 42, have four such bases in their loop. This particular structure was

chosen due to its predominance in the ribosome (210).

Figure 42. (A) NMR structure of the GCAA tetraloop. (B) Contact map for the native state. Bases are

numbered from 5’ to 3’ and native base-pair contacts (dotted lines) are numbered 1-4.

Despite their simple structure there is some controversy over whether these

hairpins fold in a two-state or multi-state manner. The two-state hypothesis for nucleic

acid hairpins is primarily based on thermodynamic measurements. For example,

Ansari et al. found similar sigmoidal melting curves when they monitored all the base-

pairing interactions or a subset of fluorescently labeled nucleotides (220). The multi-

state hypothesis is based on kinetic measurements, such as FCS and T-jump

experiments. For example, Jung et al. found discrepancies between equilibrium

distributions from FCS and melting experiments (221). More recently, Ma et al. found

evidence of melting in T-jump experiments starting at temperatures above the melting

temperature (TM), indicating that the supposed unfolded state in melting experiments

is not completely unstructured (222, 223). These authors went on to propose an

intermediate state in which the ends of the hairpin are in contact but the base-pairing

and base-stacking interactions in the stem are not yet formed.

To determine if there is in fact an intermediate and, if so, what its structure is,

we have run Serial Replica Exchange Molecular Dynamics (SREMD) (177, 224)

simulations of the GCAA tetraloop depicted in Figure 42. Due to the heterogeneity of

the loop (225, 226) we have defined the native state as any conformation with all four

stem base-pair contacts formed, numbered as shown in Figure 42B. We refer to these

base-pair contacts as native contacts. Two nucleotides are considered to be contacting

if any two atoms, one from each nucleotide, fall within 3 Å of each other. Thus, a

structure can be well described by a contact map—a bit string specifying which

residues are in contact.
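
The contact definition can be sketched as follows. This is a minimal illustration with hypothetical function names and toy coordinates. Representing each nucleotide as a list of (x, y, z) atom positions and including every nucleotide pair in the bit string are assumptions made for the example.

```python
import math
from itertools import combinations


def in_contact(atoms_a, atoms_b, cutoff=3.0):
    """True if any atom of one nucleotide lies within `cutoff` Å of any atom of the other."""
    return any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b)


def contact_map(nucleotides, cutoff=3.0):
    """Return a bit string with one bit per nucleotide pair (i < j)."""
    pairs = combinations(range(len(nucleotides)), 2)
    return "".join(
        "1" if in_contact(nucleotides[i], nucleotides[j], cutoff) else "0"
        for i, j in pairs
    )


# Toy coordinates in Å: three "nucleotides" with one or two atoms each.
nucleotides = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(3.5, 0.0, 0.0)],
    [(10.0, 0.0, 0.0)],
]
print(contact_map(nucleotides))
```

Only the first pair is in contact in this toy example, because the closest atoms of the first two nucleotides are 2.5 Å apart.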


Previously, Sorin et al. studied the folding of this system using constant temperature

Molecular Dynamics (MD) and explicit solvent (217). While these studies provided

valuable insight into the folding of RNA hairpins, only 19 folding events were

observed within the thousands of simulations run. We have applied SREMD on the

Folding@home infrastructure to obtain better sampling and, therefore, greater insight

into RNA folding.

SREMD is a serial version of Replica Exchange Molecular Dynamics

(REMD) (22, 23), which induces the system to perform a random walk in temperature

space such that broad sampling is achieved at high temperature and detailed

exploration of free energy minima is achieved at low temperature. In REMD, multiple

simulations are run, each at a different temperature. A random walk in temperature

space is achieved by periodically attempting to swap the conformations at two

neighboring temperatures. The probability of accepting a swap is

$$P(i \to j) = \min\left(1, e^{(\beta_i - \beta_j)(U_i - U_j)}\right) \qquad (1)$$

where P (i→j) is the probability of transitioning from temperature i (Ti) to temperature

j (Tj), βi is 1/ (kTi), and Ui is the potential energy of the conformation at Ti. Thus, the

detailed balance condition is satisfied. SREMD allows any number of asynchronous

simulations to be run, making it more suitable for distributed computing than standard

REMD (177). This is accomplished by providing each simulation with the Potential

Energy Distribution Function (PEDF) for each temperature. SREMD uses the same

criteria for swapping temperatures as REMD except that the energy of the current

conformation is compared to an energy randomly drawn from the neighboring

temperature’s PEDF rather than the energy from a parallel simulation. The simulation

parameters are described in detail in Appendix G.
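
The serial exchange step can be sketched as follows. This is a minimal illustration, assuming the standard replica-exchange criterion min(1, exp((β_i − β_j)(U_i − U_j))) and assuming that the PEDF is available as a plain list of previously sampled energies. The function name is hypothetical, and this is not the Folding@home implementation.

```python
import math
import random


def sremd_swap_accepted(u_current, beta_i, beta_j, pedf_j, rng=random):
    """Decide whether a single simulation moves from temperature i to temperature j.

    Instead of the energy of a parallel replica at temperature j, an energy is
    drawn at random from that temperature's potential energy distribution.
    """
    u_j = rng.choice(pedf_j)   # stand-in for the neighbouring replica's energy
    exponent = (beta_i - beta_j) * (u_current - u_j)
    p = 1.0 if exponent >= 0 else math.exp(exponent)
    return rng.random() < p
```

Because each simulation needs only the stored distributions, any number of simulations can run asynchronously without communicating with one another.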

We ran 2,800 SREMD simulations with an aggregate simulation time of 54.6

µs starting from the NMR structure (PDB code 1ZIH) (209). Even with this amount of

simulation, reversible folding was not achieved and we cannot claim to be at

equilibrium (203). However, we did observe 760 trajectories with a complete

unfolding event and 550 trajectories with a complete refolding event. Thus, we have

sufficient data to define the dominant states in the folding and unfolding pathways,

though we cannot give their relative probabilities. While SREMD will not give any

kinetic information directly, an analysis of the relevant thermodynamic states can

yield information about the states along the folding and unfolding pathways.

An unfolding event is defined as the set of conformations between the first

point with no contacts between any two residues on opposite sides of the stem and the

first preceding point with four native contacts. A refolding event is defined as the set

of conformations between the first point with no contacts between any two residues on

opposite sides of the stem and the first subsequent point where the number of native

contacts is four.
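
These definitions can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming each frame is summarized as a pair (number of native contacts, number of cross-stem contacts) and reading "first preceding point" as the closest earlier frame with four native contacts.

```python
def first_unfolded_frame(frames):
    """Index of the first frame with no cross-stem contacts, or None."""
    return next((t for t, (_, cross) in enumerate(frames) if cross == 0), None)


def unfolding_event(frames):
    """Return (start, end) indices of the unfolding event, or None."""
    end = first_unfolded_frame(frames)
    if end is None:
        return None
    for t in range(end - 1, -1, -1):   # walk back to the closest fully native frame
        if frames[t][0] == 4:
            return (t, end)
    return None


def refolding_event(frames):
    """Return (start, end) indices of the refolding event, or None."""
    start = first_unfolded_frame(frames)
    if start is None:
        return None
    for t in range(start + 1, len(frames)):   # first later frame with four native contacts
        if frames[t][0] == 4:
            return (start, t)
    return None


# Toy trajectory: (native contacts, cross-stem contacts) per frame.
frames = [(4, 4), (3, 4), (4, 5), (2, 3), (0, 1), (0, 0),
          (1, 1), (0, 0), (1, 2), (3, 3), (4, 4)]
print(unfolding_event(frames), refolding_event(frames))
```

Both functions return None for a trajectory that never loses all cross-stem contacts, so such trajectories contribute to neither set of events.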

We used Mapper (227, 228), a topological data analysis algorithm, to identify

the dominant states in the folding and unfolding pathways. For example, to understand

unfolding we applied the Mapper technique to conformations from unfolding events,

where the conformations were represented by contact maps. The Mapper clustering

technique works as follows. First, the similarity between each pair of conformations

was determined using the Hamming distance metric. The data set of interest was then

divided into overlapping subsets based on the density of configurations around each

conformation, allowing efficient identification of intermediate states with low

populations as well as folded/unfolded states with high populations. Single-linkage

clustering was carried out in each subset, facilitating the identification of non-convex

clusters. Finally, a graph was generated that represents the connectivity between

clusters in different density levels based on their degree of overlap. More details are

provided in the SI.
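
Two ingredients of this procedure, the Hamming distance between contact maps and single-linkage clustering, can be sketched as follows. This is not the Mapper algorithm itself: the density-based overlapping subsets and the final connectivity graph are omitted, and the function names and the distance threshold are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length contact maps differ."""
    if len(a) != len(b):
        raise ValueError("contact maps must have the same length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage(maps, threshold):
    """Cluster maps so that any two members are joined by a chain of
    neighbours that are each within `threshold` of the next."""
    parent = list(range(len(maps)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(maps)):
        for j in range(i + 1, len(maps)):
            if hamming(maps[i], maps[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(maps)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy contact maps (bit strings).
maps = ["1111", "1110", "1100", "0000", "0001"]
print(single_linkage(maps, threshold=1))
```

In the toy example the first and third maps differ by two bits yet share a cluster because the second map links them, which illustrates how single linkage can recover elongated, non-convex clusters.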

In SREMD, replicas visiting high temperatures lead to rapid unfolding. To better

understand this unfolding process, we first calculated the probability of having one,

two, or three native contacts during unfolding as shown in Figure 43A. This data

indicates that there is substantial breathing, with one or two base-pairs being broken

and reformed, but that complete unfolding quickly follows the breakage of three native

contacts. Further insight is provided by Figure 43C, where we show the probability of

each native contact given that a certain number of native contacts are present.

Apparently, unfolding has a single dominant pathway characterized by unzipping from

the end. This result is confirmed by Mapper, as shown in Figure 44. There is no cluster

corresponding to a single native contact due to the low probability of such structures.

Structures with three native contacts also appear to be absorbed into either the native

cluster or the cluster with only two base-pairs formed, probably due to the use of the

simple Hamming distance metric.
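
The two statistics used in this analysis can be sketched as follows. This is a minimal illustration with a hypothetical function name and toy data, assuming each conformation is reduced to four bits, one per native contact.

```python
from collections import Counter, defaultdict


def contact_statistics(frames):
    """frames: list of 4-bit tuples, one bit per native contact (1-4).

    Returns P(number of native contacts) and, for each number of contacts,
    P(contact k is formed | that number of native contacts).
    """
    count_hist = Counter(sum(f) for f in frames)
    total = sum(count_hist.values())
    p_count = {n: c / total for n, c in count_hist.items()}

    per_contact = defaultdict(lambda: [0, 0, 0, 0])
    for f in frames:
        n = sum(f)
        for k, bit in enumerate(f):
            per_contact[n][k] += bit

    p_contact_given_count = {
        n: [c / count_hist[n] for c in counts]
        for n, counts in per_contact.items()
    }
    return p_count, p_contact_given_count


# Toy unfolding data in which contacts are lost from the end (contact 4) first.
frames = [(1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 0),
          (1, 1, 0, 0), (1, 0, 0, 0), (0, 0, 0, 0)]
p_count, p_given = contact_statistics(frames)
print(p_given[2])
```

In this toy data set every frame with two native contacts retains contacts 1 and 2, which is the signature expected for unzipping from the end.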

Figure 43. The probability of a given number of native contacts during (A) unfolding and (B) refolding.

(C) The probability of each contact when a given number of contacts are present during unfolding

and refolding with the arrows representing the direction of movement between the unfolded state

(U) and the folded state (F).

Figure 44. Contact maps representing the cluster centers from independent clustering of the unfolding

(A) and refolding data (B). The grey lines represent the connectivity of the states. The blue lines

represent native contacts with a probability of 0.6 or greater within the cluster. Intermediate

structures are labeled A-D.

Figure 43B shows that there is often a single contact present during refolding

but adding subsequent base-pairs becomes progressively less likely. Thus, there are

many nucleation events consisting of the formation of a single native contact but few

proceed to the folded state. Figure 43C again shows the probability of each contact

given that a certain number of contacts are present. When a single native contact is

present, it is most likely either the closing base-pair or the end base-pair (native contacts

1 and 4, respectively). The higher probability of native contact 1 is probably due to the

close spatial proximity of the two participating residues imposed by their close

proximity in the sequence. The higher probability of native contact 4 may be

explained by the lack of steric hindrance relative to the other native contacts. Once

two or three native contacts are formed each is more or less equally probable, which is

consistent with numerous models.

The results from Mapper shown in Figure 44 give more insight. The first step

is either the formation of the closing base-pair or the end base-pair. This is followed

by the formation of native contacts 1 and 2 and subsequent folding is dominated by

zipping. Presumably, the formation of the end base-pair facilitates the formation of

native contacts 1 and 2 by reducing the conformational space that needs to be

searched, as predicted by Ma et al. (222). The fact that the end base-pair does not

appear in the center of the cluster with two native contacts does not mean it breaks as

folding proceeds, just that it does not occur frequently within the cluster. This is

consistent with the fact that about four times as many refolding events occur through

the pathway starting with the formation of native contact 1 as go through the pathway

starting with the formation of native contact 4. Once again, we note these relative

probabilities are not necessarily expected to be found in experimental studies due to

the random walk in temperature space our simulations undergo. However, these are

expected to be the two dominant pathways.

The two folding pathways observed here are consistent with the zipping and

compaction mechanisms observed by Sorin et al. (217) as well as experimental work

pointing to the presence of multiple folding pathways (215, 229). Furthermore, these

results support the hypothesis that the folding pathway of RNA hairpins has at least

three states. In particular, the collapsed structure with a single native contact at

the end base-pair is consistent with the intermediate structure proposed by Ma et al.

(222). However, the other clusters along the folding pathway with one, two, or three

native contacts formed may also contribute to the experimental signal. Full-atom

structures for each of these intermediates are shown in Figure 45. Reptation (defined

as the sliding of the two strands of the stem relative to one another) is not one of the

dominant folding pathways, in agreement with results for small β-hairpins (230).

Thus, it appears that misfolded states must unfold before refolding properly, although

we cannot discount the possibility that they may contribute to folding on longer

timescales than our simulations reach. Results from the unfolding analysis using

Mapper lend further support to this hypothesis. They include small clusters of reptated

structures between the folded and intermediate states (data not shown), consistent with

the idea that misfolding serves as an off-pathway trap that slows the overall folding

process (215, 220, 223, 231).

Figure 45. Representative full-atom structures for the intermediate states with labels (A)-(D)

corresponding to the labels A-D in Figure 44.

Another result of this work is that folding and high temperature unfolding

follow different pathways. We propose that this may be a general feature of hairpin

folding, due to the intrinsic similarities in the thermodynamic forces which stabilize

their structure. Furthermore, the amount of sampling we have achieved and the fact

that we have still not reached convergence call into question the results of shorter

REMD studies. Such simulations will be dominated by non-equilibrium unfolding,

which as we show here does not necessarily provide any insight into folding. Applying

measures of convergence, such as reversible folding or agreement between simulations

with different starting states, is critical for validating such studies.

CONCLUSIONS

The results presented here support recent work indicating that the folding of even the

smallest of RNA motifs is more complicated than previously suspected. We have

identified a number of folding intermediates consistent with experimental

observations. We also found multiple highly populated folding pathways but only a

single dominant unfolding pathway. Significant sampling was necessary to gain any

statistics on folding, indicating that shorter simulations are dominated by unfolding,

which differs from the folding pathway in this system. In future work we intend to

determine the sequence dependence of intermediate states and folding kinetics. Such

work will also provide more insight into whether or not folding and unfolding differ

for biomolecules in general.

CHAPTER 10: RAPID EQUILIBRIUM SAMPLING INITIATED FROM NON-

EQUILIBRIUM DATA

This chapter was taken from: Huang X, Bowman GR, Bacallado S, & Pande VS

(2009) Rapid equilibrium sampling initiated from nonequilibrium data. PNAS

106:19765-19769.

ABSTRACT

Simulating the conformational dynamics of biomolecules is extremely difficult due to

the rugged nature of their free energy landscapes and multiple long-lived, or

metastable, states. Generalized Ensemble (GE) algorithms, which have become quite

popular in recent years, attempt to facilitate crossing between states at low

temperatures by inducing a random walk in temperature space. Enthalpic barriers may

be crossed more easily at high temperatures; however, entropic barriers will become

more significant. This poses a problem because the dominant barriers to

conformational change are entropic for many biological systems, such as the short

RNA hairpin studied here. We present a new efficient algorithm for conformational

sampling, called the Adaptive Seeding Method (ASM), that uses non-equilibrium GE

simulations to identify the metastable states and seeds short simulations at constant

temperature from each of them to quantitatively determine their equilibrium

populations. Thus, the ASM takes advantage of the broad sampling possible with GE

algorithms but generally crosses entropic barriers more efficiently during the seeding

simulations at low temperature. We show that only local equilibrium is necessary for

ASM so very short seeding simulations may be used. Moreover, the ASM may be

used to recover equilibrium properties from existing datasets that failed to converge

and is well-suited to running on modern computer clusters.

INTRODUCTION

The functions of biological macromolecules are in large part determined by their

structure and dynamics. As such, many experimental techniques have been developed

and applied to probe these properties, each of which has its strengths and weaknesses.

Computational methods such as Molecular Dynamics (MD) and Monte Carlo (MC)

simulations have the potential to complement such experiments by modeling the

evolution of entire systems with atomic resolution. However, it is extremely difficult

to obtain equilibrium sampling of even moderately sized systems in atomic

simulations because of the rugged nature of the free energy landscapes that must be

explored. Without adequate sampling, it is impossible to validate the parameters, or

force fields, that determine the interactions between atoms or to address phenomena

that occur on biologically relevant timescales.

Many methods have been developed in an attempt to address the sampling

problem. Generalized Ensemble (GE) algorithms like Replica Exchange Method

(REM) (or Parallel Tempering) (22, 23) and Simulated Tempering (ST) (24, 25) are

popular approaches for studying biomolecular folding (26-28, 177, 232-238). They

attempt to overcome the sampling problem by inducing a random walk in temperature

space while maintaining canonical sampling at each temperature. At high temperatures

energetic barriers may be crossed easily while at low temperatures the system is

generally constrained to local minima. However, recent studies have shown that GE

simulations do not yield converged equilibrium sampling much faster than standard

constant temperature MD if the phenomena of interest are non-Arrhenius. (27, 177,

238-243)

For example, Zuckerman et al. (240) used the Arrhenius equation to argue that

the maximum efficiency gain of GE simulations is no more than an order of

magnitude at physiological temperatures and Zheng et al. (241, 242) used a kinetic

network model to show that there is an optimal temperature for non-Arrhenius folding

kinetics and any time spent above this temperature will decrease the efficiency of GE

simulations. This lack of improvement is the result of the interplay between energy

and entropy. While high temperatures may facilitate the crossing of energetic barriers,

entropic barriers will be more difficult to cross. (27)
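
This interplay can be made explicit with a simple transition-state-style estimate. The following is a sketch, assuming that the activation energy and activation entropy are independent of temperature and ignoring any temperature dependence of the prefactor. The symbol r for the crossing rate is introduced here for illustration.

```latex
\Delta F^{\ddagger}(T) = \Delta E^{\ddagger} - T\,\Delta S^{\ddagger}
\qquad\Longrightarrow\qquad
r(T) \propto e^{-\Delta F^{\ddagger}/kT}
      = e^{\Delta S^{\ddagger}/k}\, e^{-\Delta E^{\ddagger}/kT}
```

Under these assumptions, temperature enters the rate only through the energetic factor. For a purely entropic barrier (ΔE‡ ≈ 0, ΔS‡ < 0), the free energy barrier grows linearly with temperature and heating provides essentially no speedup.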

Thus, GE simulations will provide little improvement when the dominant

barriers are entropic. Hansmann and coworkers have made some effort to improve the

effectiveness of GE algorithms by optimizing the temperature spacing. (244, 245)

However, these methods assume that diffusion in temperature space is the rate limiting

process so crossing entropic barriers in the conformational space, which is the true rate

limiting process, is still a problem. A number of other methods also exist, such as

umbrella sampling and milestoning (156). However, these methods require that the

dominant reaction coordinate is known a priori and this information is often

unavailable.

The sampling problem is exacerbated by the practice of viewing global

equilibration of individual trajectories as a requirement for considering a simulation to

have reached equilibrium. Global equilibration is most naturally obtained by running a

simulation much longer than the longest relaxation time of the system, so that all

degrees of freedom are equilibrated and many uncorrelated samples are generated

from each metastable state. For example, the reversible folding metric holds that a

simulation has reached equilibrium if there are multiple folding and unfolding events.

(203) While this criterion is sufficient, it may not be necessary. Instead of requiring

global equilibration of individual trajectories, we suggest that local equilibration may

be sufficient. Local equilibration may be achieved by using multiple simulations, each

of which visits only a subset of the metastable states with their correct Boltzmann

probabilities but that together cover the entire accessible space. Local equilibration

may require significantly less wall-clock time because shorter simulations (all of

which may be run in parallel) are required. The main difficulty is to analyze multiple

simulations appropriately.

Markov State Models (MSMs) are a powerful tool which can be used to extract

equilibrium properties from a dataset that satisfies the local equilibration criterion.

MSMs partition phase space into metastable states such that intra-state transitions are

fast but inter-state transitions are slow.(3, 6, 11, 33, 34) Such separation of timescales

ensures that the model is Markovian, that is, the probability of being in a given state at

time t+∆t, where ∆t is called the lag time, depends only on the state at time t. The key

point is to build a model with a lag time that is shorter than the timescale of the

process of interest with few enough states that it may be understood easily. Usually

MSMs are used to study kinetics, but here we only derive thermodynamic information

from them. In an MSM, the time evolution of a vector representing the population of

each state may be calculated by repeatedly left-multiplying by the transition

probability matrix.

$P(n\Delta t) = [T(\Delta t)]^{n} P(0)$ (1)

where P(n∆t) is a vector of state populations at time n∆t and T is the column-stochastic

transition probability matrix. The eigenvector of T with eigenvalue one (the stationary

eigenvector) corresponds to the equilibrium distribution (6). Extracting only this distribution

is a useful opportunity: obtaining kinetics from MSMs is challenging, whereas obtaining

equilibrium thermodynamic properties is a less demanding goal because less

information is required. Indeed, we find that the populations of the dominant states are

invariant with respect to the lag time so very short simulations can be used.
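
As an illustration of Equation 1, the following minimal sketch propagates a population vector under a small column-stochastic transition matrix and lets it relax toward the stationary distribution. The three-state matrix, its entries, and the function name are made-up values chosen for illustration; they are not data or code from this work.

```python
# Toy three-state model; all numbers are illustrative assumptions.
# T[i][j] is the probability of moving from state j to state i in one
# lag time, so every column sums to 1 (column-stochastic).
T = [
    [0.90, 0.05, 0.00],
    [0.10, 0.90, 0.10],
    [0.00, 0.05, 0.90],
]


def propagate(T, p, n_steps):
    """Return [T]^n p by repeated left-multiplication (Equation 1)."""
    n = len(p)
    for _ in range(n_steps):
        p = [sum(T[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p


p0 = [1.0, 0.0, 0.0]             # all population starts in state 0
p_long = propagate(T, p0, 2000)  # populations after many lag times
```

For this toy matrix the populations approach (0.25, 0.5, 0.25) regardless of the starting vector, which is the sense in which the stationary eigenvector encodes the equilibrium distribution.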

Here, we introduce the Adaptive Seeding Method (ASM) and show that it

rapidly yields converged thermodynamics even when faced with entropic barriers by

exploiting many simulations at local equilibrium. This is achieved by 1) using non-

equilibrium GE simulations to obtain broad sampling, 2) building a Markov State

Model (MSM) to identify all the metastable states as shown in Figure 46, 3) starting

new constant temperature simulations at the temperature of interest from each

metastable state in a process called seeding, and 4) using MSMs to extract the correct

equilibrium populations from the seeding simulations. Seeding short simulations from

the known equilibrium distribution of alanine dipeptide has been shown to yield good

models for its kinetics (246). A key advance in our new method is that it does not

require that the initial sampling has reached equilibrium. We note that many non-

equilibrium GE datasets have been generated due to the difficulty in reaching

equilibrium and that there is growing interest in recovering equilibrium properties

from such datasets(247). Thus, one strength of the ASM is that steps 2-4 may be used

to recover the correct equilibrium thermodynamic properties from a non-equilibrium

dataset. Furthermore, this procedure may be iterated and combined with adaptive

sampling (161) to most efficiently use one’s computational resources, i.e. using the

fewest and shortest trajectories necessary to achieve a good model, since minimizing

wall clock time is an important consideration for computer simulations.

Figure 46. A schematic free energy landscape with three representative seeding trajectories started from

each basin and a projection of this free energy landscape onto a 2D plane showing the division into

metastable states.

To test the ASM we apply it to a small biomolecular system with long time

scale dynamics: an eight nucleotide RNA hairpin (5’-GCUUUUGC-3’) known as the

UUUU tetraloop. Hairpins are a fundamental RNA secondary structure motif(208) and

perform many biologically relevant functions but our understanding of their folding is

still incomplete.(52, 217, 220, 222, 223) The folding of this hairpin is diffusion

controlled(215, 220, 248), so despite its small size the folding time is on the µs

timescale, as measured by laser temperature jump experiments(223). Thus, capturing a

single folding event with a single MD simulation with explicit solvent would likely

take more than a year on a typical CPU. ASM, however, is able to reach converged

equilibrium sampling within a week using many short parallel simulations, as judged

by agreement on the populations of metastable states between distinct sets of

simulations started from very different initial configurations. ASM is also found to

yield converged thermodynamic properties with at least six times less sampling than

GE simulations for this system. Finally, the fact that the most highly populated

metastable state has a well-formed two base pair stem, as in the NMR structure,

provides some validation of the force field. Since there is no analytical solution for the

equilibrium distribution of our RNA hairpin system, we also studied a 2D potential

where the equilibrium populations can be computed analytically. Using this model, we

confirm that ASM is much more efficient than ST, and also provide some guidelines

for choosing the optimal number and length of the seeding simulations.


COMPARISON OF ASM TO ST

Here we compare the results of our long ASM procedure with an equivalent amount of

ST sampling, as depicted schematically in Figure 47. We ran two distinct sets of

simulations, starting from a near-native and a random-coil configuration, respectively, as

shown in Figure 81. Thus, we are able to judge the convergence of our results by

comparing these two datasets.

Figure 47. Schematic of the adaptive seeding scheme. The top arrow represents our ST trajectories,

which are split into equilibration (green) and production (light blue) phases. The light red and light

yellow boxes encompass our long and short adaptive seeding schemes respectively. For each

adaptive seeding scheme, the dotted lines demark the portion of the ST data used to identify the

dominant thermodynamic, or metastable, states by building an MSM (S). Constant temperature (or

canonical, NVT) simulations are then started from each state and used to build a new MSM (E) that

captures the equilibrium distribution. Both the light yellow and red boxes also encompass a portion

of the original ST data that is equivalent to the amount of sampling used in the adaptive seeding

scheme. An MSM is also built for this data and used as a baseline for judging the efficiency of the

adaptive seeding scheme.

The first step was to run an independent set of one thousand 18 ns ST

simulations starting from each initial configuration to obtain broad sampling. During

an initial equilibration phase (first 9 ns) the weights were updated using the Simulated

Tempering Equal Acceptance Ratio (STEAR) method(49, 177) described in Appendix

H. This procedure was found to give nearly equal sampling of each temperature and

converged weights for each dataset (Appendix H). During the subsequent 9 ns

production phase the weights were held constant. These two sets of ST simulations do

not reach converged sampling because of their short length (data not shown), but they

should be able to reach all the metastable states.

To identify the metastable states we built an independent MSM from the

production phase of each dataset. First, all the conformations from every temperature

were divided into a large number of small sets of very structurally similar, and

therefore likely kinetically similar, conformations called microstates using a

hierarchical K-medoids clustering algorithm as described in Appendix H. We then

used spectral clustering (249, 250) (PCCA (6, 44, 45)) refined with simulated

annealing to lump microstates that can interconvert rapidly into larger states called

metastable states while conformations separated by large free energy barriers are

grouped into different states, as depicted in Figure 46. This algorithm was developed

by Chodera et al. (6) and is also described in the SI. This procedure yielded six states

for each dataset.

To obtain equilibrium sampling we then seeded simulations from each

metastable state. Specifically, 100 random conformations were chosen from each

metastable state and used as starting points for 10 ns constant temperature MD

simulations at 300K. The equilibrium distribution was extracted by building a new

MSM. A common state definition is necessary in order to compare different datasets

so this MSM was built using all the seeding data. Populations with error bars for each

independent dataset were then determined under the same state definition using a

Bayesian method developed by Noé.(251) Figure 48A shows that the populations from

each seeding dataset, as well as the combined data, are in strong agreement and are

therefore converged to the equilibrium distribution. Populations for an equivalent

amount of folded and coil ST data (19 ns) were also calculated by considering only

those conformations at 300 K. These two ST datasets have not converged yet. In

particular, there is a relatively obvious difference in the populations of states 2 and 4

(about 10% and 7% respectively).
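
The seeding step described above can be sketched as follows. This is a minimal illustration, assuming that conformations are drawn without replacement and that state assignments are stored as a mapping from state label to conformation indices; the function name, the data layout, and the toy assignments are hypothetical and not taken from the actual workflow.

```python
import random


def choose_seeds(state_members, n_seeds, rng):
    """Pick starting conformations for seeding from each metastable state.

    state_members maps a state label to the list of conformation indices
    assigned to it (a hypothetical data layout). A state with fewer than
    n_seeds members contributes all of its members.
    """
    seeds = {}
    for state, members in state_members.items():
        k = min(n_seeds, len(members))
        seeds[state] = rng.sample(members, k)  # sampling without replacement
    return seeds


# Made-up assignment of 1,000 conformations to six states.
rng = random.Random(0)
assignments = [rng.randrange(6) for _ in range(1000)]
state_members = {}
for conf_index, state in enumerate(assignments):
    state_members.setdefault(state, []).append(conf_index)

seeds = choose_seeds(state_members, n_seeds=100, rng=rng)
```

Each selected index would then serve as the starting structure of one constant temperature simulation.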

Figure 48. Population of each state (bar graphs correspond to the mean values, and error bars stand for

standard deviations) for (A) the long adaptive seeding scheme (lag time t=4.5 ns) and (B) the short

adaptive seeding scheme (lag time t=4.5 ns).

MSMs are usually used to study kinetics(3). In order to get a reasonable

number of states and ensure that the model is Markovian, a relatively long lag time

must be used, though it generally ought to be shorter than the timescale of the process

of interest. Furthermore, to get accurate kinetics each simulation must be at least a few

times longer than the lag time so that multiple crossings of each barrier may be

observed. For example, Chodera et al. show that a twenty state MSM for the folding

of the helical Fs-peptide (which occurs on a timescale of tens of nanoseconds) requires

a lag time of five nanoseconds.(6) Thus, obtaining accurate kinetics for the UUUU

tetraloop, which folds on a microsecond timescale, would likely require orders of

magnitude longer simulations than for the Fs-peptide. However, obtaining accurate

thermodynamics may require significantly less sampling. In particular, short lag times

where the system is not Markovian may still be sufficient to estimate thermodynamic

properties. In fact, Figure 49 shows that the equilibrium populations of each state are

identical within statistical error regardless of the lag time. Similar observations have

been made by Hummer and coworkers who found that the free energy profile for a

water dewetting transition can be predicted using a very short lag time at which the

kinetics are not reproduced well(11). In addition to the error due to non-Markovian

effects, the statistical error due to insufficient sampling of transition events will also

be smaller for thermodynamic properties. In a model with N states there are only N

thermodynamic parameters to determine whereas getting accurate kinetics requires

determining all N^2 pairwise transition probabilities. Sampling all possible transitions

over-determines the free energy differences between states. Thus, obtaining accurate

thermodynamics may require significantly less sampling.
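
As a rough worked count of this difference, the sketch below only accounts for normalization constraints, so the numbers of independent parameters are slightly smaller than N and N^2. The symbol e_i for the equilibrium population of state i follows Appendix A; applying the count to N = 6 reflects the six-state models used here.

```latex
\begin{align*}
\text{thermodynamics:}\quad & \sum_{i=1}^{N} e_i = 1
  \;\Rightarrow\; N - 1 \text{ independent populations} \\
\text{kinetics:}\quad & \sum_{j=1}^{N} T_{ij} = 1 \text{ for each } i
  \;\Rightarrow\; N(N - 1) \text{ independent transition probabilities} \\
N = 6:\quad & 5 \text{ versus } 30
\end{align*}
```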

Figure 49. Population of each state for the long adaptive seeding scheme as the lag time is varied.

MINIMIZING THE SIMULATION LENGTH

To push the limits of the ASM we repeated the above procedure using drastically less

data (see short ASM in Figure 47). Ten times less data was used for equilibration, six

times less ST data was used to identify the states, and the seeding simulations were

half as long. To maximize our use of this minimal data we combined the folded and

coil ST data to identify the metastable states used for seeding. Figure 48B shows the

populations obtained from this procedure compared to a reference distribution from

our long ASM runs and an equivalent amount of ST data started from both folded and

coil states. All these populations were determined using the previous state definition.

The populations from these short ASM runs were found to be in agreement with the

previously determined equilibrium distribution whereas the ST data deviated

significantly from equilibrium. To determine the limits of ASM we also performed the

same analysis using both fewer and shorter trajectories. First we held the seeding

trajectory length constant at 5 ns and varied the number of trajectories initiated from

each state, finding that as few as 70 trajectories from each state gave reasonable

agreement with the reference distribution. We also held the number of seeding

trajectories started from each state constant at 100 and varied the trajectory length,

finding that as little as 2 ns long seeding simulations gave reasonable agreement with

the reference. Thus, the ASM reaches equilibrium at least six times faster than

ST. These results demonstrate that the ASM is significantly more efficient than GE

simulations for sampling conformational changes that are diffusion controlled, as in

hairpin folding.

To address any concerns about the validity of our reference distribution, we

also studied a simple model where the equilibrium populations can be computed

analytically. The model is based on a discrete-state system introduced by

Zwanzig(252) as a simple model for protein folding (see Appendix H for details).

There are four metastable states in the system (folded, unfolded and two intermediate

states), among which the folded state is favored energetically, while the unfolded state

is favored entropically (see Figure 85). This is an attractive system for testing ASM

because it has non-Arrhenius folding kinetics, i.e. the folding rate decreases with increasing

temperature (see Figure 87).

We compared the efficiency with which ST and ASM reach the equilibrium

state populations as a function of the length and number of trajectories. As shown in

Figure 92, ASM converges to the correct distribution with 4-7 times shorter

simulations than ST. We suggest that seeding simulations longer than the slowest

intra-macrostate equilibration time should always be sufficient for convergence. In

practice, however, much shorter simulations may be used as discussed before. When

using shorter simulations one should test that independent sets of simulations started

from different configurations converge to the same distribution and that the

equilibrium distribution is invariant with respect to the lag time. We also found that

using more than 200 trajectories does not increase the efficiency of ST whereas ASM

continues to scale favorably with the number of trajectories up to 600 trajectories in

this example. The optimal number of simulations to run depends on one’s tolerance

for statistical error. Currently an equal number of simulations are seeded from each

state. In the future, however, adaptive sampling [31] could be used to start an optimal

number of simulations from each metastable state to further optimize the efficiency of

this method.

There are a number of factors contributing to the improved efficiency of ASM.

By using short GE simulations to identify the metastable states, the ASM is able to

exploit the ability of GE simulations to rapidly cross energetic barriers while avoiding

the penalty incurred at high temperatures for entropic barriers by using seeding

simulations at low temperatures. Furthermore, only short seeding simulations are

necessary because only local, not global, equilibration is required, due to the use of

MSMs. Global equilibration metrics like reversible folding require that each

simulation is long enough to cross every barrier multiple times. Local equilibration,

however, may be obtained with many short simulations run in parallel because each

run only has to be long enough to cross a single barrier. By using MSMs to identify

the metastable states we can initiate seeding simulations from uncorrelated

conformations within every metastable state and thereby ensure every barrier is

crossed.

The ASM also has limitations. For example, the initial sampling has to be

broad enough to identify all the metastable states. Failure to do so will quickly become

apparent as some states will be populated in one dataset but not in another. This

situation may be remedied by iterating the ASM: that is, seeding ST simulations from

each state to obtain broader sampling, building an MSM to identify the metastable

states, and performing new constant temperature seeding simulations. In addition,

seeding simulations at physiological temperatures are only able to cross barriers on the

order of a few kT. However, this should be sufficient for most biological systems.

Finally, we note that the random selection of initial configurations from each state

may lead to some error if the seeding simulations are not long enough. In the future,

this method might be improved by choosing initial configurations from an equilibrium

distribution prepared within each metastable state(54).

EXAMINING THE STATES

Figure 48B shows that the short folded ST data spent a disproportionate amount of

time in state 2 while the coil ST data spent a disproportionate amount of time in state

4. Based on this result, we hypothesized that state two is the native state and that state

four is a random coil state. To test this hypothesis we extracted representative

structures for each state. The representative structure for each state is the configuration

with the greatest density of nearby conformations (mathematically this is the

conformation with the minimal RMSD to every other conformation in the state).
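
One way to read this definition is as the medoid of the state. The sketch below assumes that "minimal RMSD to every other conformation" means the smallest summed pairwise RMSD, which the text does not specify; the function name and the small distance matrix are hypothetical values for illustration.

```python
def medoid_index(dist):
    """Index of the member with the smallest summed distance to all others."""
    totals = [sum(row) for row in dist]
    return min(range(len(dist)), key=totals.__getitem__)


# Hypothetical symmetric pairwise RMSD matrix (in Angstroms) for four
# conformations belonging to one metastable state.
rmsd = [
    [0.0, 1.0, 2.0, 4.0],
    [1.0, 0.0, 1.5, 3.0],
    [2.0, 1.5, 0.0, 2.5],
    [4.0, 3.0, 2.5, 0.0],
]

representative = medoid_index(rmsd)
```

Using the average instead of the sum would select the same conformation, since every row is divided by the same count.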

Figure 50 shows the representative structures for each state. In fact, state two is

the folded state, having a well-formed two base pair stem. Our ability to identify this

state without including any knowledge of the native state and the fact that it is the

most populated state lends credibility to the force field used, AMBER99(60).

Furthermore, state four is a random coil. The other states represent various collapsed

non-native states. For example, state 1 has native-like base stacking interactions but no

clear base pairing between the two sides of the stem. State 3 has interactions between

bases 1 and 8 as well as 2 and 7, but they are stacking interactions instead of base

pairing interactions. These results are consistent with both experimental and

computational work showing that small RNA hairpins have folding intermediates with

contacting end residues but without well-formed stems.(52, 222) Fully validating the

force field will require longer simulations to get accurate kinetic predictions and more

extensive comparisons with experimental observables.

Figure 50. Representative structure for each of the six metastable states. The numbering is the same as

in Figures 48 and 49.

CONCLUSIONS

We have introduced the Adaptive Seeding Method (ASM) and shown that it samples

significantly more efficiently than GE simulations, which have found widespread use

in studying biological systems(27, 28, 219, 237), for a simple 2D potential and an RNA

hairpin. The ASM takes advantage of the broad sampling possible with GE methods

but can more effectively cross entropic barriers using constant temperature

simulations. Moreover, by requiring local equilibration rather than global equilibration

only relatively short simulations are necessary and these simulations may be run in

parallel, rendering the calculation particularly well suited to modern computing

clusters. MSMs are then used to extract global equilibrium populations from these

short simulations. Besides serving as an efficient sampling algorithm, the ASM also

may be used to recover equilibrium properties from non-equilibrium datasets. Thus,

the ASM holds great promise for validating force fields and bridging the gap between

experimental and computational timescales.

In the future, we plan to apply the adaptive seeding method to larger systems.

We also hope to explore alternative sampling methods for identifying initial states. For

example, coarse-grained simulations could be used to identify the dominant states of a

system and seed all-atom MD simulations that would elucidate the atomic details of

the free energy surface. Alternatively, implicit solvent simulations run at low viscosity

could be used to rapidly identify the dominant states and seed explicit solvent

simulations to provide more accuracy. Finally, adaptive sampling(161) with longer

simulations may be used to obtain accurate kinetics from MSMs.

MATERIALS & METHODS

Two distinct sets of ST simulations were run: one started from a folded state and the

other from a random coil. An independent MSM was then built for each dataset to

identify the dominant metastable states. We used MSMBuilder (10) to build each MSM.

First, conformations were split into a large number of microstates using a

hierarchical K-medoids clustering algorithm with the all-heavy-atom RMSD as the

distance metric (e.g. we generated 1,597 microstates for long ASM seeding runs).

Kinetically related microstates were then lumped together using PCCA (6, 44, 45).

One hundred random conformations were then chosen from each state and used as

starting points for constant temperature 300K MD simulations, still maintaining two

distinct sets of simulations. New MSMs were built from these constant temperature

datasets. A Bayesian method (251) was used to calculate the populations of each state

with error bars and the models were compared based on these values. The original ST

simulations were also extended to match the sampling of the constant temperature

simulations. State populations with error bars for these long ST runs were computed

using bootstrapping and compared to the populations from the constant temperature

simulations. More details are available in the SI.

APPENDIX A: ESTIMATING TRANSITION MATRICES AND EQUILIBRIUM

DISTRIBUTIONS

Given our simulation data and assignments thereof to states, it is necessary to estimate

the transition probability matrix and the corresponding equilibrium distribution. We

have experimented with a number of such methods, all of which give results that are

similar to within error for this data set. However, this property should not be assumed

of other data sets a priori.

First, we show the standard method for estimating the transition probability

matrix T(τ) (or just T for simplicity). The entries of T are the probabilities of

transitions from state i to state j in time τ, that is, Tij = P(i→j). To estimate this, let Cij

= C(i→j) be the number of observed transitions from i to j. Then a reasonable estimate

(a maximum likelihood estimate) is Tij=Cij / Ci, where

$C_i = \sum_j C_{ij}$ (A1)

is the number of observed transitions starting in state i.
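
The standard estimate can be sketched in a few lines. The example assumes that transitions are counted with a sliding window over discrete state trajectories, which is one common choice and is not stated in the text; the function names and the two short trajectories are hypothetical.

```python
def count_transitions(trajectories, n_states, lag):
    """Count observed i -> j transitions at the given lag (in frames)."""
    C = [[0] * n_states for _ in range(n_states)]
    for traj in trajectories:
        for t in range(len(traj) - lag):
            C[traj[t]][traj[t + lag]] += 1
    return C


def estimate_T(C):
    """Maximum likelihood estimate T_ij = C_ij / C_i, with C_i the row sum."""
    T = []
    for row in C:
        total = sum(row)
        # A state never observed as a starting point gets a zero row here;
        # how to treat such states is left open in this sketch.
        T.append([c / total if total else 0.0 for c in row])
    return T


# Two made-up state trajectories over three states.
trajs = [[0, 0, 1, 1, 0], [1, 2, 2, 1]]
C = count_transitions(trajs, n_states=3, lag=1)
T = estimate_T(C)
```

Each row of T sums to one because the entries are transition probabilities out of state i.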

To estimate the equilibrium distribution of T, one merely has to find the

stationary eigenvector of T. Under ideal conditions (if the model is ergodic and

irreducible) (253), the stationary eigenvector e is unique and can easily be computed

by repeated multiplication of some initial probability density by T, as in Equation A1.

Similarly, one could use standard eigenvalue routines to find the eigenvector

corresponding to an eigenvalue of 1.

A possible problem with the standard estimate for T is that the resulting model

might not satisfy detailed balance

$e_i T_{ij} = e_j T_{ji}$ (A2)

where ei is the equilibrium probability of state i. The naïve solution to this is to

symmetrize the count matrix by adding its transpose, which amounts to including the

counts that would have arisen from viewing the simulations in reverse. Clearly this

procedure is inappropriate for situations not at equilibrium; nonetheless, we sometimes

find this procedure useful for equilibrium data due to its ease. Furthermore, if the

underlying count matrix is symmetric, one can show that the equilibrium distribution

can be obtained simply by dividing the number of observations in each state by the

total number of observations.
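
The symmetrization procedure and the resulting shortcut for the equilibrium distribution can be checked numerically. The count matrix below is made up for illustration and the function names are hypothetical; the sketch uses the row-stochastic convention Tij = P(i→j) of this appendix.

```python
def symmetrize(C):
    """Add the transpose, i.e. include the time-reversed transitions."""
    n = len(C)
    return [[C[i][j] + C[j][i] for j in range(n)] for i in range(n)]


def row_normalize(C):
    """T_ij = C_ij / C_i for a count matrix with no empty rows."""
    return [[c / sum(row) for c in row] for row in C]


def populations_from_counts(C_sym):
    """Observations in each state divided by the total number of observations."""
    row_sums = [sum(row) for row in C_sym]
    total = sum(row_sums)
    return [s / total for s in row_sums]


# Made-up raw count matrix for three states.
C = [
    [90, 8, 2],
    [4, 50, 6],
    [3, 5, 30],
]
C_sym = symmetrize(C)
T = row_normalize(C_sym)
e = populations_from_counts(C_sym)

# Stationarity check: (e T)_j should reproduce e_j.
eT = [sum(e[i] * T[i][j] for i in range(3)) for j in range(3)]
```

With symmetric counts, e_i T_ij reduces to the symmetric count divided by the total, so detailed balance (Equation A2) holds by construction.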

A somewhat more complicated procedure to ensure reversibility is using a

maximum likelihood estimate constrained to the set of models satisfying detailed

balance. To achieve this, assume that we are given the observed count matrix C. By

exploiting the equivalence between this count matrix and a random walk on an edge-

weighted undirected graph (160), we then estimate an additional count matrix, X,

which we require to be symmetric. We compute X by maximizing the likelihood of X

given C; this assumption gives a set of equations that allow the self-consistent

calculation of X. More formally, if C is the observed counts, and X is a symmetric

matrix that approximates C, then the likelihood is

$L(X \mid C) = \prod_{i,j} \left( \frac{X_{ij}}{X_i} \right)^{C_{ij}}$ (A3)

Maximizing the likelihood yields the following equation, which we solve by self-

consistent iteration,

$X_{ij} = \frac{C_{ij} + C_{ji}}{\frac{C_i}{X_i} + \frac{C_j}{X_j}}$ (A4)

where Ci and Xi are defined as the row sums of C and X, respectively, as in Equation

A1. In our experience, this method works but it can be slow for the large matrices we

consider. Furthermore, statistical noise in the count data can dominate the resulting

equilibrium distribution and even cause the self-consistent iterations to diverge.
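
A minimal sketch of the self-consistent iteration of Equation A4 is given below. It assumes the iteration is started from the symmetrized counts and run for a fixed number of sweeps; both choices, the function names, and the small count matrix are illustrative assumptions. A practical implementation would need a convergence test and a guard against the divergence mentioned above.

```python
def a4_update(C, X):
    """One sweep of Equation A4: X_ij = (C_ij + C_ji) / (C_i/X_i + C_j/X_j)."""
    n = len(C)
    C_row = [sum(row) for row in C]  # C_i
    X_row = [sum(row) for row in X]  # X_i
    return [
        [
            (C[i][j] + C[j][i]) / (C_row[i] / X_row[i] + C_row[j] / X_row[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def reversible_mle(C, n_sweeps=5000):
    """Iterate Equation A4, starting from the symmetrized count matrix."""
    n = len(C)
    X = [[float(C[i][j] + C[j][i]) for j in range(n)] for i in range(n)]
    for _ in range(n_sweeps):
        X = a4_update(C, X)
    return X


# Made-up count matrix with strictly positive entries.
C = [
    [90, 8, 2],
    [4, 50, 6],
    [3, 5, 30],
]
X = reversible_mle(C)
X_row = [sum(row) for row in X]
T_rev = [[X[i][j] / X_row[i] for j in range(3)] for i in range(3)]
e = [x / sum(X_row) for x in X_row]
```

Because X stays symmetric at every sweep, the transition matrix built from it satisfies detailed balance with respect to e, whether or not the iteration has fully converged.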

A final method is that of Bacallado et al. (254), which uses Bayesian inference

with a prior on the space of matrices satisfying detailed balance. This method is

formally the most sound, as it uses Bayesian inference and includes a powerful prior.

However, it is much more computationally demanding than the other methods. Thus,

this method was also applied to the data in order to assess the validity of the simpler

methods.

We find that the four methods mentioned above give similar results for the

underlying equilibrium distribution of this dataset, indicating that we have achieved

equilibrium sampling. As such, we have used the naïve method of symmetrizing the

count matrix because it is computationally efficient (and because we have so much data

that our data set is very close to equilibrium). However, in general, we

stress that either the maximum likelihood or Bayesian methods should be used.

APPENDIX B: THE POSSIBILITY OF LONGER TIMESCALES THAN THE

IMPLIED TIMESCALES

Here we show a simple model demonstrating that the rates for transitioning between

some states in an MSM under a two-state assumption (as used in the maximum

likelihood approach of Ensign et al. (58)) may be slower than the implied timescales.

First we define a four state system that satisfies detailed balance

$$T(\tau) = \begin{pmatrix}
0.949 & 0.050 & 0.001 & 0.000 \\
0.001 & 0.949 & 0.000 & 0.050 \\
0.001 & 0.000 & 0.998 & 0.001 \\
0.000 & 0.001 & 0.001 & 0.998
\end{pmatrix}$$

This system is depicted in Figure 51A.

The eigenvalues of this system are 1, 0.997, 0.95559, and 0.94141 and

we will assume a lag time of 1 in arbitrary units. Thus, disregarding the eigenvalue of

one corresponding to the equilibrium distribution, there are three implied timescales:

332.785, 22.0139, and 16.5627.
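
The conversion from eigenvalues to implied timescales can be reproduced with the usual relation t = -τ / ln(λ), where τ is the lag time. The sketch below uses the rounded eigenvalues quoted above, so the results agree with the quoted timescales only to about three or four significant figures; the function name is hypothetical.

```python
import math


def implied_timescales(eigenvalues, lag_time):
    """t_k = -lag_time / ln(lambda_k) for every eigenvalue strictly below 1."""
    return [-lag_time / math.log(lam) for lam in eigenvalues if 0.0 < lam < 1.0]


eigenvalues = [1.0, 0.997, 0.95559, 0.94141]  # rounded values from the text
timescales = implied_timescales(eigenvalues, lag_time=1.0)
```

The eigenvalue of one is skipped because it corresponds to the equilibrium distribution and has no finite timescale.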

We can write the probability of transitioning between two states as

$p = 1 - e^{-1/\omega}$ (B1)

where ω is the average timescale for the transition (this notation deviates from the

standard notation of τ but avoids confusion with the lag time). Rearranging, we find

$\omega = -\frac{1}{\ln(1 - p)}$ (B2)

Plugging our transition probabilities into this equation we arrive at the average

timescales for transitioning between each pair of states shown in Figure 51B. Many of

these timescales are as high as 1,000 units, much greater than the largest implied

timescale of ~332 units. In principle, one could monitor these average timescales,

resulting in apparent timescales longer than the implied timescales of the system.
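
The average pairwise timescales follow directly from the two-state relation ω = -1 / ln(1 - p) with a lag time of 1. The sketch below evaluates it for the two distinct nonzero off-diagonal probabilities that appear in the model matrix; the function name is hypothetical.

```python
import math


def mean_transition_time(p):
    """Two-state relation with a lag time of 1: omega = -1 / ln(1 - p)."""
    return -1.0 / math.log(1.0 - p)


slow = mean_transition_time(0.001)  # edges with probability 0.001
fast = mean_transition_time(0.050)  # edges with probability 0.050
```

The edges with probability 0.001 give a timescale of roughly 1,000 units, about three times the largest implied timescale.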

Figure 51. Graph depiction of the model system defined in Appendix B with edges labeled by A) their

probability and B) their average timescale under a two-state assumption.

APPENDIX C: SUPPORTING INFORMATION FOR CHAPTER 3

MOLECULAR DYNAMICS SIMULATION

Distributed molecular dynamics simulations on GPUs were performed using an

accelerated version of GROMACS (255) written specifically for GPUs (80) using the

Folding@Home platform (79). The AMBER ff96 (60) forcefield was used with the

generalized Born/surface area (GBSA) implicit solvent model of Onufriev, Bashford

and Case (81). AMBER ff96 has been reported to have more accurate secondary

structure propensities when used with GBSA (82). Up to 10,000 parallel simulations

(each with randomized Boltzmann-distributed initial velocities) were simulated at

300K, 330K, 370K and 450K, from several different initial starting states. Due to the

nature of distributed computing, in which uncoupled simulations are used to produce

successive trajectory segments, a broad distribution of trajectory lengths is obtained

(see Figure 11b in the main text). Stochastic integration was performed using a time

step of 2 fs and Berendsen temperature coupling. A water-like solvent (shear)

viscosity of 91 ps^-1 was used, with full O(N^2) electrostatic and vdW interactions.

Hydrogen bond lengths were constrained using the SHAKE algorithm. Trajectory

snapshots were recorded every 1 ns.

Starting conformations for the native state of NTL9 were taken from the

crystal structure 1DIV, and steepest-descent minimized for 5000 steps. (Minimization

was done using the GBSA model of Still et al. (256)) Five starting conformations for

the random coil ensemble were taken from snapshots of Monte Carlo trajectories in

which dihedral angles were randomized under a potential rewarding compact Rg. The

dihedral probabilities came from the TOP500 database (257). Starting conformations

for extended structures were constructed by setting dihedral angles to their canonical

values.

MARKOV STATE MODEL (MSM) CONSTRUCTION

We used the MSMBuilder package (10), modified to use sparse matrices (4), to

construct an MSM for NTL9(1-39). First, 100,000 microstates were generated by

clustering conformations separated by 10 ns. The remaining 90% of the data was then

assigned to these clusters. The resulting microstates had an average radius of ~4.5 Å,

where the radius of a cluster is defined as the largest distance between any

conformation in that cluster and the cluster center. The implied timescales were then

calculated for lag times from 1 to 32 ns at 4 ns intervals and found to level off at ~12

ns (Figure 52a), implying a 12 ns Markov time. Finally, we generated a macrostate

model (Figure 53) by lumping microstates into 2,000 macrostates using the PCCA+

algorithm (44) and verified that the implied timescales still leveled off on a similar

timescale (Figure 52b).

We have confirmed the statistical accuracy of our equilibrium populations for

the 2,000 state model using a Bayesian method (251). This analysis reveals that the

statistical uncertainty in the population of any state ranges from 0.2% to 2% of that

state's population (0.7% on average). Unfortunately, it is not possible to rigorously

address any systematic error in our model without an independent data set to compare

to.

TRANSITION PATHWAY THEORY (TPT) ANALYSIS

Many of our calculations are modeled on those in (3, 87, 258). However, we have

chosen a slightly different algorithm for decomposing the reactive flux into individual

pathways. Given a folded state B, an unfolded state A, and the matrix of net reactive

flux F, our greedy backtracking decomposition works as follows:

1. Start at the folded state B. Label this state x1.

2. Choose the state x2 whose net flux into x1 is maximal.

3. Repeat this process for each state xn-1, choosing the next state xn such that the net flux from xn into xn-1 is maximal, until we reach state xn = A.

Upon completion, we have produced a series of states (x1, … , xn) defining a pathway, listed in reverse order from B back to A. We define the flux along this pathway as the minimum of the fluxes along its edges, min(F(xi+1 → xi)). We then subtract this flux from each of the pathway's edges in the original flux matrix. Finally, we repeat the same algorithm on the new flux matrix to produce additional pathways.

The result of this algorithm is a set of pathways and their associated fluxes.
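A minimal Python sketch of this greedy backtracking decomposition is given below. It is an illustration rather than the code used for the analysis: the function name, the `max_paths` and `tol` parameters, and the toy four-state flux matrix are assumptions. The sketch backtracks from the folded state along the largest incoming net flux, then reports each pathway in the forward (unfolded to folded) direction.

```python
import numpy as np

def greedy_backtrack_paths(net_flux, source, sink, max_paths=10, tol=1e-12):
    """Decompose a net-flux matrix into pathways by backtracking from the sink.

    net_flux[i, j] is the net reactive flux from state i to state j.
    Returns a list of (path, flux) with each path ordered source -> sink.
    """
    F = np.array(net_flux, dtype=float)
    paths = []
    for _ in range(max_paths):
        path = [sink]
        visited = {sink}
        while path[-1] != source:
            incoming = F[:, path[-1]].copy()
            incoming[list(visited)] = 0.0   # safety guard: never revisit a state
            prev = int(np.argmax(incoming))
            if incoming[prev] <= tol:
                break                       # no remaining flux into this state
            path.append(prev)
            visited.add(prev)
        if path[-1] != source:
            break                           # all reactive flux has been assigned
        path.reverse()
        edges = list(zip(path[:-1], path[1:]))
        flux = min(F[i, j] for i, j in edges)   # bottleneck flux of the pathway
        for i, j in edges:
            F[i, j] -= flux                 # remove this pathway from the matrix
        paths.append((path, flux))
    return paths

# Toy example: state 0 is unfolded (A), state 3 is folded (B).
F_example = np.zeros((4, 4))
F_example[0, 1], F_example[1, 3] = 0.6, 0.6
F_example[0, 2], F_example[2, 3] = 0.4, 0.4
pathways = greedy_backtrack_paths(F_example, source=0, sink=3)
```

Because each pass removes the bottleneck flux of one pathway, at least one edge is zeroed per iteration and the loop ends once no flux reaches the folded state.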

STRUCTURAL ANALYSIS OF MACROSTATE ENSEMBLES

Because macrostate conformational ensembles can be somewhat heterogeneous and

diffuse, we used a metric that quantifies the extent of native-like structure without

using predetermined reaction coordinates or requiring artificial thresholds for native

contacts, which we call the Q-value.

For each macrostate, we define a vector c(x) indexed by x = (i, j), denoting a

contact between residues i and j. The entries of c(x) are continuous (non-integer)

values between 0 and 1, representing the fraction of the ensemble for which the alpha-

carbons of residues i and j are closer than 8Å. We will call c(x) a contact profile. We

define the Q-value of a given c as its projection onto the contact profile of the “native”

macrostate (state n), cnat.

The Q-value for the “native” macrostate (state n) is unity, and less native-like contact

profiles will have lower Q-values. Because a contact profile can only contain entries

between 0 and 1, Q is never negative.
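The contact profile and Q-value can be illustrated with a short sketch. The defining equation for Q did not survive in this copy of the text, so the normalization below (dividing by the squared norm of the native profile, which makes the native Q-value exactly one) is an assumption, as are the function names and the toy coordinates.

```python
import numpy as np

def contact_profile(ca_coords, cutoff=8.0):
    """Fraction of an ensemble in which each residue pair's alpha-carbons are within cutoff.

    ca_coords: array of shape (n_conformations, n_residues, 3), in angstroms.
    Returns a flat vector with one entry per residue pair (i < j).
    """
    diffs = ca_coords[:, :, None, :] - ca_coords[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    frac = (dists < cutoff).mean(axis=0)
    i, j = np.triu_indices(ca_coords.shape[1], k=1)
    return frac[i, j]

def q_value(c, c_nat, mask=None):
    """Projection of c onto c_nat, scaled so that q_value(c_nat, c_nat) == 1 (assumed form).

    mask optionally restricts the profile to a subset of contacts.
    """
    if mask is not None:
        c, c_nat = c[mask], c_nat[mask]
    return float(np.dot(c, c_nat) / np.dot(c_nat, c_nat))

# Toy ensembles: a compact 5-residue chain (3.8 A spacing) and a stretched one.
compact = np.zeros((4, 5, 3))
compact[:, :, 0] = 3.8 * np.arange(5)
stretched = np.zeros((4, 5, 3))
stretched[:, :, 0] = 10.0 * np.arange(5)
c_nat = contact_profile(compact)
q_native = q_value(c_nat, c_nat)
q_stretched = q_value(contact_profile(stretched), c_nat)
```

Passing a boolean `mask` over residue pairs restricts the profile to a subset of contacts, which is how element-specific Q-values could be computed under the same assumption.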

Moreover, we also define Q-values for particular structural elements by

restricting a contact profile to a particular subspace of contacts. For example, Qβ12 is

the Q-value when c is restricted to the subspace where x ∈ β12, a set of contacts

corresponding to pairings between beta-strands β1 and β2. We examined Q-values for

three native structural elements: Qα, Qβ12, and Qβ13, based on the subsets of the

“native” (state n) contact profile (Figure 54). For clarity, we call the Q-value for the

entire set of contacts Qtotal.

ANALYSIS OF STATES ALONG FOLDING PATHWAYS: COMPARISON

BETWEEN SECONDARY STRUCTURE FORMATION AND REACTION

PROGRESS (PFOLD)

How heterogeneous are the possible pathways for folding? One way to examine this

question is to compare the secondary structure formed in a given state versus its

position along the reaction pathway. In Figure 55 and Figure 56, we use a simple

metric to plot the secondary structure bias, namely the difference between alpha

helical and beta sheet contacts Qα – (Qβ12 + Qβ13)/2 of a given state and compare this

to the position of the state along the reaction pathway as determined by its committor

or pfold value. From these figures, it can be seen that 1) the “unfolded” state (a)

contains residual native-like helical propensity, and 2) pathways involving various

orderings of native-like helix and sheet formation are possible.

The contact profiles (see Figure 55) for these states demonstrate the existence

of non-native contacts in some states as well as the fact that certain contacts are

present more commonly in a given state. For example, states h, i, j, and k all have a

mixture of some contacts which are very prevalent (dark black) and some which are

only partially formed (light gray to gray), whereas state g has fewer contacts, but all

prevalent. The nature of the heterogeneity even within a state highlights the ensemble

nature of this form of analysis, as well as the degree to which a given state is

structured (and in which parts).

Finally, we find a variety of degrees of structure in a given state for the natural

independent folding units found (i.e. the alpha helix, β12, and β13). This is shown in

Figure 57, where we see a significant diversity across states and pathways in terms

of the secondary structure formed.

HOW DOES NTL9 FOLD IN OUR SIMULATIONS?

In order to understand how NTL9 folds, a natural approach is to analyze the pathways

found in terms of existing theories for protein folding. The highest-flux pathways in

our mesoscopic model are a→m→n and a→l→n. Both pathways are direct routes

from disordered to highly-structured macrostates, reminiscent of a nucleation-

condensation mechanism (259). This picture is consistent with the cooperative two-

state kinetics observed in stopped-flow re-folding experiments (78). While these

pathways show concomitant formation of helix and hairpin structures, the intervening

states l and m differ (mostly) in the β12 hairpin registration (see Figure 57). The large

pfold values of states l and m, and their obligate presence in the two highest-flux

pathways from a to n, suggest that to some extent, the states l, m and n can be

considered a very native-like “molten-globule”, in which the details of tertiary

arrangement are sorted out after overcoming the main barrier to folding. Kinetics

between such metastable states would be difficult to detect experimentally using a

single fluorescent reporter, and in the nucleation-condensation view such events might

be described using the encompassing term “condensation”.

At the same time, the structural diversity along the folding pathways we

analyze corresponds well with many models of hierarchical folding. In general, the

macrostates with low pfold values have a baseline of native helicity, with the full

extent of native beta-sheet structure occurring later in the folding reaction (Figure 56).

This is consistent with the idea that local structures such as helices form early, with

non-local structures such as beta-hairpins and beta-sheets forming later in the reaction.

Macrostates b through f (which have low pfold values and are involved early in the

folding reaction) contain a variety of distinct non-native structural elements,

particularly non-native hairpin and sheet arrangements (see Figure 13, and Figure 55).

This is reminiscent of hierarchical mechanisms such as diffusion-collision (260) where

competing ‘foldons’ (86) form as kinetically metastable units, and are cooperatively

stabilized when in a native-like arrangement. The heterogeneous sequences of

secondary structure formation in pathways a→h→k→m→n (in which the central helix

forms first) versus pathway a→g→l→n (in which the hairpin structure forms first)

suggest that independent folding units can form and coalesce in any order.

We stress that there need not be a single pathway or single, dominant

mechanism for folding. Moreover, the various theories proposed for how proteins fold,

such as a diffusion-collision or nucleation-condensation mechanism, are based on

physical principles broadly relevant for proteins. Therefore, it is natural to imagine

that multiple mechanisms could be simultaneously present, but that the sequence of

the protein, coupled with the chemical environment (solvent conditions, temperature,

pH, etc), would control the balance of the degree to which each mechanistic pathway

is seen.

Figure 52. (a) Implied timescales for a series of 100,000-microstate Markov State Models (MSMs)

built at lag times between 1 and 32 ns. As the longest timescale levels off beyond a lag time of 10

ns, a lag time of 12 ns was chosen to build subsequent MSMs. The spectral gap present at all lag

times indicates apparent two-state folding kinetics. (b) The implied timescales for a 2000-

macrostate model built by lumping states from the microstate MSM show a similar spectral gap

and leveling off of time scales. The faster implied timescales of the macrostate model at short lag

times are due to lumping effects. (c) The 10 slowest implied timescales for the 2000 state models,

with error analysis from a bootstrapping procedure. Error bars represent the standard deviation

from the bootstrap analysis.

Figure 53. A scatter plot of the 2000 macrostates obtained by lumping the 100,000-state MSM

calculated from the simulation data at 370K. The RMSD-to-native is calculated using the peptide

backbone residues, with respect to the native starting state. The free energy of each microstate i is

computed as –kT ln (pi /p0), where pi is the equilibrium probability of the microstate, and p0 is an

arbitrary reference (in this case, max(pi)). Shown in red are the 14 macrostates transited by the top

ten pathway fluxes, labeled with the same letters as in Figure 13. In this mesoscopic view, we find

that 1) the macrostates are diffuse collections of conformational states, 2) there are multiple folding

pathways along these metastable states, and 3) we can identify highly populated “native” (state n)

and “unfolded” (state a) macrostates that dominate the observed relaxation rates. The red arrow is

meant to guide the eye in illustrating a “mesoscopic” view of the transition state barrier: the

“unfolded” state (a) and “native” state (n) are at free energy minima, while intermediate RMSD

values have macrostates with higher free energies.

Figure 54. Contact profile subspaces used to calculate Qβ12, Qβ13, and Qα, which quantify the extent of

native-like structuring for beta-strand 1 and 2 pairing, beta-strand 1 and 3 pairing, and helix

formation, respectively.

Figure 55. Here, contact profiles (see definition above) for the 14 macrostates involved in the top ten

folding pathways are plotted in a similar fashion to Figure 55. For clarity, the pathway arrows have

been removed. Each contact profile is a 39 x 39 matrix of inter-residue contacts, showing the

contact fraction on a linear grayscale from 0 (white) to 1 (black).

Figure 56. Here, values of Qα (yellow), Qβ12 (red), and Qβ13 (blue) are plotted in a bar graph for each of

the 14 macrostates involved in the top ten folding pathways. The layout is in a similar fashion to

Figure 56.

Figure 57. Macrostates l, m and n (the “native” state) have very similar structural ensembles and similar

pfold values (pfold > ~0.93). To examine the subtle differences in their macrostate contact profiles,

we computed difference contact profiles for (l-m), (n-l) and (n-m) transitions. These difference

maps reveal that these states differ mostly in their hairpin registrations and packing of the hairpin

loop.

APPENDIX D: SUPPORTING INFORMATION FOR CHAPTER 4

VILLIN MSM

LUMPING INTO MACROSTATES

To identify metastable states in villin, we lumped kinetically related microstates into

500 macrostates (all having self-transition probabilities >0.5) using the PCCA+

algorithm (20, 261). Figure 58 shows the implied timescales for this macrostate MSM.

While they are somewhat shorter than at the microstate level (4), their leveling off at

lag times of 10-15 ns indicates that the model is Markovian at these timescales (6, 34).

Mean first passage times (MFPTs) between pairs of states can then be

calculated as in Ref (161). The equilibrium probability of each state can be obtained

by normalizing the first left eigenvector (the one with eigenvalue 1) of the transition probability matrix. Finally, the

relative entropy between two MSMs is calculated as in Ref (18)

$$D(P \| Q) = \sum_{i,j}^{N} P_i P_{ij} \log \frac{P_{ij}}{Q_{ij}}$$

where Pi is the equilibrium probability of state i, Pij is the probability of transitioning

from state i to state j during one lag time, N is the number of states, P is the reference

model, and Q is a test model (in this case generated from a subset of the data).
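As a hedged illustration of the relative entropy between a reference model P and a test model Q, the sketch below computes the equilibrium populations of P from its left eigenvector and then evaluates the sum over transitions. The function names and the two-state example matrices are assumptions; in practice a test model built from limited data may need pseudocounts so that no Qij is zero where Pij is not.

```python
import numpy as np

def stationary_distribution(T):
    """Left eigenvector of a row-stochastic matrix with eigenvalue 1, normalized to sum to 1."""
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    return pi / pi.sum()

def relative_entropy(P, Q):
    """D(P||Q) = sum_ij P_i P_ij log(P_ij / Q_ij), with P_i the equilibrium populations of P."""
    pi = stationary_distribution(P)
    mask = P > 0                      # terms with P_ij = 0 contribute nothing
    weights = (pi[:, None] * P)[mask]
    return float(np.sum(weights * np.log(P[mask] / Q[mask])))

P_ref = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
Q_test = np.array([[0.8, 0.2],
                   [0.3, 0.7]])
pi_ref = stationary_distribution(P_ref)
d_same = relative_entropy(P_ref, P_ref)
d_diff = relative_entropy(P_ref, Q_test)
```

The relative entropy is zero when the two models agree and grows as the test model's transition probabilities drift from the reference.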

Figure 64 shows the relative entropy for varying numbers of simulations (up to

40,000) of a given length (up to 400 nanoseconds). This figure highlights the fact that

too small numbers of too short simulations are less valuable than a single long

simulation with an equivalent aggregate amount of data but that the simulation length

at which this breakdown occurs decreases for increasing numbers of simulations.

BARRIER HEIGHTS

As further confirmation of the relevance of our macrostates, we have developed a simple Bayesian approach for estimating the free energy barrier heights between states under the simplifying assumption that we can examine pairs of connected states independently. To begin with, we make the following definitions:

C: transition count matrix, with Cij giving the number of observed i→j transitions

Ci: the number of counts originating in state i

P: transition probability matrix obtained by normalizing each row of C (Pij = Cij / Ci)

ΔG‡ij: barrier height for i→j transitions

νij: attempt frequency for i→j transitions

nij: number of attempted i→j transitions

kij: rate of i→j transitions

kB: Boltzmann’s constant

T: temperature (300 K for this work)

τ: lag time of the MSM

The quantity we wish to obtain is the posterior distribution over barrier heights given our data (the count matrix, C). However, to account for the attempt frequency we must begin with

$$P(\Delta G^{\ddagger}_{ij}, \nu_{ij} \mid C) = \frac{P(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij})\, P(\Delta G^{\ddagger}_{ij}, \nu_{ij})}{P(C)}$$

where the equality comes from applying Bayes’ rule. We can then integrate out the attempt frequency to obtain

$$P(\Delta G^{\ddagger}_{ij} \mid C) = \frac{\int P(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij})\, P(\Delta G^{\ddagger}_{ij}, \nu_{ij})\, d\nu_{ij}}{P(C)}$$

We now assume that the barrier height and attempt frequency are independent and assign them each a uniform prior.

$$P(\Delta G^{\ddagger}_{ij}, \nu_{ij}) = P(\Delta G^{\ddagger}_{ij})\, P(\nu_{ij})$$

$$P(\Delta G^{\ddagger}_{ij}) = c$$

$$P(\nu_{ij}) = c_{\nu}$$

where c and cν are constants.

We can also put bounds on the attempt frequency from our observed data by recognizing that the number of attempts is at least as many crossing events as were observed and no greater than the number of observed crossings plus the number of self-transitions.

$$\frac{C_{ij}}{C_i \tau} \le \nu_{ij} \le \frac{C_{ij} + C_{ii}}{C_i \tau}$$

where the denominator gives the total time spent in state i and the numerator gives the number of attempts at transitioning from state i to state j. Using nij to denote the number of attempts at transitioning from state i to state j, we can also write

$$\nu_{ij} = \frac{n_{ij}}{C_i \tau}$$

Using these priors and bounds, we obtain

$$P(\Delta G^{\ddagger}_{ij} \mid C) = \frac{\int P(C \mid \Delta G^{\ddagger}_{ij}, \nu_{ij})\, P(\Delta G^{\ddagger}_{ij})\, P(\nu_{ij})\, d\nu_{ij}}{P(C)} = \frac{c\, c_{\nu}}{P(C)} \sum_{n_{ij} = C_{ij}}^{C_{ij} + C_{ii}} P(C \mid \Delta G^{\ddagger}_{ij}, n_{ij})$$

Given a particular value of the attempt frequency (or equivalently, the number of attempts) we can write

$$P(C \mid \Delta G^{\ddagger}_{ij}, n_{ij}) = \binom{n_{ij}}{C_{ij}} P_{ij}^{C_{ij}} \left(1 - P_{ij}\right)^{n_{ij} - C_{ij}}$$

where $\binom{n_{ij}}{C_{ij}}$ denotes “n choose c”. Using the simple rate equation

$$P_{ij} = 1 - \exp(-k_{ij} \tau)$$

and, from transition state theory,

$$k_{ij} = \nu_{ij} \exp\left(-\frac{\Delta G^{\ddagger}_{ij}}{k_B T}\right)$$

we finally obtain

$$P(C \mid \Delta G^{\ddagger}_{ij}, n_{ij}) = \binom{n_{ij}}{C_{ij}} \left[1 - \exp\left(-\frac{n_{ij}}{C_i} e^{-\Delta G^{\ddagger}_{ij} / k_B T}\right)\right]^{C_{ij}} \left[\exp\left(-\frac{n_{ij}}{C_i} e^{-\Delta G^{\ddagger}_{ij} / k_B T}\right)\right]^{n_{ij} - C_{ij}}$$

The denominator, P(C), is then obtained by normalization.
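The posterior can be evaluated numerically on a grid of barrier heights, as in the sketch below. This is an assumption-laden illustration, not the original implementation: barrier heights are expressed in units of kBT, the grid bounds play the role of the support of the uniform prior, the counts in the example are invented, and the function names are hypothetical. The second function evaluates the closed-form barrier obtained by setting Pij = Cij / Ci in the rate equations above for a chosen number of attempts.

```python
import numpy as np
from math import lgamma

def log_binom(n, c):
    """Log of the binomial coefficient 'n choose c'."""
    return lgamma(n + 1) - lgamma(c + 1) - lgamma(n - c + 1)

def barrier_posterior(C_ij, C_ii, C_i, dG_grid):
    """Posterior density over barrier height (units of k_B T) on a uniform grid.

    Sums the binomial likelihood over n_ij = C_ij ... C_ij + C_ii attempts,
    working in log space to avoid overflow for large counts.
    """
    dG = np.asarray(dG_grid, dtype=float)
    log_terms = []
    for n in range(max(C_ij, 1), C_ij + C_ii + 1):
        x = (n / C_i) * np.exp(-dG)        # k_ij * tau for this number of attempts
        log_p = np.log(-np.expm1(-x))      # log P_ij, with P_ij = 1 - exp(-x)
        # log(1 - P_ij) equals -x exactly, so the failure term is -(n - C_ij) * x.
        log_terms.append(log_binom(n, C_ij) + C_ij * log_p - (n - C_ij) * x)
    log_terms = np.array(log_terms)
    post = np.exp(log_terms - log_terms.max()).sum(axis=0)
    step = dG[1] - dG[0]
    return post / (post.sum() * step)      # normalization plays the role of P(C)

def barrier_point_estimate(C_ij, C_i, n_ij):
    """Barrier (units of k_B T) from P_ij = C_ij / C_i at a fixed number of attempts."""
    return float(np.log(-n_ij / (C_i * np.log1p(-C_ij / C_i))))

grid = np.linspace(-2.0, 10.0, 1201)
posterior = barrier_posterior(C_ij=5, C_ii=95, C_i=100, dG_grid=grid)
expected_barrier = float(np.sum(grid * posterior) * (grid[1] - grid[0]))
lower = barrier_point_estimate(5, 100, n_ij=5)      # fewest possible attempts
upper = barrier_point_estimate(5, 100, n_ij=100)    # most possible attempts
```

The expected barrier height for a transition is the mean of this gridded posterior; because the priors are uniform, the choice of grid limits affects the result and should be reported alongside it.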

Using these equations gives a posterior distribution of barrier heights for every possible transition. Since there are thousands of possible transitions, it is impractical to examine them all. Instead, we have calculated the expected barrier height for each transition. The mean expected barrier height is 5.9 (+/- 2.5) kT, indicating that most of the states are potentially detectable (separated by reasonable barriers) and that the distribution of barrier heights is quite broad.

As a consistency check, it is also possible to solve for the barrier height in terms of the observed counts and attempt frequency.

$$\Delta G^{\ddagger}_{ij} = k_B T \ln\left[\frac{-n_{ij}}{C_i \ln\left(1 - \frac{C_{ij}}{C_i}\right)}\right]$$

One can then plug in the lower and upper bounds for the attempt frequency and ensure that these encompass the Bayesian result.

SIMPLE MODELS

TRANSITION COUNT MATRICES

The transition count matrices for simple models S, P, and H (CS, CP, and CH respectively) are

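$$C_S = \begin{pmatrix}
6{,}000 & 3 & 0 & 0 & 0 & 0 \\
3 & 1{,}000 & 3 & 0 & 0 & 0 \\
0 & 3 & 1{,}000 & 3 & 0 & 0 \\
0 & 0 & 3 & 1{,}000 & 3 & 0 \\
0 & 0 & 0 & 3 & 1{,}000 & 3 \\
0 & 0 & 0 & 0 & 3 & 90{,}000
\end{pmatrix}$$

$$C_P = \begin{pmatrix}
6{,}000 & 2 & 2 & 0 & 0 & 0 \\
2 & 1{,}000 & 0 & 2 & 2 & 0 \\
2 & 0 & 1{,}000 & 2 & 2 & 0 \\
0 & 2 & 2 & 1{,}000 & 0 & 2 \\
0 & 2 & 2 & 0 & 1{,}000 & 2 \\
0 & 0 & 0 & 2 & 2 & 90{,}000
\end{pmatrix}$$

$$C_H = \begin{pmatrix}
3{,}500 & 2 & 0 & 0 & 0 & 0 \\
2 & 1{,}000 & 2 & 0 & 0 & 2 \\
0 & 2 & 1{,}000 & 2 & 0 & 2 \\
0 & 0 & 2 & 1{,}000 & 2 & 2 \\
0 & 0 & 0 & 2 & 3{,}500 & 0 \\
0 & 2 & 2 & 2 & 0 & 90{,}000
\end{pmatrix}$$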

where the entry in row i and column j gives the number of transitions observed from

state i to state j. State 0 is unfolded, 1-4 are intermediates, and 5 is the native state.

To generate synthetic simulations from a transition count matrix we first

normalize each row to obtain a transition probability matrix. At each time step (or

each lag time), the next state is chosen according to the distribution of transition

probabilities for the current state.
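Generating a synthetic trajectory from a count matrix can be sketched as follows, using the count matrix for model S as read from the text above. The function names, the random seed, and the trajectory length are assumptions made for illustration.

```python
import numpy as np

# Transition count matrix for simple model S (state 0 unfolded, state 5 native).
C_S = np.array([
    [6000,    3,    0,    0,    0,     0],
    [   3, 1000,    3,    0,    0,     0],
    [   0,    3, 1000,    3,    0,     0],
    [   0,    0,    3, 1000,    3,     0],
    [   0,    0,    0,    3, 1000,     3],
    [   0,    0,    0,    0,    3, 90000],
], dtype=float)

def simulate(counts, start, n_steps, rng):
    """Draw a discrete-state trajectory from the row-normalized count matrix."""
    T = counts / counts.sum(axis=1, keepdims=True)
    n_states = len(T)
    traj = np.empty(n_steps + 1, dtype=int)
    traj[0] = start
    for t in range(n_steps):
        # Next state is sampled from the transition probabilities of the current state.
        traj[t + 1] = rng.choice(n_states, p=T[traj[t]])
    return traj

def first_passage_time(traj, target):
    """Index of the first visit to target, or None if it is never reached."""
    hits = np.flatnonzero(traj == target)
    return int(hits[0]) if hits.size else None

rng = np.random.default_rng(0)
traj = simulate(C_S, start=0, n_steps=2000, rng=rng)
fpt = first_passage_time(traj, target=5)
```

Repeating this for many independent trajectories and collecting the first passage times to state 5 gives a distribution of first folding times of the kind analyzed in the next section.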

FOLDING SIMULATIONS

MFPTs from the unfolded state(s) to the native state, given in Table 1, were calculated

following Ref (161). The distribution of first folding times was determined by first

running 10,000 simulations of 50,000 steps each started from state 0 (for model H half

the simulations were started from state 4). The first folding time of each simulation

was then calculated and these values were plotted as a histogram with 100 bins. The

lag phase was determined by finding the first folding time with the maximum

probability. Exponential fits were calculated by fitting to the first 50 bins after the lag

phase (to avoid noise in less populated bins at longer first folding times). The

exponential fits and lag phases are also given in Table 1. Similar results were obtained

by randomizing matrix elements while maintaining the network topology, subject to

the constraints of detailed balance and metastability. Example matrices include

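$$C_{S,rand} = \begin{pmatrix}
6{,}000 & 3 & 0 & 0 & 0 & 0 \\
3 & 3{,}000 & 3 & 0 & 0 & 0 \\
0 & 3 & 1{,}000 & 3 & 0 & 0 \\
0 & 0 & 3 & 500 & 3 & 0 \\
0 & 0 & 0 & 3 & 1{,}000 & 3 \\
0 & 0 & 0 & 0 & 3 & 90{,}000
\end{pmatrix}$$

and

$$C_{H,rand} = \begin{pmatrix}
3{,}500 & 51 & 0 & 0 & 0 & 0 \\
51 & 1{,}000 & 68 & 0 & 0 & 70 \\
0 & 68 & 1{,}000 & 17 & 0 & 92 \\
0 & 0 & 17 & 1{,}000 & 97 & 42 \\
0 & 0 & 0 & 97 & 3{,}500 & 0 \\
0 & 70 & 92 & 42 & 0 & 90{,}000
\end{pmatrix}$$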

Figure 58. Implied timescales for the villin macrostate MSM.

Figure 59. Distribution of MFPTs between all pairs of non-native states for villin (A) on a linear scale

to demonstrate the peak does not shift significantly relative to the distribution shown in Figure 18B

and (B) on a log scale to highlight that the tail of the distribution does extend to about 60 ns.

Figure 60. Distributions of the MFPTs (A) from each non-native state to the native state and (B)

between every pair of non-native states for our 2,000 state NTL9(1-39) model. As discussed in Ref

(93), further refinement of this model is likely necessary. However, we do not expect the

qualitative trend of long timescales (relative to folding) for transitioning between unfolded states to

change.

Figure 61. Two conformations from different unfolded basins demonstrating the structural

heterogeneity of non-native states (especially in their non-native contacts) that, in combination with

the vastness of conformational space, result in slow transitions between unfolded states. The

structures are colored red to blue from the N-terminus to the C-terminus. Atoms for residues Arg

14, Trp 23, and Lys 32 are shown to highlight that 23 and 32 are in contact on the left while the

chain has rearranged such that 14 and 32 are in contact on the right. These images were made with

VMD (67).

Figure 62. Relaxation of the fraction folded starting from equally populated unfolded states (black is

data and blue is single exponential fit with τ≈810 ns). The beginning of the curve is dominated by

single exponential relaxation but deviations from this apparent two-state behavior become apparent

later.

Figure 63. Relaxation of the fraction unfolded for a villin model at the microstate level (thick black

line) and a biexponential fit (thin blue line) with time constants of ~60 and ~415 ns, at least

qualitatively consistent with time constants of ~70 and ~720 ns from experiment (56). We hope to

explain this behavior in a future work on villin. As in Ref. (4), the native state was defined as all

microstates with an average Cα RMSD to the crystal structure less than 3 Å.

Figure 64. The distance to the gold-standard model, measured via the relative entropy, for 40,000

trajectories up to 400 nanoseconds in length. The black lines are contours of equal amounts of

data. Again, there was insufficient data to resolve the upper right-hand corner of the plot.

Model    Exponential Fit    MFPT      Lag Phase
S        12,800             13,400    2,500
P        4,500              5,000     1,000
H        3,300              3,600     800

Table 1. Exponential fits, MFPTs, and lag phases (all in units of steps) for transitioning from the unfolded state(s) to the native state in the three simple models.

APPENDIX E: SUPPORTING INFORMATION FOR CHAPTER 5

SIMULATION DETAILS

Six initial starting conformations covering a range of 0 to 13 Å Cα RMSD to the

crystal structure were drawn from replica exchange simulations in implicit solvent

from Bill Swope and Jed Pitera at the IBM Almaden Research Center (136). These

conformations were energy minimized using a steepest-descents algorithm in the

Gromacs simulation package (43) with the AMBER03 force field (60). They were

then solvated in TIP3P water and the solvent was equilibrated at 300 K with the protein

coordinates held fixed. Finally, simulations were run on the Folding@home

distributed computing platform using an MPI-enabled version of Gromacs (58) at both

300 and 370 K. The details of this procedure are identical to those used in Ref (58)

and a full description can be found there. Most of the results described in this work are

from the 370 K data. This temperature was chosen to approximate the experimental

melting temperature, correcting for the fact that simulations tend to over-estimate the

melting temperature for this system (136).

Structures were rendered with PyMOL.

MSM CONSTRUCTION AND ANALYSIS

We used the MSMBuilder package (4, 10) to construct a microstate model with 30,000

states and a coarse-grained macrostate model with 5,000 states. The microstate model

was generated by clustering conformations stored at 5 ns intervals based on their Cα

RMSDs using the k-centers algorithm in MSMBuilder. The remaining data (50 ps

spacing) was then assigned to these clusters and used to construct a transition count

matrix (Cij = the number of observed transitions from state i at time t to state j at time

t+τ, where τ is the lag time of the model) and corresponding transition probability

matrix (Pij = probability of transitioning from state i at time t to state j at time t+τ,

where τ is the lag time of the model). The PCCA+ algorithm (20, 44, 261) was then

used to lump kinetically related microstates into 5,000 macrostates and these state

definitions were used to construct macrostate level transition count/probability

matrices.
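Building the transition count and probability matrices from state assignments can be sketched as below. The sliding-window counting (every frame pair separated by the lag is counted) is an assumption, as are the function names and the tiny example trajectory; the lag is given in frames.

```python
import numpy as np

def count_matrix(assignments, n_states, lag):
    """C[i, j] = number of observed transitions from state i at time t to state j at t + lag."""
    C = np.zeros((n_states, n_states))
    for traj in assignments:
        traj = np.asarray(traj)
        # np.add.at accumulates correctly even when an (i, j) pair repeats.
        np.add.at(C, (traj[:-lag], traj[lag:]), 1)
    return C

def transition_matrix(C):
    """Row-normalize counts; rows with no counts are left as zeros."""
    rows = C.sum(axis=1, keepdims=True)
    return np.divide(C, rows, out=np.zeros_like(C), where=rows > 0)

example_assignments = [[0, 0, 1, 1, 0]]
C_example = count_matrix(example_assignments, n_states=2, lag=1)
T_example = transition_matrix(C_example)
```

The same two functions apply at the macrostate level once each microstate label is replaced by the label of the macrostate that contains it.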

The lag time for each model was selected by computing the implied timescales

of the model

$$k = -\frac{\ln \mu}{\tau}$$

where μ is an eigenvalue, τ is the lag time, and k is a rate. This equation comes from

the equivalence between discrete time MSMs and continuous time master equations

(see Refs (6) and (3) for details). By plotting the implied timescales as a function of

the lag time one can identify the lag time at which they begin to level-off (satisfy the

Chapman-Kolmogorov test), indicating that the model is Markovian (34). Based on

this analysis, we chose a lag time of 5 ns for our microstate model (Figure 65), where

all the kinetic analyses in this work were performed.
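The implied timescale associated with an eigenvalue μ is the inverse of the rate k, that is, -τ / ln(μ). A minimal sketch of this calculation is given below; the function name and the two-state example are assumptions, and eigenvalues that are not strictly between 0 and 1 would need separate handling.

```python
import numpy as np

def implied_timescales(T, lag, n_timescales=3):
    """Implied timescales -lag / ln(mu) for the largest non-stationary eigenvalues of T."""
    evals = np.real(np.linalg.eigvals(T))
    # Sort descending and skip the stationary eigenvalue (mu = 1).
    mu = np.sort(evals)[::-1][1:n_timescales + 1]
    return -lag / np.log(mu)

T_two_state = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
timescales = implied_timescales(T_two_state, lag=5.0, n_timescales=1)
```

Repeating the calculation for transition matrices estimated at a series of lag times, and plotting the results against the lag time, gives the implied timescale plots used to choose the Markov time.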

To calculate the relaxation of the fraction folded as measured by some

observable we used the procedure from Ref (58) to distinguish folded and non-native

states and the procedure from Ref (4) to propagate the fraction folded. For example,

with the experimental surrogate (Trp22-Tyr33 quenching) we calculated the average

and standard deviation of the distance between these residues (Nativeave and Nativestd

respectively) in native-state simulations started from a model of D14A based on the

1LMB crystal structure. Five random conformations were drawn from each state and

used to calculate the average distance between these residues for that state (Stateave).

A state was considered to be native if Stateave < Nativeave - Nativestd and non-native

otherwise. The fraction-folded can then be calculated as the dot-product between a

vector with 1’s for folded states and 0’s for non-native ones with the state populations.

To mimic an ensemble T-jump we used two starting populations: 1) all states equally

populated and 2) all microstates in non-native macrostates (i.e. outside the most

populated macrostate) equally populated. The relaxation of these starting ensembles

was modeled by propagating the populations forward in time with the transition

probability matrix and calculating the fraction folded at each time step. The same

procedure was used for the fraction folded determined by the RMSD to the crystal

structure, which was examined to determine whether or not the Trp22-Tyr33 distance

could be measuring a more local rearrangement than full folding, as proposed for

villin (58). Figure 73 shows that these two observables gave similar timescales for the

full MSM and, while differences are apparent when the simulations started from β–

sheet structures are ignored, the timescales do not appear to be substantially slower for

the RMSD relaxation (Figure 74). The molecular and activated timescales (τm and τa

respectively) were obtained by fitting to the biexponential

$$A e^{-t/\tau_m} + B e^{-t/\tau_a} + C$$

where t is the time and A, B, and C are constants.
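The relaxation of the fraction folded can be sketched as repeated multiplication of a population vector by the transition probability matrix, followed by a dot product with a vector of 1's for folded states and 0's for the rest. The function name and the two-state example are assumptions; each step corresponds to one lag time.

```python
import numpy as np

def fraction_folded_relaxation(T, p0, is_native, n_steps):
    """Propagate populations with T and record the fraction folded at each step."""
    p = np.asarray(p0, dtype=float)
    p = p / p.sum()                        # normalize the starting ensemble
    native = np.asarray(is_native, dtype=float)
    out = np.empty(n_steps + 1)
    for step in range(n_steps + 1):
        out[step] = p @ native             # dot product with the folded-state indicator
        p = p @ T                          # advance the populations by one lag time
    return out

T_relax = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
relaxation = fraction_folded_relaxation(T_relax, p0=[1.0, 0.0],
                                        is_native=[0, 1], n_steps=200)
```

The resulting curve is what would then be fit to the biexponential form above to extract the two time constants.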

The states participating most strongly in a given transition mode are specified

by the corresponding left eigenvector (states with negative components are

interconverting with those with positive components, and the magnitude of the

eigenvector component gives the degree of participation) (1). The highest flux

pathways between sets of states were calculated as in Refs (258) and (5). Mean First

Passage Times (MFPTs) between states and Pfolds were calculated as in Ref (37).

Given our finite sampling, one can estimate the kinetic connectivity of a state

by counting the number of edges connecting it to other states (effectively a way of

counting the number of edges with probabilities above some threshold since all

connections would be made with infinite sampling).

Two residues were considered to be in contact if any pair of atoms was within

7 Å. Native contacts are those formed in the energy minimized model based on the

crystal structure 1LMB (130, 131). Solvent accessible surface areas were measured

using the g_sas program from Gromacs (43) with a 1.4 Å probe radius. The distance

between two residues is the distance between the centroids of their side chains.

Figure 65. Implied timescales for the full 370 K dataset.

Figure 66. Implied timescales for the 300 K dataset.

Figure 67. Implied timescales for ¾ of the 370 K dataset selected at random.

Figure 68. A coarse-grained view of the slowest transition with state sizes proportional to the free

energy and arrow widths proportional to the flux (see key in figure).

Figure 69. Another coarse-grained view of the slowest transition with state sizes proportional to the free

energy and arrow widths proportional to the flux (see key in figure). Here the states are laid out in

terms of the average number of β-sheet residues (calculated from 100 random conformations from

each state) and the pfold (probability of reaching the crystallographic state in L before the compact

β-sheet state in A).

Figure 70. Free energy projections of the microstate MSM onto typical order parameters like the radius

of gyration (Rg), the Cα RMSD to the crystal structure, and the distance between the Trp22 and

Tyr33 residues. Differences between the two panels highlight the difficulty in interpreting such

projections.

Figure 71. Free energy projection of the microstate MSM onto Pfold and the distance between the

Trp22 and Tyr33 residues. Obtaining projections onto kinetic order parameters like Pfold is greatly

simplified with MSMs. In this case Pfold refers to the probability of reaching the crystallographic

state before reaching the compact β-sheet state (i.e. the slow transition from Figure 21). Unlike the

projections in, this one hints that D14A may not be well described by a simple two- or three-state

model or that the Trp22-Tyr33 distance is not a good reaction coordinate, since there are a broad

range of Pfold values possible for a given Trp-Tyr distance. Indeed, analysis of the MSM reveals

that D14A is best described by a native hub.

Figure 72. The ten most populated macrostates with their equilibrium probabilities.

Figure 73. Relaxation of the fraction unfolded with different observables and observation times. The

thick black curves come from the MSM and the thin blue curves from biexponential fits to the

MSM relaxation. The top row shows relaxation of the fraction unfolded measured by the Trp22-

Tyr33 distance (A) starting from all states being equally populated and (B) starting from all non-

native states being equally populated. The bottom row shows relaxation of the fraction unfolded

measured by the Cα RMSD to the crystal structure (C) starting from all states being equally

populated and (D) starting from all non-native states being equally populated. Fitting parameters

are given in the figure (in units of microseconds). In this case, the fitting parameters are relatively

independent of the observable and starting distribution.

Figure 74. Relaxation of the fraction unfolded with different observables and observation times from an

MSM built without the trajectories started from β-sheet structures. The thick black curves come

from the MSM and the thin blue curves from biexponential fits to the MSM relaxation. The top row

shows relaxation of the fraction unfolded measured by the Trp22-Tyr33 distance (A) starting from

all states being equally populated and (B) starting from all non-native states being equally

populated. The bottom row shows relaxation of the fraction unfolded measured by the Cα RMSD to

the crystal structure (C) starting from all states being equally populated and (D) starting from all

non-native states being equally populated. Fitting parameters are given in the figure (in units of

microseconds). In this case the fitting parameters are more dependent on the observable, consistent

with the experimental observation of probe dependent kinetics.

Figure 75. Projection of the free energy onto pfold (A) from the compact β-sheet state in Figure 22A to

the native state in Figure 22H, (B) from the extended state in Figure 22E to the native state in

Figure 22H, and (C) from the extended state in Figure 22E to the native state in Figure 22G. None

are purely downhill, though some may be consistent with incipient downhill folding (i.e. have

sufficiently low barriers that there is a reasonable population at the barrier top that can fold in a

downhill manner in addition to activated folding across the barrier).

Figure 76. The helicity of each residue predicted from Agadir (143). The purple, numbered bars show

where the five helices are (the extra purple block between helices 4 and 5 is a turn).

APPENDIX F: SUPPORTING INFORMATION FOR CHAPTER 6

Figure 77. Uncertainty in the log base 10 of the relative entropies averaged over 10 independent

samples of (A) reference simulations of M1 and (B) adaptive sampling of M1. Black lines are

contours of equal amounts of data.

Figure 78. Uncertainty in the log base 10 of the relative entropies averaged over 10 independent

samples of (A) reference simulations of M2 and (B) adaptive sampling of M2. Black lines are

contours of equal amounts of data.

APPENDIX G: SUPPORTING INFORMATION FOR CHAPTER 9

SERIAL REPLICA EXCHANGE (SREMD)

Molecular Dynamics (MD) is a powerful technique for exploring the conformational

space of biomolecules. However, MD simulations often spend a significant portion of

time trapped in local free energy minima. Replica Exchange Molecular Dynamics

(REMD) (22, 23) was developed to overcome this problem by inducing a random

walk in temperature space. In REMD, independent MD simulations are performed in

parallel at different temperatures. At regular intervals attempts are made to exchange

configurations between temperatures. These exchanges are accepted according to a

well defined transition probability. The REMD scheme requires synchronization of

different processors, which makes it unsuitable for a heterogeneous distributed

computing environment.

Serial Replica Exchange Molecular Dynamics (SREMD) (177, 224) is a serial

version of REMD that is suitable for distributed computing. In SREMD, a single

simulation performs a random walk in temperature space by making regular attempts

to swap temperatures. The transition probability for this move is determined by one

potential energy from the simulation and a second one from a pre-stored potential

energy distribution function (PEDF) at the new temperature. SREMD has been shown

to be an efficient sampling method when applied in a distributed computing

environment (177). However, we note that SREMD is only approximately correct

unless the exact PEDFs are adopted.

SIMULATION DETAILS

Our simulations used the AMBER 94 potential (262). The SREMD algorithm was

implemented in a version of the GROMACS (43) molecular dynamics simulation

package modified for the Folding@Home (79) infrastructure (http://folding.stanford.edu).

The RNA molecule was solvated in a water box with 3943 TIP3P (263)

waters and 11 Na+. The simulation system was minimized using a steepest descent

algorithm, followed by a 100ps MD simulation applying a position restraint potential

to the RNA heavy atoms. All simulations were run with constant NVT by coupling to

a Nosé-Hoover thermostat with a coupling constant of 0.02 ps⁻¹ (63). A cutoff of 10 Å

was used for non-bonded interactions. Long-range electrostatic interactions were

treated with the Particle-Mesh Ewald (PME) method (264). Nonbonded pair-lists were

updated every 10 steps with an integration step size of 2 fs in all simulations. All

bonds were constrained using the LINCS algorithm (265).

2,800 SREMD simulations with an aggregate simulation time of 54.6 µs

starting from the NMR structure (PDB code 1ZIH) (209) were performed. The

temperature list was roughly exponentially distributed, with 56 temperatures covering

a range from 285 to 592K. To obtain initial estimates of the PEDFs, we performed 56

3ns SREMD simulations where every move was accepted. For the Folding@home

(FAH) runs, the initial temperatures were uniformly selected from the temperature list.

Thus, there are 50 simulations starting from each temperature, each with different

initial velocities. The PEDFs were updated every 40 ns for 40 iterations, then every

400 ns for 20 iterations, and finally every 1000 ns.
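The temperature list described above can be sketched as follows. This is only an illustration: the text says the 56 temperatures are "roughly" exponentially distributed between 285 and 592 K, so the exactly geometric spacing and the function name `geometric_temperature_ladder` are assumptions, not the list actually used.

```python
def geometric_temperature_ladder(t_min, t_max, n):
    """Return n temperatures with a constant ratio between neighbors.

    Assumption: exact geometric spacing. The original list is only
    described as roughly exponential.
    """
    if n < 2:
        raise ValueError("need at least two temperatures")
    ratio = (t_max / t_min) ** (1.0 / (n - 1))
    return [t_min * ratio ** k for k in range(n)]


# 56 temperatures spanning 285 K to 592 K, as quoted in the text.
temperatures = geometric_temperature_ladder(285.0, 592.0, 56)
```

With 2,800 simulations spread uniformly over such a list, each temperature is the starting point of 2800 / 56 = 50 simulations, in agreement with the text.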

TOPOLOGICAL METHOD (MAPPER) FOR PATHWAY ANALYSIS

Our SREMD simulations generate a massive number of configurations. Therefore, it is

difficult to discern the structure of the data. Such data is normally dominated by the

folded and unfolded structures. However, we are interested in understanding structures

in transition states or intermediate states. Direct application of clustering algorithms to

all the configurations will be biased toward the densest regions (i.e. folded/unfolded

states in this study), making it difficult to identify the sparsely populated intermediate

states of interest. Furthermore, such clustering methods will not provide any

information on the connectivity between different clusters.

To address such issues, Yao et al. (228) proposed a topological data analysis

method to explore pathways in biomolecular folding, based on Mapper14, a general

topological data analysis tool for high dimensional data sets. This method efficiently

identifies intermediate states along a pathway. Roughly speaking, we use Mapper with

filters based on some conditional density function estimated from the data. Then the

data is divided into overlapping level sets based on the filter. Single-linkage clustering

is then used within each density level. Finally a graph is generated with a node

corresponding to each cluster and edges between pairs of nodes in neighboring level

sets that have non-zero overlap.

We note that clusters may be intrinsically non-convex in biomolecular folding

problems. K-means type clustering algorithms will fail for such clusters. The use of

single-linkage clustering in density levels in Mapper allows the efficient discovery of

non-convex clusters and separates sparsely populated intermediate states from the

dominant unfolded/folded states. For details on how such a scheme works, readers are

referred to [13].

PEDFS

Figure 79 (a) shows SREMD PEDFs from our massive distributed computing
simulations. The convergence of the PEDFs can be verified by the χ² convergence
measure. The χ² convergence measure is defined as an integrated error as shown
below (224),

$$\chi^2(t) = \sum_{i=1}^{N} \left( P_i(t) - P_i^{\mathrm{ref}} \right)^2$$

where N is the number of bins in the potential histogram, Pi(t) is the value of the ith
bin of the potential energy histogram generated by potential energies collected over
time t at a particular temperature, and Piref is the reference PEDF.
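A minimal sketch of this convergence measure is given below: the sum over histogram bins of the squared difference between the current and reference distributions. The helper `normalized_histogram`, its binning range, and the choice to ignore energies outside the range are illustrative assumptions rather than the procedure used for Figure 79.

```python
import math


def normalized_histogram(energies, lo, hi, n_bins):
    """Histogram of potential energies on [lo, hi], normalized to sum to 1.

    Energies outside [lo, hi] are ignored (an assumption of this sketch).
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for e in energies:
        if lo <= e <= hi:
            index = min(int(math.floor((e - lo) / width)), n_bins - 1)
            counts[index] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no energies fall inside the histogram range")
    return [c / total for c in counts]


def chi_squared(hist, ref):
    """Sum over bins of the squared difference between hist and ref."""
    if len(hist) != len(ref):
        raise ValueError("histograms must have the same number of bins")
    return sum((p - q) ** 2 for p, q in zip(hist, ref))
```

Evaluating `chi_squared` on histograms built from increasingly long stretches of data, once against the final PEDF and once against the initial one, gives the two kinds of curves discussed for Figure 79 (b).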

Figure 79 (b) displays the χ² convergence measure averaged over all

temperatures. When the final PEDFs are used as the reference distributions, χ²(t)

decays to zero. On the other hand, when the PEDFs from the initial 3 ns constant

temperature simulations (Pinitial) are used as the reference, χ²(t) grows to a plateau

value. The χ²(t) values for single temperatures show the same trends as these

averaged values. Therefore, the PEDFs have converged.

MELTING CURVES

Figure 80 shows the native contacts melting curve. The data demonstrates that folded

conformations dominate at low temperatures while extended structures dominate at

high temperatures.

Figure 79. (a) Potential Energy Distribution Functions (PEDFs) generated from Folding@home data at

each of the 56 temperatures used. (b) The χ² convergence measure averaged over all temperatures

as a function of time. Triangles correspond to using Pfinal as the reference distribution and circles

correspond to using Pinitial as the reference.

Figure 80. Native contacts melting curve. Only every third temperature is displayed for clarity.

APPENDIX H: SUPPORTING INFORMATION FOR CHAPTER 10

INITIAL CONFIGURATIONS

We started our ST simulations from two different initial configurations as shown in

Figure 81: a near-native state and a random coil. The near-native state was created by

analogy to the NMR structure of the GCAA tetraloop (first structure of PDB code 1zih

(209)). The random coil conformation was created with the Nucleic Acid Builder

(266).

Figure 81. The two initial structures used in this study: A) A near-native conformation and B) a random

coil conformation.

THE CONVERGENCE OF WEIGHTS IN SIMULATED TEMPERING (ST)

SIMULATED TEMPERING

In Simulated Tempering (ST) (24, 25), configurations are sampled from a mixed

canonical ensemble in which the canonical ensembles with different temperatures are

weighted differently as defined by a generalized Hamiltonian:

$$\Xi_i(X, p) = \beta_i H(X, p) - g_i \tag{H1}$$

where βi = 1/(kBTi), H(X, p) is the Hamiltonian for the canonical ensemble at
temperature Ti, X denotes the conformation, and p is the momentum. The a priori
determined constant gi is the weight for the temperature Ti.

ST works as follows: a single simulation starts from a particular temperature
(Ti) and an attempt is made periodically to change the configuration (Xn) to another
temperature (Tj) according to a well defined transition probability by satisfying the
detailed balance condition,

$$P_i(X_n, p_n)\, P(i \to j) = P_j(X_n, p_n')\, P(j \to i) \tag{H2}$$

The probability of configuration Xn at temperature Ti for the expanded canonical
ensemble is

$$P_i(X_n, p_n) = \frac{1}{Z} \exp\left(-\Xi_i(X_n, p_n)\right) = \frac{1}{Z} \exp\left(-\beta_i H(X_n, p_n) + g_i\right) \tag{H3}$$

where pn is the momentum and Z is the partition function for the expanded canonical
ensemble. H(Xn, pn) is the sum of kinetic energy (K) and potential energy (U),
H(Xn, pn) = K(pn) + U(Xn).

A re-scaling of the momentum, $p_n' = \sqrt{T_j / T_i}\, p_n$, following the exchange
causes the kinetic energy to cancel out in the detailed balance equation, and the
transition probability after applying the Metropolis criterion is shown below,

$$P(i \to j) = \min\left\{1, \exp\left[-(\beta_j - \beta_i)\, U(X_n) + (g_j - g_i)\right]\right\} \tag{H4}$$

where U(Xn) is the potential energy for configuration Xn, which is sampled from the

canonical ensemble at Ti. A set of weights needs to be pre-determined to calculate

these transition probabilities. Without proper weighting, ST simulations will be

constrained to a subset of the temperature space and become inefficient (25, 177). It

was shown that weights leading the system to perform a random walk in temperature

space equal the unit-less free energies at different temperatures (24, 25).
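The temperature move described in this section can be sketched as follows. The acceptance probability is the Metropolis form min{1, exp[−(βj − βi)U + (gj − gi)]}, and an accepted move rescales the momenta by sqrt(Tj/Ti). The restriction to neighboring temperatures, the rejection of proposals that fall off either end of the list, the units (kJ/mol), and all function names are assumptions made for illustration, not details taken from the modified GROMACS code.

```python
import math
import random

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def st_acceptance(u, t_i, t_j, g_i, g_j, k_b=K_B):
    """Metropolis acceptance probability for a move from T_i to T_j.

    u is the current potential energy; g_i and g_j are the unit-less weights.
    """
    beta_i = 1.0 / (k_b * t_i)
    beta_j = 1.0 / (k_b * t_j)
    exponent = -(beta_j - beta_i) * u + (g_j - g_i)
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def attempt_temperature_move(u, index, temperatures, weights, rng):
    """Propose a neighboring temperature and accept or reject the move.

    Returns the new temperature index and the factor by which the momenta
    should be rescaled (1.0 if the move is rejected). Proposals outside the
    list are rejected so that the proposal stays symmetric.
    """
    proposal = index + rng.choice((-1, 1))
    if proposal < 0 or proposal >= len(temperatures):
        return index, 1.0
    p_accept = st_acceptance(u, temperatures[index], temperatures[proposal],
                             weights[index], weights[proposal])
    if rng.random() < p_accept:
        return proposal, math.sqrt(temperatures[proposal] / temperatures[index])
    return index, 1.0
```

With all weights set to zero and a negative potential energy, moves to higher temperatures are accepted with probability below one, which illustrates why poorly chosen weights confine the walk to part of the temperature list.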

SIMULATED TEMPERING EQUAL ACCEPTANCE RATIO (STEAR) METHOD

It is not an easy task to determine the free energy weights that enable the system to perform

a random walk in temperature space. The Simulated Tempering Equal Acceptance

Ratio (STEAR) method for determining the free energy weights is adopted in this

study (49, 177). This method is based on the property that the free energy weights

leading to uniform sampling must yield the same acceptance ratios for both forward

and backward transitions from Ti to Tj as shown below.

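$$\langle P_{i \to j}(g_j - g_i, U_i) \rangle = \langle P_{j \to i}(g_i - g_j, U_j) \rangle \tag{H5}$$

where

$$\begin{aligned}
\langle P_{i \to j} \rangle &= \int P_{i \to j}(g_j - g_i, U_i)\, P(U_i)\, dU_i \\
\langle P_{j \to i} \rangle &= \int P_{j \to i}(g_i - g_j, U_j)\, P(U_j)\, dU_j
\end{aligned} \tag{H6}$$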

where Ui is the potential energy for a configuration sampled from the canonical

ensemble at temperature Ti and P(Ui) is the potential energy distribution function

(PEDF) at Ti. PEDFs for each temperature are initially estimated from short trial MD

simulations and then updated during an equilibration phase preceding the production

phase, which uses a static set of weights. By solving Equation (H5), we can obtain a set of

near free energy weights.
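The equal acceptance ratio condition can be illustrated with sampled potential energies standing in for the PEDFs. The sketch below finds the weight difference gj − gi at which the average forward acceptance (estimated from energies sampled at Ti) equals the average backward acceptance (estimated from energies sampled at Tj). It uses bisection instead of the Newton's method used in this study, because the imbalance is monotonic in the weight difference. The function names and the search bracket are assumptions.

```python
import math


def mean_acceptance(energies, d_beta, d_g):
    """Average of min(1, exp(-d_beta * U + d_g)) over sampled energies U."""
    total = 0.0
    for u in energies:
        x = -d_beta * u + d_g
        total += 1.0 if x >= 0.0 else math.exp(x)
    return total / len(energies)


def equal_acceptance_weight_difference(u_i, u_j, beta_i, beta_j,
                                       lo=-1.0e3, hi=1.0e3, tol=1.0e-10):
    """Find dg = g_j - g_i that equalizes forward and backward acceptance.

    u_i and u_j are potential energies sampled at T_i and T_j. The bracket
    [lo, hi] is assumed to contain the solution.
    """
    def imbalance(dg):
        forward = mean_acceptance(u_i, beta_j - beta_i, dg)
        backward = mean_acceptance(u_j, beta_i - beta_j, -dg)
        return forward - backward  # non-decreasing in dg

    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Setting g1 to zero and accumulating these differences along the temperature list would give one weight per temperature, in the spirit of the procedure described in the next section.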

DETAILED PROCEDURE TO UPDATE THE WEIGHTS

The ST algorithm was implemented in version 3.1.4 of the GROMACS (43) molecular

dynamics simulation package modified for the Folding@Home (79) infrastructure

(http://folding.stanford.edu). In our ST simulations, the temperature list (T1 … Tn)

containing 56 temperatures is roughly exponentially distributed between 270 and 592

K. The detailed procedure to determine the weights using STEAR is described

below.

Obtaining the initial weights: For each of the two initial configurations (see

Figure 81), one 2 ns NVT simulation was carried out at each of 56 temperatures on a

computer cluster. Potential energies collected every 0.1 ps from the last nanosecond of

these simulations were used to get a rough approximation of the energy distribution at

each temperature. The weight (gi) that gives an equal acceptance ratio for transitions

from Ti to Ti+1 and vice versa is found using Newton’s method (See Equation (H5))

and g1 is set to zero.

Updating the weights: Once an initial set of weights has been chosen, we start

1120 ST simulations from each initial configuration on the Folding@Home distributed

computing environment. In these simulations, a temperature swap is attempted every

0.2 ps. At regular intervals (about every 300ns of simulation in total) all the new data

is collected and only new data is used to refine the approximation of the energy

distribution at each temperature. Newton’s method is then used to update the weights

to satisfy the equal acceptance ratio criterion given the new energy distributions as

shown in Equation (H5).

CONVERGENCE OF THE WEIGHTS

The weights obtained from two independent sets of ST simulations starting from

different initial configurations are well converged, as shown in Table 2. The weights

converge at about 9 ns for each initial configuration. As described before, a set of

converged weights, i.e. free energy weights, should induce uniform sampling of the

temperature space. As shown in Figure 82, both sets of simulations achieve uniform

sampling at about 9ns. Thus, after about 9 ns, the weights are held static and the

simulations are continued in what is called the production phase.

Figure 82. Amount of sampling at different temperatures for ST simulations started from the native (top

row) and coil (bottom row) configurations, computed from different segments of simulation time:

0-0.3 ns, 1.2-1.5 ns, 2.7-3.0 ns, and 8.7-9.0 ns. Uniform sampling is reached for both sets

of ST simulations, indicating that the weights have converged.

MOLECULAR DYNAMICS (MD) SIMULATION DETAILS

Our MD simulations used the nucleic acid parameters from the AMBER99 force field

(60, 267). The RNA molecule was solvated in a water box with 2543 TIP3P (263)

waters and 7 Na+ ions. The simulation system was minimized using a steepest descent

algorithm, followed by a 100ps MD simulation applying a position restraint potential

to the RNA heavy atoms. All NVT simulations were coupled to a Nosé-Hoover

thermostat with a coupling constant of 0.02 ps⁻¹ (63). A cutoff of 10 Å was used for

both vdW and short range electrostatic interactions. Long-range electrostatic

interactions were treated with the Particle-Mesh Ewald (PME) method (264).

Nonbonded pair-lists were updated every 10 steps with an integration step size of 2 fs

in all simulations. All bonds were constrained using the LINCS algorithm (265).

HIERARCHICAL K-MEDOIDS CLUSTERING ALGORITHM

A hierarchical K-medoids clustering algorithm developed by Boxer, G. is used in this

study. In K-medoids clustering one starts by choosing some number of random

conformations to be generators. All remaining conformations are then assigned to the

generator that they are most similar to, thus forming a state corresponding to each

generator. Each generator is then updated by choosing a number of random

conformations from its corresponding state and selecting the one that is closest to

every other conformation in the state (i.e. the one that is closest to the center of the

state) as the new generator. This updating procedure may be continued for some

predetermined number of iterations or until the answer converges. The basic idea of

hierarchical clustering is to perform K-medoids clustering on the entire dataset and

then to recursively perform K-medoids clustering on each state until every state has

fewer conformations than some threshold. This threshold is set as an input parameter

for the K-medoids clustering algorithm.
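The recursive scheme can be sketched as follows. This is not the implementation by Boxer used in this study: the generator update here examines every member of a state rather than a random subset, the distance function is supplied by the caller, and a state that cannot be split (for example, one made of identical points) is returned as is. All names are illustrative.

```python
import random


def k_medoids(points, k, distance, rng, n_iter=10):
    """Plain K-medoids: assign points to the nearest generator, then move
    each generator to the member with the smallest total distance to the
    other members of its state."""
    generators = rng.sample(points, min(k, len(points)))
    states = [list(points)]
    for _ in range(n_iter):
        states = [[] for _ in generators]
        for p in points:
            nearest = min(range(len(generators)),
                          key=lambda g: distance(p, generators[g]))
            states[nearest].append(p)
        states = [s for s in states if s]  # drop empty states
        updated = [min(s, key=lambda c: sum(distance(c, q) for q in s))
                   for s in states]
        if updated == generators:
            break
        generators = updated
    return states


def hierarchical_k_medoids(points, k, threshold, distance, rng):
    """Recursively split states until each has fewer than threshold points."""
    if len(points) < threshold:
        return [points]
    states = k_medoids(points, k, distance, rng)
    if len(states) == 1:  # could not be split any further
        return [points]
    result = []
    for state in states:
        result.extend(hierarchical_k_medoids(state, k, threshold, distance, rng))
    return result
```

For molecular conformations, `distance` would be a structural metric such as the heavy atom RMSD used later in this appendix; the sketch itself works with any symmetric distance.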

Table 2. Convergence of the weights for representative temperature pairs. The weight differences

Δg = gj − gi obtained from distributed computing simulations starting from a helical structure (third

column) and a coil structure (fourth column) are shown for different temperature pairs. Differences

between the free energy differences Δfji = gj/βj − gi/βi obtained from simulations starting from a helical

structure and a coil structure are displayed in the fifth column. kT at temperature i is shown in the sixth

column. Δfji(helical) − Δfji(coil) (kJ/mol) is smaller than kT (kJ/mol) at all temperature pairs.

MARKOV STATE MODELS

A Markov model is basically a graph representing the structure and temporal

connectivity of some dataset that consists of temporally ordered observations (3, 6). In

this case, each node corresponds to a set of kinetically similar conformations. These

nodes are connected by directed edges with corresponding values equal to the

probability of transitioning between them. For the model to be Markovian, the

probability of transitioning to state j must depend solely on the previous state.

A Markov State Model (MSM) may also be represented by a transition

probability matrix (see also Equation 1 in the main text),

$$P(\Delta t) = T(\Delta t)\, P(0) \tag{H7}$$

where P(∆t) is a vector of state populations at time ∆t, T is the column-stochastic

transition probability matrix, and ∆t is the lag time (or time step). Using this

representation, the time evolution of a vector representing the population of each state

may be calculated by repeatedly left-multiplying the column vector by the transition

probability matrix. The model also has a corresponding lag time, which is effectively

the time resolution of the model. Each step, or multiplication by the transition

probability matrix, is equivalent to one lag time. For the model to be Markovian there

must be a separation of timescales. That is, equilibration within states must occur on

timescales faster than the lag time while transitions between states must occur on

timescales longer than the lag time. The key is finding an appropriate balance between

the number of states in the model and the lag time. A desirable Markov model has few

enough states that it may be understood by a person and a lag time shorter than the

timescale of the process of interest.
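As a small illustration of this representation, the sketch below propagates a population vector by repeatedly left-multiplying it by a column-stochastic matrix, so that each multiplication advances the model by one lag time. The two-state matrix is a made-up example, not a model from this study.

```python
def propagate(t_matrix, populations, n_steps=1):
    """Left-multiply a column vector of state populations by a
    column-stochastic transition matrix n_steps times."""
    n = len(populations)
    p = list(populations)
    for _ in range(n_steps):
        p = [sum(t_matrix[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p


# Hypothetical two-state model; each column sums to 1.
T = [[0.9, 0.2],
     [0.1, 0.8]]
p0 = [1.0, 0.0]
p_one_lag = propagate(T, p0)        # populations after one lag time
p_long = propagate(T, p0, 200)      # approaches the equilibrium populations
```

After many steps the vector stops changing. For this example it approaches (2/3, 1/3), the distribution that is left unchanged by a further multiplication.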

The eigenvalues (μk) of the transition matrix each imply a timescale (τk),

$$\tau_k = -\frac{\tau}{\ln \mu_k(\tau)} \tag{H8}$$

where μk is an eigenvalue of the transition matrix with the lag time τ.
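The relation between an eigenvalue and its implied timescale, minus the lag time divided by the natural logarithm of the eigenvalue, can be checked on a two-state example. For a 2x2 stochastic matrix the eigenvalues are 1 and the trace minus 1, so no eigensolver is needed. The matrix entries and function names are illustrative assumptions.

```python
import math


def implied_timescale(eigenvalue, lag_time):
    """Implied timescale for an eigenvalue strictly between 0 and 1."""
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must lie strictly between 0 and 1")
    return -lag_time / math.log(eigenvalue)


def second_eigenvalue_2x2(t_matrix):
    """Non-unit eigenvalue of a 2x2 stochastic matrix (trace minus 1)."""
    return t_matrix[0][0] + t_matrix[1][1] - 1.0


# Hypothetical two-state model with a lag time of one step.
T = [[0.9, 0.2],
     [0.1, 0.8]]
timescale = implied_timescale(second_eigenvalue_2x2(T), lag_time=1.0)
```

The eigenvalue equal to 1 corresponds to the equilibrium distribution and implies an infinite timescale, which is why it is excluded by the range check.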

The focus of the current study is thermodynamics instead of kinetics. The first

left eigenvector of the transition matrix Tij corresponds to the equilibrium distribution

(6).

SPLITTING INTO MICROSTATES

The first step in our procedure to build an MSM is to divide all the conformations

sampled into small sets of structurally similar configurations called microstates (3, 6).

This is accomplished using the hierarchical K-medoids clustering algorithm described

in Section 3. For example, by setting the threshold for the hierarchical K-medoids

clustering to stop splitting a certain state as 2500 conformations, we divided 1.3

million conformations generated from long ASM seeding simulations into 1,597

microstates. Heavy atom RMSD is used as the distance metric, since it accounts for

both local similarities between pairs of conformations as well as global ones. This

distance metric has also been shown to be able to distinguish between kinetically

distinct conformations. If the state population threshold is chosen to be small enough

then the conformations in one microstate may be considered to be kinetically as well

as structurally similar as it would require very few MD steps to get from one to

another. As shown in Figure 83, overlaid structures from the same microstate have

great structural similarity. Based on this assumption, one may build a microstate

Markov model by using the original data to calculate the probability of transitioning

between each pair of microstates (stored as a transition probability matrix). Because of

the small size of each microstate, this Markov model will have too many states to

provide any insight into the nature of the free energy landscape. To gain a clearer

understanding of the free energy landscape one may lump together kinetically similar

microstates to form macrostates. These macrostates comprise a new MSM that

hopefully has an appropriate separation of timescales.

Figure 83. Three example structures from a single microstate.

LUMPING INTO METASTABLE STATES

Lumping is done by first calculating the eigenvalues and eigenvectors of the

microstate transition probability matrix (44). The eigenvalues are related to the

timescale for interconverting between two sets of microstates while the corresponding

eigenvectors indicate which microstates constitute these two sets if the model is

Markovian at this timescale. We estimate the number of macrostates based on the gap

in the implied timescales (see Equation (H8)) of the microstate transition probability

matrix as a function of the lag time. As shown in Figure 84, there are six macrostates

for the seeding simulations.

Figure 84. The largest one hundred implied timescales as a function of the lag time for (a) the ST

simulations starting from the coil initial configuration and (b) the long adaptive seeding microstate

MSM.

Sets of kinetically related microstates are grouped together into macrostates

using a spectral clustering algorithm: Perron Cluster Cluster Analysis (PCCA) (45).

While generating the transition count matrix, all the recorded transitions are

independent (i.e. transitions from time t to 2t, 2t to 3t, etc). The initial lumping

calculated from this data is refined by using a Simulated Annealing (SA) scheme to

maximize the metastability (Q) of the model (6). Twenty SA runs of 20,000 steps each

are used. In each simulated annealing step, a microstate is randomly reassigned to a

new macrostate and the move is accepted using the Metropolis criterion. The

metastability is defined as the sum of the self-transition probabilities of each

macrostate, $Q = \sum_{i=1}^{N} P_{ii}$. Maximizing the metastability is assumed to be a good way for

maximizing the separation of timescales necessary for a valid MSM. The metastability

is shown in Table 3.

                N. Metastable States    Q       <Pii>
ST (Native)     6                       5.09    0.848
ST (Coil)       6                       5.01    0.835
Seeding         6                       5.61    0.935

Table 3. Metastability (Q) and average self-transition probability <Pii> between metastable states for

the MSMs built from ST simulations and seeding simulations.
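The metastability of a given lumping can be computed directly from a microstate transition count matrix, as in the sketch below. Counts are summed within macrostates, each row is normalized, and the self-transition probabilities are added up. The row convention (`counts[a][b]` counts transitions from a to b), the example counts, and the function names are assumptions for illustration.

```python
def lump_counts(counts, assignment, n_macro):
    """Sum microstate transition counts into macrostate transition counts.

    assignment[a] is the macrostate index of microstate a.
    """
    lumped = [[0.0] * n_macro for _ in range(n_macro)]
    for a, row in enumerate(counts):
        for b, c in enumerate(row):
            lumped[assignment[a]][assignment[b]] += c
    return lumped


def metastability(lumped):
    """Sum of the self-transition probabilities of the macrostates."""
    q = 0.0
    for i, row in enumerate(lumped):
        total = sum(row)
        if total > 0:
            q += row[i] / total
    return q


# Hypothetical counts for four microstates lumped into two macrostates.
counts = [[90, 10, 0, 0],
          [10, 85, 5, 0],
          [0, 5, 80, 15],
          [0, 0, 15, 85]]
q = metastability(lump_counts(counts, [0, 0, 1, 1], 2))
```

A simulated annealing refinement like the one described above would repeatedly change one entry of `assignment`, recompute `q`, and accept or reject the change with the Metropolis criterion. Dividing Q by the number of macrostates gives the average self-transition probability, as in Table 3 (for example, 5.09 / 6 ≈ 0.848).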

DETERMINING STATE POPULATIONS AND UNCERTAINTIES

Simulation trajectories are used to estimate transitions between different metastable

states in order to build a MSM. Such estimation induces uncertainties in any property

computed from the model including the metastable state equilibrium population we

pursued in this study. Therefore, obtaining the uncertainties is important to test the

reliability of our results. In order to estimate these uncertainties, we employ a

Bayesian method introduced by Noé (251). Assuming that the system is Markovian at

the given lag time, the method defines the following stochastic model for its

parameters. The likelihood of any trajectory is simply the product of independent

transition probabilities, as a consequence of the Markov property, and the transition

probability matrix T is assigned an independent, symmetric Dirichlet prior in each

row. This is the conjugate prior for the Markov likelihood, which means that the

posterior distribution of T after observing a number of transitions has the same

functional form as the prior. This method makes the further assumption that the

system obeys detailed balance, so the distributions of T are restrained to the space of

reversible stochastic matrices. This distribution is difficult to normalize analytically,

but it may be sampled using a Markov Chain Monte Carlo (MCMC) algorithm. It was

shown (251) that the restriction to reversible matrices greatly reduces the uncertainty

of many thermodynamic properties, which is why it was deemed necessary in our

study. Using this method, we were able to sample from the posterior distribution of T,

given our simulation data, to obtain stable Monte Carlo estimates of the deviations of

equilibrium populations.
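A simplified version of this idea is sketched below. Each row of the transition matrix is drawn from its Dirichlet posterior (observed counts plus a symmetric prior), the equilibrium distribution of each sampled matrix is computed, and the spread over samples estimates the uncertainty. Note that this sketch omits the detailed balance constraint used in the method of Noé (251), so it is not that method and will generally give broader distributions. The prior value, the row-stochastic convention, and all names are assumptions.

```python
import random
import statistics


def sample_transition_matrix(counts, prior, rng):
    """Draw each row from Dirichlet(counts_row + prior) using gamma variates."""
    matrix = []
    for row in counts:
        draws = [rng.gammavariate(c + prior, 1.0) for c in row]
        total = sum(draws)
        matrix.append([d / total for d in draws])
    return matrix


def stationary_distribution(matrix, n_iter=2000):
    """Power iteration pi <- pi T for a row-stochastic matrix."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi


def population_uncertainty(counts, n_samples=200, prior=1.0, seed=0):
    """Mean and standard deviation of equilibrium populations over samples."""
    rng = random.Random(seed)
    samples = [stationary_distribution(sample_transition_matrix(counts, prior, rng))
               for _ in range(n_samples)]
    n = len(counts)
    means = [statistics.mean([s[k] for s in samples]) for k in range(n)]
    stdevs = [statistics.stdev([s[k] for s in samples]) for k in range(n)]
    return means, stdevs
```

Because the rows are sampled independently, no Markov Chain Monte Carlo step is needed here. Restricting the samples to reversible matrices, as in the study, is what makes MCMC sampling necessary.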

A SIMPLE MODEL OF NON-ARRHENIUS, METASTABLE DYNAMICS

SIMPLE POTENTIAL

GE algorithms attempt to overcome the sampling problem by inducing a random walk

in temperature space, where high temperatures help systems cross energetic barriers.

However, it has been shown that GE simulations will provide little improvement when

the folding kinetics are non-Arrhenius, and the dominant barriers are entropic at high

temperatures. In order to demonstrate the efficiency of the ASM in comparison with

the GE algorithms, we introduce a model 2D potential to fully contrast the

convergence of equilibrium statistics from the different algorithms. The model is

based on a discrete-state system introduced by Zwanzig (252) as a simple model for

protein folding, which is similar in spirit to continuous-space models used to study

anti-Arrhenius dynamics by the Levy group (241). These models define an energy

surface reminiscent of a golf-course, which is almost everywhere flat with some bias

toward the folded state and has a sharp decline near the folded state. On the other

hand, the degeneracy of the microstates increases sharply as we move away from the

folded conformation, providing an entropic advantage that stabilizes the unfolded

macrostate at higher temperatures.

The system of Zwanzig (252) was modified by introducing an additional,

uncoupled degree of freedom, which has the effect of creating intermediate states

between the folded and unfolded states. The energy as a function of the two

independent parameters S and R is

$$E_{S,R} = SU + RU - \varepsilon\,\delta_{S0} - \varepsilon\,\delta_{R0} + (2\varepsilon - \varepsilon_0)\,\delta_{R0}\,\delta_{S0} \tag{H9}$$

where S = 0, ..., NS and R = 0, ..., NR. The constant U determines the slope of the
energy function as we move away from the folded state along each coordinate; ε
represents the drop in energy when one of the coordinates becomes 0, while ε0 is the
depth of the energy well of the completely folded state, where both S and R equal 0.

The degeneracy of each microstate is given by:

,S RS R

S R

N Ng

SS

(H10)

With all this information, it is straightforward to analytically derive the partition

function

( , ),

0 0

2( (1 ) 1)( (1 ) 1)

S R

S R

N NE S R

S RS R

N NU U

Q e g

e e e e e e

(H11)

The equilibrium probability of each of the (NR+1)(NS+1)microstates is now easy to

compute by

( , )

( , )E S Re

P S RQ

(H12)

In the current study, we select parameters =4, =100, =1.5, U=1, and NR =

NS = 7 for our purpose of mimicking the non-Arrhenius folding kinetics. The

Potential of Mean Force (PMF), G = −ln P(S, R), at a range of temperatures is

displayed in Figure 85. The PMF plots suggest 4 metastable macrostates, separated by black

dashed lines in Figure 85 (the state decomposition will be discussed in the next paragraph):

the folded state where S = R = 0 (state 1), the unfolded state where S > 0 and R > 0 (state 4),

and two intermediate states where either S = 0 (state 2) or R = 0 (state 3).

Figure 85. Potential of Mean Force (PMF) for the simple potential at 1/kT values of (a) 0.995, (b) 0.652, and (c)

0.456. In part (a), the four metastable macrostates are separated by dashed black lines and labeled.

As expected, the free energy of the folded state decreases as we increase the

temperature, while the opposite is true of the unfolded state. This is also shown in

Figure 86 where the equilibrium populations of four macrostates are plotted as a

function of β = 1/kT. Intermediate states 2 and 3 have low

populations at both low and high temperatures, but reach their maximum values at

intermediate temperatures with β ≈ 0.65.

Figure 86. Populations of the four macrostates as a function of β = 1/kT.

The potential was equipped with discrete-time Metropolis-Hastings Monte

Carlo dynamics, where the proposal probabilities are proportional to the state

degeneracy for states where at least one of S and R changes by 1, and zero for all

others. A Markovian transition probability matrix T was computed at each

temperature, from which we obtained evidence for non-Arrhenius behavior and

metastability. The non-Arrhenius behavior can be seen in Figure 87, where we plot the

folding and unfolding rates as a function of temperature, computed as the inverse of

the mean first passage times between the folded and unfolded states. The mean first

passage times are computed using the method described by Singhal et al. (37). The

unfolding rate increases with temperature. However, the folding rate decreases with

temperature due to the high entropic barriers for refolding at high temperatures.

Metastability for this system is confirmed by the large gap between the third and

fourth timescales implied by T as shown in Figure 88. At all temperatures, the third

largest timescale is at least a factor of 5 greater than the fourth implied timescale.

Therefore, we confirm that there is a separation of timescales for this system, and it

has four metastable macrostates. The first 3 implied timescales correspond to the

transitions between macrostates, while other shorter implied timescales correspond to

transitions within macrostates. The state decomposition can be obtained with the spectral

clustering algorithm Perron Cluster Cluster Analysis (PCCA) (45), and the resulting

definitions of the four metastable states are shown in Figure 85 (a).

Figure 87. Folding (black) and unfolding (red) rates are plotted as a function of β = 1/kT.
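Rates of the kind plotted in Figure 87 are inverses of mean first passage times. As an illustration only, the sketch below obtains mean first passage times to a target state by iterating the standard relation m_i = 1 + Σ_{j ≠ target} T_ij m_j until it stops changing. This is a generic textbook approach and not necessarily the method of Singhal et al. (37). It assumes a row-stochastic matrix in which the target can be reached from every state, and the two-state matrix is a made-up example.

```python
def mean_first_passage_times(t_matrix, target, n_iter=100000, tol=1e-12):
    """Mean number of steps needed to first reach target from each state.

    t_matrix[i][j] is the probability of moving from state i to state j in
    one step (row-stochastic convention, an assumption of this sketch).
    """
    n = len(t_matrix)
    m = [0.0] * n
    for _ in range(n_iter):
        new = [0.0 if i == target else
               1.0 + sum(t_matrix[i][j] * m[j] for j in range(n) if j != target)
               for i in range(n)]
        change = max(abs(a - b) for a, b in zip(new, m))
        m = new
        if change < tol:
            break
    return m


# Hypothetical two-state chain: state 0 moves to state 1 with probability 0.1.
T = [[0.9, 0.1],
     [0.2, 0.8]]
mfpt = mean_first_passage_times(T, target=1)
rate = 1.0 / mfpt[0]  # rate of 0 -> 1 in units of inverse steps
```

For this chain the mean first passage time from state 0 to state 1 is 1 / 0.1 = 10 steps, which gives a rate of 0.1 per step.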

COMPARING EFFICIENCY OF ASM AND GE USING THE SIMPLE POTENTIAL.

To test our hypothesis that GE algorithms, in particular Simulated Tempering (ST),

would exhibit a slower rate of convergence for equilibrium statistics than ASM, we

simulated 1000 trajectories of 6×10⁶ steps using each method. An optimal list of 10

temperatures with β = 1.1, 0.995, 0.939, 0.89, 0.827, 0.652, 0.554, 0.519, 0.491, and

0.456 is selected for ST to obtain acceptance ratios greater than 40% between all

neighbouring temperatures. The weights (gi) are chosen analytically from the partition

function (177) to enable the system to uniformly sample every temperature,

$$g_i = -\ln Q(\beta_i) \tag{H13}$$

An equal number of trajectories was started from each temperature, with

temperature change proposals done every 10 steps of simulation. Two independent

sets of ST simulations are performed with initial state 0 and 4 respectively.
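The effect of choosing each weight as minus the logarithm of the partition function at that temperature can be checked numerically. In the expanded ensemble, the total probability of temperature i is proportional to Q(βi) exp(gi), so these weights make every temperature equally likely. The sketch below uses a made-up three-level system, not the two-dimensional model potential of this section, and all names are illustrative.

```python
import math


def partition_function(levels, beta):
    """Q(beta) for a list of (energy, degeneracy) pairs."""
    return sum(degeneracy * math.exp(-beta * energy)
               for energy, degeneracy in levels)


def temperature_probabilities(levels, betas, weights):
    """Probability of each temperature in the expanded ensemble,
    proportional to Q(beta_i) * exp(g_i)."""
    unnormalized = [partition_function(levels, b) * math.exp(g)
                    for b, g in zip(betas, weights)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical toy system: (energy, degeneracy) pairs.
levels = [(-4.0, 1), (0.0, 10), (1.0, 50)]
betas = [1.1, 0.652, 0.456]
weights = [-math.log(partition_function(levels, b)) for b in betas]
uniform = temperature_probabilities(levels, betas, weights)
unweighted = temperature_probabilities(levels, betas, [0.0] * len(betas))
```

With the analytical weights every temperature has probability 1/3, whereas with all weights set to zero the temperatures with larger partition functions are visited more often.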

Figure 88. Logarithms of the implied timescales as a function of β for the 2D potential are displayed.

The three slowest timescales are plotted using up triangle, down triangle, and cross points

respectively.

For ASM, we simulated 250 trajectories from each of the 4 macrostates at a

constant temperature of β = 0.995, at which the folded state is the dominant state in

order to mimic the situation at physiological temperatures.

The convergence of the equilibrium populations from ST was analyzed in the

following way. For a set number of trajectories, we take a window of 50,000 steps,

and compute the fraction of the configurations at a certain metastable state and

temperature β = 0.995 within this window. By bootstrapping this estimator 100 times,

we can determine the distribution of the state populations as a function of simulation time

(see Figure 89). Populations obtained from the two independent sets of ST simulations

are converged between 2.5×10⁵ and 3×10⁵ steps.

Figure 89. Populations computed from Simulated Tempering (ST) simulations for the four metastable

states are plotted as a function of the length of the simulation. The reference population is shown

as solid lines and 1000 trajectories are used for this calculation. The error bars are the standard

deviation obtained from bootstrapping 100 times with replacement.
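The bootstrap estimate of the error bars can be sketched as follows. The input is assumed to be one number per trajectory: the fraction of configurations in a given metastable state (at the temperature of interest) within the chosen window. Trajectories are resampled with replacement 100 times and the spread of the resampled averages is reported. The function name and input format are assumptions.

```python
import random
import statistics


def bootstrap_population(fractions, n_boot=100, seed=0):
    """Bootstrap mean and standard deviation of a state population.

    fractions holds one per-trajectory estimate of the population. Each
    bootstrap sample draws len(fractions) trajectories with replacement.
    """
    rng = random.Random(seed)
    n = len(fractions)
    estimates = []
    for _ in range(n_boot):
        resample = [fractions[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(resample) / n)
    return statistics.mean(estimates), statistics.stdev(estimates)
```

Repeating this for windows placed at increasing simulation times gives a population estimate and error bar as a function of simulation length.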

Similarly for ASM, we obtain a distribution for the equilibrium populations

with different trajectory length for a certain number of trajectories, which is computed

by a Bayesian method (251). As shown in Figure 90, it only takes about 4×10⁴ steps

for ASM to converge to the correct populations, which is much more efficient than

ST. The populations in Figure 90 are computed using a lag time of 1/3 of the trajectory

length. However, the populations are almost invariant to the lag time

if it is longer than about 1/8 of the trajectory. We note that one has to choose a proper

lag time in order to get a good estimate of the populations. A good lag time has to be

small enough that there are enough transition counts, but not so small that

many of the transition counts are correlated. In our RNA hairpin example, we use a small lag

time but only a few transition counts are taken from each trajectory to make sure we

only consider independent transition events. In that case, we can still estimate

thermodynamic properties accurately even though the model is not Markovian at

the lag time used.
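
The trade-off between the number of transition counts and their independence can be illustrated with a short Python sketch. Note that it substitutes a simple maximum-likelihood (row-normalized) transition matrix for the Bayesian method used in this work, and that the function names, the `stride` parameter, and the small pseudocount are hypothetical choices made only for this example.

```python
import numpy as np

def count_transitions(trajs, n_states, lag, stride=None):
    """Count transitions i -> j separated by `lag` steps.

    stride=lag (the default) uses non-overlapping windows, so fewer but less
    correlated counts are collected; stride=1 uses every frame as a start point.
    """
    stride = lag if stride is None else stride
    counts = np.zeros((n_states, n_states))
    for traj in trajs:
        traj = np.asarray(traj)
        starts = np.arange(0, len(traj) - lag, stride)
        # np.add.at accumulates correctly when the same (i, j) pair repeats.
        np.add.at(counts, (traj[starts], traj[starts + lag]), 1)
    return counts

def stationary_populations(counts, pseudocount=1e-8):
    """Equilibrium populations from the leading left eigenvector of the
    row-normalized transition matrix (maximum-likelihood estimate)."""
    c = counts + pseudocount            # avoids division by zero for empty rows
    T = c / c.sum(axis=1, keepdims=True)
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    return pi / pi.sum()
```

Calling `count_transitions` with a larger `stride` mimics taking only a few counts per trajectory, at the cost of larger statistical error in the resulting populations.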

Figure 90. Populations computed from the Adaptive Seeding Method (ASM) for four metastable states,

plotted as a function of simulation length. The reference population is shown as

solid lines, and 1000 trajectories are used for this calculation. The lag time is selected as 1/3 of the

length of the simulation. The error bars are the standard deviation obtained from a Bayesian method

(see Section 2.5.3 for details).

To compare the efficiency of ASM and ST as a function of the length and number

of trajectories, we define a convergence criterion as follows: the probability

that the estimated populations for all states are within 5% of the actual equilibrium

populations is greater than 80%. The population distributions are computed in the same

way as in Figure 89 for ST and in Figure 90 for ASM. As shown in Figure 92, ASM is

much more efficient than ST, reaching convergence with simulations 4-7 times shorter

than those of ST. In addition, the efficiency of ST does not increase with the

number of trajectories beyond 200, while the efficiency of ASM keeps increasing with

the number of trajectories up to 600. We think that, ideally, the length of the seeding

simulations should lie in the major gap of the implied timescales, such that they are

longer than the slowest intra-macrostate equilibration time to minimize the model

error due to non-Markovian effects. In the current system, the minimum length of the

simulations (~$5 \times 10^4$ steps) indeed lies in the gap between the 3rd and 4th slowest

implied timescales, which spans $1.61 \times 10^3$ to $9.58 \times 10^4$. There is evidence from the RNA hairpin example and previous

work on a water dewetting transition in a carbon nanotube (7) that these requirements

for the lag time may be relaxed for real systems, where the separation of timescales is

less evident than in the model system studied here. Additionally, the number of

seeding simulations has to be big enough to reduce the statistical error to a satisfactory

level.
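
The convergence criterion defined above (all state populations within 5% of the reference with probability greater than 80%) can be sketched in Python as follows. We assume here that "within 5%" means a relative deviation from the reference population, and that the population distribution is available as a set of samples (bootstrap replicates for ST, posterior draws for ASM); the function name and argument layout are hypothetical.

```python
import numpy as np

def is_converged(samples, reference, tol=0.05, prob=0.80):
    """Check the convergence criterion on sampled population estimates.

    samples: array of shape (n_samples, n_states), one population vector per row.
    reference: array of shape (n_states,), the reference equilibrium populations.
    Returns (converged, fraction), where fraction is the share of samples in
    which every state lies within the relative tolerance `tol`.
    """
    samples = np.asarray(samples, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Relative tolerance is an assumption; an absolute 0.05 would be a one-line change.
    within = np.abs(samples - reference) <= tol * reference
    fraction = float(np.mean(np.all(within, axis=1)))
    return fraction > prob, fraction
```

Scanning simulation length for a fixed number of trajectories and recording the first length at which this check passes would give one point on a curve like the one in Figure 92.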

Figure 91. Populations computed from ASM simulations for four metastable states as a function of lag

time.

Figure 92. Number of steps required to reach convergence as a function of the number of trajectories.

BIBLIOGRAPHY

1. Schütte C, Fischer A, Huisinga W, & Deuflhard P (1999) A direct approach to conformational dynamics based on hybrid Monte Carlo. J Comput Phys 151:146–168.

2. Bowman GR, Huang X, & Pande VS (2010) Network models for molecular kinetics and their initial applications to human health. Cell Res 20:622-630.

3. Noe F & Fischer S (2008) Transition networks for modeling the kinetics of conformational change in macromolecules. Curr Opin Struct Biol 18:154-162.

4. Bowman GR, Beauchamp KA, Boxer G, & Pande VS (2009) Progress and challenges in the automated construction of Markov state models for full protein systems. J Chem Phys 131:124101.

5. Noe F, Schutte C, Vanden-Eijnden E, Reich L, & Weikl TR (2009) Constructing the equilibrium ensemble of folding pathways from short off-equilibrium simulations. Proc Natl Acad Sci U S A 106:19011-19016.

6. Chodera JD, Singhal N, Pande VS, Dill KA, & Swope WC (2007) Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J Chem Phys 126:155101.

7. Sriraman S, Kevrekidis IG, & Hummer G (2005) Coarse nonlinear dynamics and metastability of filling-emptying transitions: Water in carbon nanotubes. Phys. Rev. Lett. 95:130603.

8. Gfeller D, De Los Rios P, Caflisch A, & Rao F (2007) Complex network analysis of free-energy landscapes. Proc Natl Acad Sci U S A 104:1817-1822.

9. Schutte C (1999) Conformational Dynamics: Modeling, Theory, Algorithm, and Application to Biomolecules. (thesis, Freie Universitat Berlin).

10. Bowman GR, Huang X, & Pande VS (2009) Using generalized ensemble simulations and Markov state models to identify conformational states. Methods 49:197-201.

11. Sriraman S, Kevrekidis IG, & Hummer G (2005) Coarse master equation from Bayesian analysis of replica molecular dynamics simulations. J Phys Chem B 109:6479-6484.

12. Huang X, Bowman GR, Bacallado S, & Pande VS (2009) Rapid equilibrium sampling initiated from nonequilibrium data. Proc Natl Acad Sci U S A 106:19765-19769.

13. Huang X, et al. (2010) Constructing multi-resolution Markov state models (MSMs) to elucidate RNA hairpin folding mechanisms. Pac Symp Biocomput 15:228-239.

14. Noe F, Horenko I, Schutte C, & Smith JC (2007) Hierarchical analysis of conformational dynamics in biomolecules: transition networks of metastable states. J Chem Phys 126:155102.

15. Sarich M, Noe F, & Schutte C (2010) On the approximation quality of Markov state models. SIAM Multiscale Model Simul, in press.

16. Bowman GR & Pande VS (2010) Protein folded states are kinetic hubs. Proc Natl Acad Sci U S A 107:10890-10895.

17. Rao F & Caflisch A (2004) The protein folding network. J Mol Biol 342:299-306.

18. Bowman GR, Ensign DL, & Pande VS (2010) Enhanced modeling via network theory: adaptive sampling of Markov state models. J Chem Theory Comput 6:787-794.

19. Hinrichs NS & Pande VS (2007) Calculation of the distribution of eigenvalues and eigenvectors in Markovian state models for molecular dynamics. J Chem Phys 126:244101.

20. Roblitz S (2008) Statistical error estimation and grid-free hierarchical refinement in conformation dynamics. (thesis, Freie Universitat Berlin).

21. Mitsutake A, Sugita Y, & Okamoto Y (2001) Generalized-ensemble algorithms for molecular simulations of biopolymers. Biopolymers 60:96-123.

22. Hansmann UH & Okamoto Y (1999) New Monte Carlo algorithms for protein folding. Curr. Opin. Struct. Biol. 9:177-183.

23. Sugita Y & Okamoto Y (1999) Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. 314:141-151.

24. Lyubartsev AP, Martsinovski AA, Shevkunov SV, & Vorontsov-Velyaminov PN (1992) New approach to Monte Carlo calculation of the free energy: Method of expanded ensembles. J. Chem. Phys. 96:1776-1783.

25. Marinari E & Parisi G (1992) Simulated Tempering: a New Monte Carlo Scheme. Europhys. Lett. 19:451-458.

26. Zhou R, Berne BJ, & Germain R (2001) The free energy landscape for beta hairpin folding in explicit water. Proc. Natl. Acad. Sci. USA 98:14931-14936.

27. Rhee YM & Pande VS (2003) Multiplexed-replica exchange molecular dynamics method for protein folding simulation. Biophysical journal 84:775-786.

28. Nymeyer H & Garcia AE (2003) Simulation of the folding equilibrium of alpha-helical peptides: a comparison of the generalized Born approximation with explicit solvent. Proc. Natl. Acad. Sci. USA 100:13934-13939.

29. Zhou R (2003) Trp-cage: folding free energy landscape in explicit water. Proc. Natl. Acad. Sci. USA 100:13280-13285.

30. Krivov SV & Karplus M (2004) Hidden complexity of free energy surfaces for peptide (protein) folding. Proc. Natl. Acad. Sci. U.S.A. 101:14766-14770.

31. Karpen ME, Tobias DJ, & Brooks CL, 3rd (1993) Statistical clustering techniques for the analysis of long molecular dynamics trajectories: analysis of 2.2-ns trajectories of YPGDV. Biochemistry 32:412-420.

32. Shao JY, Tanner SW, Thompson N, & Cheatham TE (2007) Clustering molecular dynamics trajectories: 1. Characterizing the performance of different clustering algorithms. J. Chem. Theory Comp. 3:2312-2334.

33. Buchete NV & Hummer G (2008) Coarse master equations for peptide folding dynamics. J Phys Chem B 112:6057-6069.

34. Swope WC, Pitera JW, & Suits F (2004) Describing protein folding kinetics by molecular dynamics simulations. 1. Theory. J Phys Chem B 108:6571-6581.

35. Frauenfelder H, Sligar SG, & Wolynes PG (1991) The energy landscapes and motions of proteins. Science 254:1598-1603.

36. Yang WY & Gruebele M (2004) Detection-dependent kinetics as a probe of folding landscape microstructure. J Am Chem Soc 126:7758-7759.

37. Singhal N, Snow CD, & Pande VS (2004) Using path sampling to build better Markovian state models: predicting the folding rate and mechanism of a tryptophan zipper beta hairpin. J. Chem. Phys. 121:415-425.

38. Elmer S, Park S, & Pande VS (2005) Foldamer dynamics expressed via Markov State Models: 2. Explicit solvent molecular dynamics simulations in acetonitrile, chloroform, methanol, and water. J. Chem. Phys. 122:124908.

39. Jayachandran G, Vishal V, & Pande VS (2006) Folding Simulations of the Villin Headpiece in All-Atom Detail. J. Chem. Phys. 124:164902.

40. Kelley NW, Vishal V, Krafft GA, & Pande VS (2008) Simulating oligomerization at experimental concentrations and long timescales: A Markov state model approach. J Chem Phys 129:214707.

41. Gonzalez T (1985) Clustering to minimize the maximum intercluster distance. Theo. Comp. Sci. 38:293-306.

42. Dasgupta S & Long PM (2005) Performance guarantees for hierarchical clustering. J. Comput. System Sci. 70:555-569.

43. Lindahl E, Hess B, & van der Spoel D (2001) GROMACS 3.0: a package for molecular simulation and trajectory analysis. J. Mol. Model. 7:306-317.

44. Deuflhard P & Weber M (2005) Robust Perron cluster analysis in conformation dynamics. Lin. Alg. Appl. 398:161-184.

45. Deuflhard P, Huisinga W, Fischer A, & Schütte C (2000) Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains. Lin. Alg. Appl. 315:39-59.

46. Anfinsen CB, Haber E, Sela M, & White FH, Jr. (1961) The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc Natl Acad Sci USA 47:1309-1314.

47. Klein WL, Stine WB, Jr., & Teplow DB (2004) Small assemblies of unmodified amyloid beta-protein are the proximate neurotoxin in Alzheimer's disease. Neurobiol Aging 25:569-580.

48. Simons KT, Kooperberg C, Huang E, & Baker D (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268:209-225.

49. Bowman GR & Pande VS (2009) Simulated tempering yields insight into the low-resolution Rosetta scoring functions. Proteins 74:777-788.

50. Bolhuis PG, Dellago C, & Chandler D (2000) Reaction coordinates of biomolecular isomerization. Proc Natl Acad Sci U S A 97:5877-5882.

51. Du R, Pande VS, Grosberg AY, Tanaka T, & Shakhnovich ES (1998) On the transition coordinate for protein folding. J Chem Phys 108:334-350.

52. Bowman GR, et al. (2008) Structural insight into RNA hairpin folding intermediates. J Am Chem Soc 130:9676-9678.

53. Dill KA, Ozkan SB, Shell MS, & Weikl TR (2008) The protein folding problem. Annu Rev Biophys 37:289-316.

54. Chodera JD, Swope WC, Pitera JW, & Dill KA (2006) Long-timescale protein folding dynamics from short-time molecular dynamics simulations. Multi Mod Simul 5:1214–1226.

55. Yang S, Banavali NK, & Roux B (2009) Mapping the conformational transition in Src activation by cumulating the information from multiple molecular dynamics trajectories. Proc Natl Acad Sci U S A 106:3776-3781.

56. Kubelka J, Chiu TK, Davies DR, Eaton WA, & Hofrichter J (2006) Sub-microsecond protein folding. J Mol Biol 359:546-553.

57. Chiu TK, et al. (2005) High-resolution x-ray crystal structures of the villin headpiece subdomain, an ultrafast folding protein. Proc Natl Acad Sci USA 102:7517-7522.

58. Ensign DL, Kasson PM, & Pande VS (2007) Heterogeneity even at the speed limit of folding: large-scale molecular dynamics study of a fast-folding variant of the villin headpiece. J Mol Biol 374:806-816.

59. Berendsen HJC, van der Spoel D, & van Drunen R (1995) Gromacs - a Message-Passing Parallel Molecular-Dynamics Implementation. Computer Physics Communications 91:43-56.

60. Wang JM, Cieplak P, & Kollman PA (2000) How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? Journal of computational chemistry 21:1049-1074.

61. Ryckaert JP, Ciccotti G, & Berendsen HJC (1977) Numerical Integration of the Cartesian Equations of Motion of a System with Constraints: Molecular Dynamics of n-Alkanes. J. Comp. Phys. 23:327-341.

62. Miyamoto S & Kollman PA (1992) Settle - an Analytical Version of the Shake and Rattle Algorithm for Rigid Water Models. Journal of computational chemistry 13:952-962.

63. Hoover W (1985) Canonical dynamics: Equilibrium phase-space distributions. Phys. Rev. A 31:1695-1697.

64. Nose S & Klein ML (1983) Constant Pressure Molecular-Dynamics for Molecular-Systems. Molecular Physics 50:1055-1076.

65. Nose S (1984) A Molecular-Dynamics Method for Simulations in the Canonical Ensemble. Molecular Physics 52:255-268.

66. Parrinello M & Rahman A (1981) Polymorphic Transitions in Single-Crystals - a New Molecular-Dynamics Method. Journal of Applied Physics 52:7182-7190.

67. Humphrey W, Dalke A, & Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14:33-38.

68. Schultheis V, Hirschberger T, Carstens H, & Tavan P (2005) Extracting Markov Models of Peptide Conformational Dynamics from Simulation Data. JCTC 1:515-526.

69. Bolhuis PG, Chandler D, Dellago C, & Geissler PL (2002) Transition path sampling: throwing ropes over rough mountain passes, in the dark. Annu Rev Phys Chem 53:291-318.

70. Dill KA, Ozkan SB, Weikl TR, Chodera JD, & Voelz VA (2007) The protein folding problem: when will it be solved? Curr Opin Struct Biol 17:342-346.

71. Plaxco KW, Simons KT, & Baker D (1998) Contact order, transition state placement and the refolding rates of single domain proteins. J Mol Biol 277:985-994.

72. Yang WY & Gruebele M (2003) Folding at the speed limit. Nature 423:193-197.

73. Kubelka J, Hofrichter J, & Eaton WA (2004) The protein folding 'speed limit'. Curr Opin Struct Biol 14:76-88.

74. Udgaonkar JB (2008) Multiple routes and structural heterogeneity in protein folding. Annu Rev Biophys 37:489-510.

75. Pitera JW & Swope W (2003) Understanding folding and design: replica-exchange simulations of "Trp-cage" miniproteins. Proc Natl Acad Sci U S A 100:7587-7592.

76. Zagrovic B, Snow CD, Shirts MR, & Pande VS (2002) Simulation of folding of a small alpha-helical protein in atomistic detail using worldwide-distributed computing. J Mol Biol 323:927-937.

77. Ensign DL & Pande VS (2009) The Fip35 WW domain folds with structural and mechanistic heterogeneity in molecular dynamics simulations. Biophys J 96:L53-55.

78. Horng JC, Moroz V, & Raleigh DP (2003) Rapid cooperative two-state folding of a miniature alpha-beta protein and design of a thermostable variant. J Mol Biol 326:1261-1270.

79. Shirts M & Pande VS (2000) COMPUTING: Screen Savers of the World Unite! Science 290:1903-1904.

80. Friedrichs MS, et al. (2009) Accelerating molecular dynamic simulation on graphics processing units. J Comput Chem 30:864-872.

81. Onufriev A, Bashford D, & Case DA (2004) Exploring protein native states and large-scale conformational changes with a modified generalized born model. Proteins 55:383-394.

82. Shell MS, Ritterson R, & Dill KA (2008) A test on peptide stability of AMBER force fields with implicit solvation. J Phys Chem B 112:6878-6886.

83. Hoffman DW, et al. (1994) Crystal structure of prokaryotic ribosomal protein L9: a bi-lobed RNA-binding protein. EMBO J 13:205-212.

84. Shirts MR & Pande VS (2001) Mathematical analysis of coupled parallel simulations. Phys Rev Lett 86:4983-4987.

85. Ensign DL & Pande VS (2009) Bayesian single-exponential kinetics in single-molecule experiments and simulations. J Phys Chem B 113:12410-12423.

86. Panchenko AR, Luthey-Schulten Z, & Wolynes PG (1996) Foldons, protein structural modules, and exons. Proc Natl Acad Sci U S A 93:2008-2013.

87. Metzner P, Schutte C, & Vanden-Eijnden E (2009) Transition Path Theory for Markov Jump Processes. Multiscale Modeling & Simulation 7:1192-1219.

88. Weikl TR (2008) Loop-closure principles in protein folding. Archives of Biochemistry and Biophysics 469:67-75.

89. Snow CD, Rhee YM, & Pande VS (2006) Kinetic definition of protein folding transition state ensembles and reaction coordinates. Biophys J 91:14-24.

90. Uversky VN (2009) Intrinsic disorder in proteins associated with neurodegenerative diseases. Front Biosci 14:5188-5238.

91. Bowman GR & Pande VS (2009) The roles of entropy and kinetics in structure prediction. PLoS One 4:e5840.

92. Ozkan SB, Wu GA, Chodera JD, & Dill KA (2007) Protein folding by zipping and assembly. Proc Natl Acad Sci U S A 104:11987-11992.

93. Voelz VA, Bowman GR, Beauchamp KA, & Pande VS (2010) Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39). J Am Chem Soc 132:1526-1528.

94. Jackson SE & Fersht AR (1991) Folding of chymotrypsin inhibitor 2. 1. Evidence for a two-state transition. Biochemistry 30:10428-10435.

95. Bryngelson JD, Onuchic JN, Socci ND, & Wolynes PG (1995) Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins 21:167-195.

96. Barrick D (2009) What have we learned from the studies of two-state folders, and what are the unanswered questions about two-state protein folding? Phys Biol 6:15001.

97. Spudich GM, Miller EJ, & Marqusee S (2004) Destabilization of the Escherichia coli RNase H kinetic intermediate: switching between a two-state and three-state folding mechanism. J Mol Biol 335:609-618.

98. Radford SE, Dobson CM, & Evans PA (1992) The folding of hen lysozyme involves partially structured intermediates and multiple pathways. Nature 358:302-307.

99. Kamagata K, Sawano Y, Tanokura M, & Kuwajima K (2003) Multiple parallel-pathway folding of proline-free Staphylococcal nuclease. J Mol Biol 332:1143-1153.

100. Ma H & Gruebele M (2006) Low barrier kinetics: dependence on observables and free energy surface. J Comput Chem 27:125-134.

101. Wales DJ & Scheraga HA (1999) Global optimization of clusters, crystals, and biomolecules. Science 285:1368-1372.

102. Wetlaufer DB (1973) Nucleation, rapid folding, and globular intrachain regions in proteins. Proc Natl Acad Sci U S A 70:697-701.

103. Myers JK & Oas TG (2001) Preorganized secondary structure as an important determinant of fast protein folding. Nat Struct Biol 8:552-558.

104. Krishna MM, Maity H, Rumbley JN, Lin Y, & Englander SW (2006) Order of steps in the cytochrome C folding pathway: evidence for a sequential stabilization mechanism. J Mol Biol 359:1410-1419.

105. Volk M, et al. (1997) Peptide Conformational Dynamics and Vibrational Stark Effects Following Photoinitiated Disulfide Cleavage. J Phys Chem B 101:8607.

106. Sabelko J, Ervin J, & Gruebele M (1999) Observation of strange kinetics in protein folding. Proc Natl Acad Sci U S A 96:6031-6036.

107. Liu F & Gruebele M (2007) Tuning lambda6-85 towards downhill folding at its melting temperature. J Mol Biol 370:574-584.

108. Liu F, et al. (2009) A one-dimensional free energy surface does not account for two-probe folding kinetics of protein alpha(3)D. J Chem Phys 130:061101.

109. Ghosh K & Dill KA (2007) The ultimate speed limit to protein folding is conformational searching. J Am Chem Soc 129:11920-11927.

110. Betancourt MR & Onuchic JN (1995) Kinetics of protein like models: The energy landscape factors that determine folding. J Chem Phys 103:773.

111. Cho SS, Levy Y, & Wolynes PG (2006) P versus Q: structural reaction coordinates capture protein folding on smooth landscapes. Proc Natl Acad Sci U S A 103:586-591.

112. Leopold PE, Montal M, & Onuchic JN (1992) Protein folding funnels: a kinetic approach to the sequence-structure relationship. Proc Natl Acad Sci U S A 89:8721-8725.

113. Nettels D, Gopich IV, Hoffmann A, & Schuler B (2007) Ultrafast dynamics of protein collapse from single-molecule photon statistics. Proc Natl Acad Sci U S A 104:2655-2660.

114. Waldauer SA, et al. (2008) Ruggedness in the folding landscape of protein L. HFSP J 2:388-395.

115. Voelz VA, Singh VR, Wedemeyer WJ, Lapidus LJ, & Pande VS (2010) Unfolded state dynamics and structure of protein L characterized by simulation and experiment. J Am Chem Soc 132:4702-4709.

116. Watts DJ & Strogatz SH (1998) Collective dynamics of 'small-world' networks. Nature 393:440-442.

117. Barabasi AL & Albert R (1999) Emergence of scaling in random networks. Science 286:509-512.

118. Dill KA & Chan HS (1997) From Levinthal to pathways to funnels. Nat Struct Biol 4:10-19.

119. Milgram S (1967) The small world problem. Psychol Today 1:61-67.

120. Chung HS, Louis JM, & Eaton WA (2009) Experimental determination of upper bound for transition path times in protein folding from single-molecule photon-by-photon trajectories. Proc Natl Acad Sci U S A 106:11837-11844.

121. Fersht AR (2002) On the simulation of protein folding by short time scale molecular dynamics and distributed computing. Proc Natl Acad Sci U S A 99:14122-14125.

122. Saven JG, Wang J, & Wolynes PG (1994) Kinetics of Protein-Folding - the Dynamics of Globally Connected Rough Energy Landscapes with Biases. J Chem Phys 101:11037-11043.

123. Wang J, Saven JG, & Wolynes PG (1996) Kinetics in a globally connected, correlated random energy model. J Chem Phys 105:11276-11284.

124. Du R, Pande VS, Grosberg AY, Tanaka T, & Shakhnovich ES (1999) On the role of conformational geometry in protein folding. J Chem Phys 111:10375.

125. Andrec M, Felts AK, Gallicchio E, & Levy RM (2005) Protein folding pathways from replica exchange simulations and a kinetic network model. Proc Natl Acad Sci U S A 102:6801-6806.

126. Kim PS & Baldwin RL (1990) Intermediates in the folding reactions of small proteins. Annu Rev Biochem 59:631-660.

127. Shan B, Eliezer D, & Raleigh DP (2009) The unfolded state of the C-terminal domain of the ribosomal protein L9 contains both native and non-native structure. Biochemistry 48:4707-4719.

128. Kuzmenkina EV, Heyes CD, & Nienhaus GU (2005) Single-molecule Forster resonance energy transfer study of protein dynamics under denaturing conditions. Proc Natl Acad Sci U S A 102:15471-15476.

129. McLeish TC (2005) Protein folding in high-dimensional spaces: hypergutters and the role of nonnative interactions. Biophys J 88:172-183.

130. Pabo CO & Lewis M (1982) The operator-binding domain of lambda repressor: structure and DNA recognition. Nature 298:443-447.

131. Clarke ND, Beamer LJ, Goldberg HR, Berkower C, & Pabo CO (1991) The DNA binding arm of lambda repressor: critical contacts from a flexible region. Science 254:267-270.

132. Huang GS & Oas TG (1995) Submillisecond folding of monomeric lambda repressor. Proc Natl Acad Sci U S A 92:6878-6882.

133. Burton RE, Huang GS, Daugherty MA, Calderone TL, & Oas TG (1997) The energy landscape of a fast-folding protein mapped by Ala-->Gly substitutions. Nat Struct Biol 4:305-310.

134. Ghaemmaghami S, Word JM, Burton RE, Richardson JS, & Oas TG (1998) Folding kinetics of a fluorescent variant of monomeric lambda repressor. Biochemistry 37:9179-9185.

135. Liu F, Gao YG, & Gruebele M (2010) A survey of lambda repressor fragments from two-state to downhill folding. J Mol Biol 397:789-798.

136. Larios E, Pitera JW, Swope W, & Gruebele M (2006) Correlation of early orientational ordering of engineered λ6–85 structure with kinetics and thermodynamics. Chem Phys 323:45-53.

137. Yang WY & Gruebele M (2004) Folding lambda-repressor at its speed limit. Biophys J 87:596-608.

138. Allen LR, Krivov SV, & Paci E (2009) Analysis of the free-energy surface of proteins from reversible folding simulations. PLoS Comput Biol 5:e1000428.

139. Yang WY, Larios E, & Gruebele M (2003) On the extended beta-conformation propensity of polypeptides at high temperature. J Am Chem Soc 125:16220-16227.

140. Hoffmann A, et al. (2007) Mapping protein collapse with single-molecule fluorescence and kinetic synchrotron radiation circular dichroism spectroscopy. Proc Natl Acad Sci U S A 104:105-110.

141. DeCamp SJ, Naganathan AN, Waldauer SA, Bakajin O, & Lapidus LJ (2009) Direct observation of downhill folding of lambda-repressor in a microfluidic mixer. Biophys J 97:1772-1777.

142. Ma H & Gruebele M (2005) Kinetics are probe-dependent during downhill folding of an engineered lambda6-85 protein. Proc Natl Acad Sci U S A 102:2283-2287.

143. Munoz V & Serrano L (1994) Elucidating the folding problem of helical peptides using empirical parameters. Nat Struct Biol 1:399-409.

144. Portman J, Takada S, & Wolynes PG (1998) Variational Theory for Site Resolved Protein Folding Free Energy Surfaces. Phys Rev Lett 81:5237-5240.

145. Burton RE, Myers JK, & Oas TG (1998) Protein Folding Dynamics: Quantitative Comparison between Theory and Experiment. Biochemistry 37:5337–5343.

146. Pande VS (2010) A simple theory of protein folding kinetics. Phys Rev Lett, in submission.

147. Liu F, et al. (2008) An experimental survey of the transition between two-state and downhill protein folding scenarios. Proc Natl Acad Sci U S A 105:2369-2374.

148. He Y, Yeh DC, Alexander P, Bryan PN, & Orban J (2005) Solution NMR structures of IgG binding domains with artificially evolved high levels of sequence identity but different folds. Biochemistry 44:14055-14061.

149. Rhee YM & Pande VS (2006) On the role of chemical detail in simulating protein folding kinetics. Chem Phys 323:66-77.

150. Bradley P, Misura KM, & Baker D (2005) Toward high-resolution de novo structure prediction for small proteins. Science 309:1868-1871.

151. Das R, et al. (2007) Structure prediction for CASP7 targets using extensive all-atom refinement with Rosetta@home. Proteins 69:118-128.

152. Klepeis JL, Lindorff-Larsen K, Dror RO, & Shaw DE (2009) Long-timescale molecular dynamics simulations of protein structure and function. Curr Opin Struct Biol 19:120-127.

153. Geyer CJ (1992) Practical Markov Chain Monte Carlo. Stat. Sci. 7:473-511.

154. King RD, et al. (2009) The automation of science. Science 324:85-89.

155. Pande VS, et al. (2003) Atomistic protein folding simulations on the submillisecond time scale using worldwide distributed computing. Biopolymers 68:91-109.

156. Faradjian AK & Elber R (2004) Computing time scales from reaction coordinates by milestoning. J Chem Phys 120:10880-10889.

157. Rogal J & Bolhuis PG (2008) Multiple state transition path sampling. J Chem Phys 129:224107.

158. MacKay DJC (2003) Information theory, inference, and learning algorithms (Cambridge University Press, Cambridge, UK ; New York) p 34.

159. Shell MS (2008) The relative entropy is fundamental to multiscale and inverse thermodynamic problems. J. Chem. Phys. 129:144108.

160. Cover TM & Thomas JA (2006) Elements of information theory (Wiley-Interscience, Hoboken, N.J.) 2nd Ed pp xxiii, 748 p.

161. Singhal N & Pande VS (2005) Error analysis and efficient sampling in Markovian state models for molecular dynamics. J Chem Phys 123:204909.

162. Baker D (2006) Prediction and design of macromolecular structures and interactions. Philos Trans R Soc Lond B Biol Sci 361:459-463.

163. Misura KM & Baker D (2005) Progress and challenges in high-resolution refinement of protein structure models. Proteins 59:15-29.

164. Schueler-Furman O, Wang C, Bradley P, Misura K, & Baker D (2005) Progress in modeling of protein structures and interactions. Science 310:638-642.

165. Kuhlman B, et al. (2003) Design of a novel globular protein fold with atomic-level accuracy. Science 302:1364-1368.

166. Kortemme T, et al. (2004) Computational redesign of protein-protein interaction specificity. Nat Struct Mol Biol 11:371-379.

167. Ashworth J, et al. (2006) Computational redesign of endonuclease DNA binding and cleavage specificity. Nature 441:656-659.

168. Nauli S, Kuhlman B, & Baker D (2001) Computer-based redesign of a protein folding pathway. Nat Struct Biol 8:602-605.

169. Nauli S, et al. (2002) Crystal structures and increased stabilization of the protein G variants with switched folding pathways NuG1 and NuG2. Protein Sci 11:2924-2931.

170. Qian B, et al. (2007) High-resolution structure prediction and the crystallographic phase problem. Nature 450:259-264.

171. Rothlisberger D, et al. (2008) Kemp elimination catalysts by computational enzyme design. Nature 453:190-195.

172. Simons K, et al. (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 34:82–95.

173. Shortle D, Simons K, & Baker D (1998) Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci USA 95:11158–11162.

174. Lee M, Tsai J, Baker D, & Kollman PA (2001) Molecular dynamics in the endgame of protein structure prediction. J Mol Biol 313:417–430.

175. Chivian D, et al. (2005) Prediction of CASP6 structures using automated robetta protocols. Proteins 61:157–166.

176. Rohl C, Strauss C, Misura K, & Baker D (2004) Protein structure prediction using rosetta. Meth Enzymol 383:66–93.

177. Huang X, Bowman GR, & Pande VS (2008) Convergence of folding free energy landscapes via application of enhanced sampling methods in a distributed computing environment. J Chem Phys 128:205106.

178. McGuffin LJ, Bryson K, & Jones DT (2000) The PSIPRED protein structure prediction server. Bioinformatics 16:404-405.

179. Meiler J, Muller M, Zeidler A, & Schmaschke F (2001) Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks. Journal of Molecular Modeling 7:360-369.

180. Karplus K & Hu BR (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics 17:713-720.

181. Ouali M & King RD (2000) Cascaded multiple classifiers for secondary structure prediction. Protein Science 9:1162-1176.

182. Bystroff C, Simons KT, Han KF, & Baker D (1996) Local sequence-structure correlations in proteins. Current Opinion in Biotechnology 7:417-421.

183. Engh RA & Huber R (1991) Accurate Bond and Angle Parameters for X-Ray Protein-Structure Refinement. Acta Crystallographica Section A 47:392-400.

184. Neria E, Fischer S, & Karplus M (1996) Simulation of activation free energies in molecular systems. Journal of Chemical Physics 105:1902-1921.

185. Dunbrack RL & Cohen FE (1997) Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Science 6:1661-1681.

186. Lazaridis T & Karplus M (1999) Effective energy function for proteins in solution. Proteins-Structure Function and Genetics 35:133-152.

187. Kortemme T, Morozov AV, & Baker D (2003) An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. J Mol Biol 326:1239-1259.

188. Morozov AV, Kortemme T, Tsemekhman K, & Baker D (2004) Close agreement between the orientation dependence of hydrogen bonds observed in protein structures and quantum mechanical calculations. Proc Natl Acad Sci USA 101:6946-6951.

189. Park S & Pande VS (2007) Choosing weights for simulated tempering. Phys Rev E Stat Nonlin Soft Matter Phys 76:016703.

190. Shirts M & Chodera J (2008) Statistically optimal analysis of samples from multiple equilibrium states. J Chem Phys 129:124105.

191. Kumar S, Bouzida D, Swendsen RH, Kollman PA, & Rosenberg JM (1992) The Weighted Histogram Analysis Method for Free-Energy Calculations on Biomolecules .1. The Method. J Comp Chem 13:1011-1021.

192. Noble MEM, Musacchio A, Saraste M, Courtneidge SA, & Wierenga RK (1993) Crystal-Structure of the Sh3 Domain in Human Fyn - Comparison of the 3-Dimensional Structures of Sh3 Domains in Tyrosine Kinases and Spectrin. Embo Journal 12:2617-2624.

193. Derrick JP & Wigley DB (1994) The third IgG-binding domain from streptococcal protein G. An analysis by X-ray crystallography of the structure alone and in a complex with Fab. J Mol Biol 243:906-918.

194. Cornilescu G, Marquardt JL, Ottiger M, & Bax A (1998) Validation of protein structure from anisotropic carbonyl chemical shifts in a dilute liquid crystalline phase. Journal of the American Chemical Society 120:6836-6837.

195. Heurgue-Hamard V, et al. (2006) The zinc finger protein Ynr046w is plurifunctional and a component of the eRF1 methyltransferase in yeast. Journal of Biological Chemistry 281:36140-36148.

196. Yang JS, Chen WW, Skolnick J, & Shakhnovich EI (2007) All-atom ab initio folding of a diverse set of proteins. Structure 15:53-63.

197. Yang JS, Wallin S, & Shakhnovich EI (2008) Universality and diversity of folding mechanics for three-helix bundle proteins. Proc Natl Acad Sci U S A 105:895-900.

198. Moult J (2005) A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 15:285-289.

199. Das R & Baker D (2008) Macromolecular modeling with rosetta. Annu Rev Biochem 77:363-382.

200. Shmygelska A & Levitt M (2009) Generalized ensemble methods for de novo structure prediction. Proc Natl Acad Sci U S A.

201. Sugita Y, Kitao A, & Okamoto Y (2000) Multidimensional replica-exchange method for free-energy calculations. J Chem Phys 113:6042-6051.

202. Neale C, Rodinger T, & Pomès R (2008) Equilibrium exchange enhances the convergence rate of umbrella sampling. Chem Phys Lett 460:375–381.

203. Rao F & Caflisch A (2003) Replica exchange molecular dynamics simulations of reversible folding. J Chem Phys 119:4035-4042.

204. Clarke ND, Kissinger CR, Desjarlais J, Gilliland GL, & Pabo CO (1994) Structural studies of the engrailed homeodomain. Protein Sci 3:1779-1787.

205. Tsai CJ, Maizel JV, & Nussinov R (2000) Anatomy of protein structures: Visualizing how a one-dimensional protein chain folds into a three-dimensional shape. Proc Natl Acad Sci USA 97:12038-12043.

206. Haspel N, Tsai CJ, Wolfson H, & Nussinov R (2003) Reducing the computational complexity of protein folding via fragment folding and assembly. Protein Sci 12:1177-1187.

207. Kifer I, Nussinov R, & Wolfson HJ (2008) Constructing templates for protein structure prediction by simulation of protein folding pathways. Proteins 73:380-394.

208. Uhlenbeck OC (1990) Tetraloops and RNA folding. Nature 346:613-614.

209. Jucker FM, Heus HA, Yip PF, Moors EH, & Pardi A (1996) A network of heterogeneous hydrogen bonds in GNRA tetraloops. J Mol Biol 264:968-980.

210. Woese CR, Winker S, & Gutell RR (1990) Architecture of ribosomal RNA: constraints on the sequence of "tetra-loops". Proc Natl Acad Sci USA 87:8467-8471.

211. Varani G (1995) Exceptionally stable nucleic acid hairpins. Annual review of biophysics and biomolecular structure 24:379-404.

212. Marino JP, Gregorian RS, Csankovszki G, & Crothers DM (1995) Bent helix formation between RNA hairpins with complementary loops. Science 268:1448-1454.

213. Pley HW, Flaherty KM, & McKay DB (1994) Model for an RNA tertiary interaction from the structure of an intermolecular complex between a GAAA tetraloop and an RNA helix. Nature 372:111-113.

214. Glück A, Endo Y, & Wool IG (1992) Ribosomal RNA identity elements for ricin A-chain recognition and catalysis. Analysis with tetraloop mutants. J Mol Biol 226:411-424.

215. Ansari A & Kuznetsov SV (2005) Is hairpin formation in single-stranded polynucleotide diffusion-controlled? The journal of physical chemistry B 109:12982-12989.

216. Roth A, et al. (2007) A riboswitch selective for the queuosine precursor preQ1 contains an unusually small aptamer domain. Nat Struct Mol Biol 14:308-317.

217. Sorin EJ, Rhee YM, & Pande VS (2005) Does water play a structural role in the folding of small nucleic acids? Biophys J 88:2516-2524.

218. Kannan S & Zacharias M (2007) Folding of a DNA hairpin loop structure in explicit solvent using replica-exchange molecular dynamics simulations. Biophys J 93:3218-3228.

219. Garcia AE & Paschek D (2008) Simulation of the pressure and temperature folding/unfolding equilibrium of a small RNA hairpin. J Am Chem Soc 130:815-817.

220. Ansari A, Kuznetsov SV, & Shen Y (2001) Configurational diffusion down a folding funnel describes the dynamics of DNA hairpins. Proc Natl Acad Sci USA 98:7771-7776.

221. Jung J & Van Orden A (2006) A three-state mechanism for DNA hairpin folding characterized by multiparameter fluorescence fluctuation spectroscopy. J Am Chem Soc 128:1240-1249.

222. Ma H, Wan C, Wu A, & Zewail AH (2007) DNA folding and melting observed in real time redefine the energy landscape. Proc Natl Acad Sci USA 104:712-716.

223. Ma H, et al. (2006) Exploring the energy landscape of a small RNA hairpin. J Am Chem Soc 128:1523-1530.

224. Hagen M, Kim B, Liu P, Friesner RA, & Berne BJ (2007) Serial replica exchange. J Phys Chem B, pp 1416-1423.

225. Menger M, Eckstein F, & Porschke D (2000) Dynamics of the RNA hairpin GNRA tetraloop. Biochemistry, pp 4500-4507.

226. Zhao L & Xia T (2007) Direct revelation of multiple conformations in RNA by femtosecond dynamics. J Am Chem Soc 129:4118-4119.

227. Singh G, Mémoli F, & Carlsson G. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. Eurographics Symposium on Point-Based Graphics.

228. Yao Y, et al. (2009) Topological methods for exploring low-density states in biomolecular folding pathways. J Chem Phys 130:144115.

229. Kim J, Doose S, Neuweiler H, & Sauer M (2006) The initial step of DNA hairpin folding: a kinetic analysis using fluorescence correlation spectroscopy. Nucleic Acids Res, pp 2516-2527.

230. Pitera JW, Haque I, & Swope WC (2006) Absence of reptation in the high-temperature folding of the trpzip2 beta-hairpin peptide. The Journal of chemical physics 124:141102.

231. Zhang W & Chen SJ (2002) RNA hairpin-folding kinetics. Proc Natl Acad Sci U S A 99:1931-1936.

232. Mohanty S & Hansmann UH (2006) Folding of proteins with diverse folds. Biophys J 91:3573-3578.

233. Liu P, Huang X, Zhou R, & Berne BJ (2006) Hydrophobic aided replica exchange: an efficient algorithm for protein folding in explicit solvent. J Phys Chem B 110:19018-19022.

234. Im W & Brooks CL (2004) De novo folding of membrane proteins: An exploration of the structure and NMR properties of the fd coat protein. Journal of Molecular Biology 337:513-519.

235. Roitberg AE, Okur A, & Simmerling C (2007) Coupling of replica exchange simulations to a non-Boltzmann structure reservoir. J Phys Chem B 111:2415-2418.

236. Pitera JW, Swope WC, & Abraham FF (2008) Observation of noncooperative folding thermodynamics in simulations of 1BBL. Biophysical journal 94:4837-4846.

237. Zhang W, Wu C, & Duan Y (2005) Convergence of replica exchange molecular dynamics. J Chem Phys 123:154105.

238. Periole X & Mark AE (2007) Convergence and sampling efficiency in replica exchange simulations of peptide folding in explicit solvent. J Chem Phys 126:014903.

239. Nymeyer H (2008) How efficient is replica exchange molecular dynamics? An analytic approach. J. Chem. Theory Comput. 4:626–636.

240. Zuckerman DM & Lyman E (2006) A Second Look at Canonical Sampling of Biomolecules Using Replica Exchange Simulation. J. Chem. Theory Comput. 2:1200-1202.

241. Zheng W, Andrec M, Gallicchio E, & Levy RM (2008) Simple continuous and discrete models for simulating replica exchange simulations of protein folding. J Phys Chem B 112:6083-6093.

242. Zheng W, Andrec M, Gallicchio E, & Levy RM (2007) Simulating replica exchange simulations of protein folding with a kinetic network model. Proc Natl Acad Sci U S A 104:15340-15345.

243. Sanbonmatsu KY & Garcia AE (2002) Structure of Met-enkephalin in explicit aqueous solution using replica exchange molecular dynamics. Proteins 46:225-234.

244. Nadler W & Hansmann UH (2007) Dynamics and optimal number of replicas in parallel tempering simulations. Phys Rev E Stat Nonlin Soft Matter Phys 76:065701.

245. Nadler W & Hansmann UH (2007) Optimizing replica exchange moves for molecular dynamics. Phys Rev E Stat Nonlin Soft Matter Phys 76:057102.

246. Hummer G & Kevrekidis IG (2003) Coarse molecular dynamics of a peptide fragment: Free energy, kinetics, and long-time dynamics computations. J Chem Phys 118:10762-10773.

247. Ytreberg FM & Zuckerman DM (2008) A black-box re-weighting analysis can correct flawed simulation data. Proc Natl Acad Sci U S A 105:7982-7987.

248. Levitt M (1972) Folding of nucleic acids. Ciba Found Symp 7:147-171.

249. Schutte C & Huisinga W (2003) Biomolecular conformations can be identified as metastable sets of molecular dynamics. Handbook of numerical analysis:699-744.

250. Schutte C & Huisinga W (2000) Biomolecular conformations as metastable sets of Markov chains. Proceedings of the 18th Annual Allerton Conference on Communication, Control, and Computing:1106-1115.

251. Noe F (2008) Probability distributions of molecular observables computed from Markov models. J Chem Phys 128:244103.

252. Zwanzig R (1995) Simple-Model of Protein-Folding Kinetics. Proceedings of the National Academy of Sciences of the United States of America 92:9801-9804.

253. Brzezniak Z & Zastawniak T (1999) Basic stochastic processes: a course through exercises (Springer, London; New York) pp x, 225.

254. Bacallado S, Chodera JD, & Pande V (2009) Bayesian comparison of Markov models of molecular dynamics with detailed balance constraint. J Chem Phys 131:045106.

255. Van der Spoel D, et al. (2005) GROMACS: Fast, flexible, and free. Journal of computational chemistry 26:1701-1718.

256. Still WC, Tempczyk A, Hawley RC, & Hendrickson T (1990) Semianalytical Treatment of Solvation for Molecular Mechanics and Dynamics. Journal of the American Chemical Society 112:6127-6129.

257. Lovell SC, et al. (2003) Structure validation by C alpha geometry: phi,psi and C beta deviation. Proteins-Structure Function and Genetics 50:437-450.

258. Berezhkovskii A, Hummer G, & Szabo A (2009) Reactive flux and folding pathways in network models of coarse-grained protein dynamics. J Chem Phys 130:205102.

259. Fersht AR (1997) Nucleation mechanisms in protein folding. Current Opinion in Structural Biology 7:3-9.

260. Karplus M & Weaver DL (1976) Protein-Folding Dynamics. Nature 260:404-406.

261. Weber M & Kube S (2005) Robust Perron Cluster Analysis for various applications in computational life science. Computational Life Sciences, Proceedings 3695:57-66.

262. Cornell WD, et al. (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 117:5179-5197.

263. Jorgensen WL, Chandrasekhar J, Madura JD, Impey RW, & Klein ML (1983) Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79:926-935.

264. Darden T, York D, & Pedersen L (1995) A smooth particle mesh Ewald potential. J. Chem. Phys. 103:3014-3021.

265. Hess B, Bekker H, Berendsen HJC, & Fraaije JGEM (1997) LINCS: a linear constraint solver for molecular simulations. J. Comput. Chem. 18:1463-1472.

266. Macke TJ & Case DA (1998) Modeling unusual nucleic acid structures. Molecular Modeling of Nucleic Acids 682:379-393.

267. Duan Y, et al. (2003) A Point-Charge Force Field for Molecular Mechanics Simulations of Proteins Based on Condensed-Phase Quantum Mechanical Calculations. J. Comp. Chem. 24:1999-2012.
