general notes on computational biophysics

8/13/2019 General notes on computational biophysics


Introduction to computational biophysics (CS 428) (3 credit points)
Instructor: Ron Elber ([email protected]) 5-7146

Pre-requisites: CS 100, MATH 293, 294, Physics 112, 213, CHEM 211 or equivalent. BioBM 330 recommended.

Tuesday and Thursday, lecture: 1:25-2:15, Olin Hall 245
Thursday, section: 2:30-3:20, Hollister Hall 306

Water:
1. Atomistic simulations. Fixed charge models. TIP3P and TIP4P. Energy minimization (steepest descent, conjugate gradient, Newton Raphson) and the geometry of the water dimer.
2. Water entropy and free energy. Calculation of partition functions. (Stochastic sampling, randomized algorithms, Metropolis algorithm, and Markov chains).
3. Hydrophobic effects and solvation of apolar molecules. (Enhanced sampling, multi-tempering, and multi-ensemble approaches).

Protein Folding:
1. Reduced representation of polymers and simulations of polymer collapse. (Lattice and continuous Monte Carlo simulations).
2. Simulation of kinetics and equilibrium Brownian dynamics.
3. Global optimization techniques, randomized algorithms, protein design.

Molecular dynamics:
1. Solving initial value problems. Extracting kinetic and thermodynamic properties.
2. Molecular dynamics with holonomic constraints (SHAKE).
3. Solvent and solutes, periodic boundary conditions, pressure and temperature controls.
4. Computing long-range forces (Ewald sum).
5. Correlation functions and experiments.
6. Transition state theory in the condensed phases.

Statistics:
1. Estimators: mean, standard deviation
2. Maximum likelihood
3. Confidence interval
4. chi-2 statistics
5. Regression
6. Goodness of fit

The students in the class must follow the code of academic integrity:
http://www.cs.cornell.edu/degreeprogs/ugrad/CSMajor/index.htm#ai



Water

Water is a very simple molecule that consists of three atoms: one oxygen and two hydrogen atoms. It nevertheless has some remarkable properties.

Very high melting and vaporization points for a liquid with so light a molecular mass. For example, CCl4 melts at 250 K, while water melts at 273 K.

Anomalous density behavior at freezing: water expands on freezing, so ice is lighter than liquid water (ice 0.92 g/mL, liquid water 1.0 g/mL).

Very strong electrical forces (special orientation forces?). High dielectric constant.

Can we explain macroscopic behavior with a simple microscopic model? We should also explain microscopic data, since macroscopic observations are too few.

Structures. Correlation functions in space and time. Spatial correlation functions (first and second peak) as determinants of interaction strengths. Time correlations measure (for example) dielectric response.

A water molecule is neutral but it carries a large dipole moment. The oxygen is highly negative and the hydrogens are positive. The dipole moment of the molecule (it is not linear) is 1.855 Debye (1 Debye = $3.33564 \times 10^{-30}$ C m). One electron charge separated by a distance of one angstrom corresponds to 4.803 Debye.

The geometry of a single water molecule is determined by the following parameters (using a model potential, TIP3P; the position of an atom as light as hydrogen is subject to considerable uncertainty in quantum mechanics): r(OH) = 0.9572 Å and a(HOH) = 104.52°, which is the angle consistent with the H-H distance of 1.5139 Å used in the restraint below. Since the hydrogen atoms are symmetric, the dipole moment is an experimental indication that the molecule is not linear. The individual molecules are set to be rigid (no deviations from the above internal coordinates are allowed). We shall use harmonic restraints to fix the geometry (at least until we learn how to handle holonomic constraints). For the internal water potential, we write

$$U_{internal} = k\,(r_{o h_1} - 0.9572)^2 + k\,(r_{o h_2} - 0.9572)^2 + k'\,(r_{h_1 h_2} - 1.5139)^2$$

where $k$ and $k'$ are constants chosen to minimize the changes in the distances compared to the ideal values; the value is chosen empirically to be 600 kcal/(mol Å$^2$).
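The harmonic restraint can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and using the same value of 600 kcal/(mol Å$^2$) for both force constants is an assumption, since the notes do not say which constant that value refers to.

```python
import math

R_OH = 0.9572   # ideal O-H distance in angstrom (from the notes)
R_HH = 1.5139   # ideal H-H distance in angstrom (from the notes)
K_RESTRAINT = 600.0  # kcal/(mol A^2); assumed here for both k and k'


def distance(a, b):
    """Euclidean distance between two points given as (x, y, z)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def internal_energy(o, h1, h2, k=K_RESTRAINT, k_prime=K_RESTRAINT):
    """Harmonic restraint energy of one water molecule."""
    return (k * (distance(o, h1) - R_OH) ** 2
            + k * (distance(o, h2) - R_OH) ** 2
            + k_prime * (distance(h1, h2) - R_HH) ** 2)
```

The energy is zero at the ideal geometry and grows quadratically with any distortion of the three distances.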

The potential between $n$ water molecules using the TIP3P model is

$$U_{between\ molecules} = \sum_{i>j} \left( \frac{A}{r_{ij}(oo)^{12}} - \frac{B}{r_{ij}(oo)^{6}} \right) + k \sum_{i>j} \sum_{k,l} \frac{c_{ik}\, c_{jl}}{r_{ik,jl}}$$

The indices $i, j$ are for water molecules; the indices $k, l$ are for the atoms of a single molecule (the prefactor $k$ in front of the second sum is the electrostatic constant, not the atom index). The first sum is over the oxygen atoms only and includes the repulsion between different molecules. This type of potential is also called Lennard Jones. The hydrogen atoms (deprived of electrons) are so small that their influence on the hard-core shape of the molecule is ignored. The hard core is set to be spherical with the oxygen atom at the center. Hydrogen atoms (and oxygen atoms too) come into play in the second term, the electrostatic interactions. The potential parameters are as follows

$A = 629{,}400$ kcal Å$^{12}$/mol
$B = 625.5$ kcal Å$^{6}$/mol
$c_O = -0.834$
$c_H = 0.417$
$k = 332.1$

The constant $k$ converts the electrostatic term to kcal/mol when charges are given in electron units and distances in Å.

To begin with we will be interested in the water dimer. The interaction between the two molecules can be written more explicitly as

$$U_{between\ molecules} = \frac{A}{r_{o_1 o_2}^{12}} - \frac{B}{r_{o_1 o_2}^{6}} + k \left[ \frac{c_o c_o}{r_{o_1 o_2}} + \frac{c_o c_h}{r_{o_1 h_{21}}} + \frac{c_o c_h}{r_{o_1 h_{22}}} + \frac{c_o c_h}{r_{o_2 h_{11}}} + \frac{c_o c_h}{r_{o_2 h_{12}}} + \frac{c_h c_h}{r_{h_{11} h_{21}}} + \frac{c_h c_h}{r_{h_{11} h_{22}}} + \frac{c_h c_h}{r_{h_{12} h_{21}}} + \frac{c_h c_h}{r_{h_{12} h_{22}}} \right]$$
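As a rough illustration of the dimer energy, the sketch below sums the single oxygen-oxygen Lennard Jones term and the nine charge-charge terms. The function and variable names are hypothetical; the parameter values are the ones quoted in these notes, with the oxygen charge taken as negative because the text describes the oxygen as highly negative.

```python
import math

A = 629400.0     # kcal A^12 / mol
B = 625.5        # kcal A^6 / mol
C_O = -0.834     # oxygen charge (sign assumed negative, see lead-in)
C_H = 0.417      # hydrogen charge
K_COULOMB = 332.1


def distance(a, b):
    """Euclidean distance between two points given as (x, y, z)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def dimer_energy(mol1, mol2):
    """Interaction energy of two waters; each molecule is (o, h1, h2)."""
    r_oo = distance(mol1[0], mol2[0])
    energy = A / r_oo ** 12 - B / r_oo ** 6      # oxygen-oxygen only
    charges = (C_O, C_H, C_H)
    for qa, atom_a in zip(charges, mol1):        # 3 x 3 = 9 Coulomb terms
        for qb, atom_b in zip(charges, mol2):
            energy += K_COULOMB * qa * qb / distance(atom_a, atom_b)
    return energy
```

The internal restraint energy of each molecule is not included here; it has to be added separately to obtain the total energy.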

For the first monomer we use $o_1$, $h_{11}$ and $h_{12}$ to denote the corresponding three atoms. For the second monomer we use instead $o_2$, $h_{21}$ and $h_{22}$. Of course the modeling of the interaction between molecules must come together with the internal potential keeping the geometry fixed.

The water dimer has an optimal geometry (a minimum energy configuration) that we now wish to determine. This is a question of how to arrange negatively charged atoms close to positively charged atoms and keep similar charges away from each other. Together with the fixed geometry of the individual molecules this is a frustrated system (you cannot make everybody happy; a classic example is that of three spins). The problem also has implications for docking and drug design (determining the orientation of two interacting molecules) and for the range of complementing interactions and shape matching.

What is the dimensionality of the problem? We have three atoms in each of the water molecules, making a total of eighteen degrees of freedom. The actual number relevant for the docking is much smaller. Of the eighteen, six are translations and rotations of the whole system (not changing the relative orientation) and another six are removed by the internal constraints on the structure of each molecule. Hence, only six degrees of freedom remain when we attempt to match the two (rigid) water molecules together.

Working directly with these six degrees of freedom is possible but requires considerable work for implementation and makes it necessary to write the energy in considerably less convenient coordinates (the most convenient are just the Cartesian coordinates of the individual atoms). So rather than being clever at the very first step, we will use the simplest approach. In the simplest approach we attempt to optimize the energy of the water dimer as a function of all degrees of freedom. This requires more work from the computer, but less work from us. Minimizing the energy (and finding the best structure) in this large 18 dimensional space is something we cannot do by intuition, and we must design an algorithm to do it.

The algorithm of steepest descent is one way of doing it, and this is the first approach that we are going to study. Consider a guess coordinate vector for the water dimer that we denote by $R$ (capitals denote arrays or vectors, lower case denotes elements of a vector). It is a vector of length 18 and includes all the x, y, z coordinates of the six atoms. What is the best way of organizing the coordinates? In some programs the vector of coordinates is decomposed into three Cartesian vectors, $R = (X, Y, Z) = (x_1, ..., x_6, y_1, ..., y_6, z_1, ..., z_6)$. This setup is, however, less efficient from the viewpoint of memory architecture. When we compute distances between atoms (the main computation required for the energy calculations) we require the (x, y, z) coordinates of the two atoms. By aggregating all the Cartesian components together (e.g. all the x-s coming first) we may create cache misses. Finding the Cartesian components of one atom makes it necessary to walk quite far along the array: every time we need the next Cartesian component we must step through n numbers, where n is the number of atoms. This is not a serious problem for 18 coordinates, but for hundreds of thousands or millions of atoms (as in some simulations) it might be. A more efficient structure that we shall adopt is therefore

$$R = (x_1, y_1, z_1, ..., x_6, y_6, z_6)$$

where the coordinates that belong to a single particle are kept together, making it less likely to have cache misses.
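A short sketch of how the interleaved layout is used in practice. The helper names are hypothetical; a flat Python list stands in for the coordinate array, and atoms are indexed from zero.

```python
import math


def atom(coords, i):
    """Return (x, y, z) of atom i from an interleaved coordinate array."""
    base = 3 * i   # the three components of one atom are adjacent
    return coords[base], coords[base + 1], coords[base + 2]


def pair_distance(coords, i, j):
    """Distance between atoms i and j in the interleaved layout."""
    xi, yi, zi = atom(coords, i)
    xj, yj, zj = atom(coords, j)
    return math.sqrt((xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2)
```

For example, with `coords = [0.0, 0.0, 0.0, 3.0, 4.0, 0.0]` the call `pair_distance(coords, 0, 1)` reads six adjacent numbers and never jumps across the array.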

Starting from an initial configuration $R^0$ we want to find a lower energy configuration. The argument is that low energy configurations are more likely to be relevant. This is true in general but is not always sufficient. We shall come back to this point when discussing entropy, temperature, and how to simulate thermal fluctuations.

For the moment assume that we are going to make a small step, a displacement $\Delta R = R - R^0$ of norm $|\Delta R| = \sqrt{\sum_i (q_i - q_i^0)^2}$. We have used the so-called norm-2 distance measure, which is very common in computational biophysics. We have also used another variable, $q_i$, which is any of the Cartesian coordinates of the system. For a system with N atoms, there are 3N $q_i$-s. It is useful to know that a more general formulation would be $\left( \sum_i |q_i - q_i^0|^n \right)^{1/n}$. These different distances emphasize alternative aspects of the


$$\Delta R = -\varepsilon \frac{\nabla U}{|\nabla U|}$$

We have not discussed so far how to choose the step length $\varepsilon$, except to argue that it should be small. We expect that if the gradient is very small then we are near a minimum (actually a stationary point, where $\nabla U = 0$). In that case only a small step should be used. On the other hand, if the gradient is large, a somewhat larger step could be taken since we anticipate a significant change in the function (for the better). Of course the step still cannot be too large, since our arguments are based on a linear expansion of the function. Based on the above arguments, it is suggestive to use the following (simpler) expression for the displacement, which is directly proportional to the length of the gradient vector.

$$\Delta R = -\varepsilon \nabla U$$

We can imagine now a numerical minimization process that goes as follows:
0. Init: $R = R^0$
1. Compute $\nabla U(R)$
2. Compute a step $\Delta R = -\varepsilon \nabla U$ and a new coordinate set $R = R + \Delta R$
3. Check for convergence (is $|\nabla U|$ small enough?). If converged, stop. Otherwise return to 1.
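The loop above can be sketched as follows. This is a minimal illustration, not the course code: the function name, the default step length, the tolerance and the step limit are all assumptions, and the convergence test is done on the gradient norm just before each step.

```python
import math


def steepest_descent(grad, r0, eps=0.1, tol=1e-8, max_steps=10000):
    """Minimize by repeated steps R <- R - eps * grad U(R).

    grad: function returning the gradient as a list
    r0:   initial coordinate list
    """
    r = list(r0)                                  # step 0: init
    for _ in range(max_steps):
        g = grad(r)                               # step 1: gradient
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break                                 # step 3: converged
        r = [ri - eps * gi for ri, gi in zip(r, g)]   # step 2: move
    return r
```

As a usage example, for the two-dimensional test function $U = (x-1)^2 + 2(y+2)^2$ with gradient `lambda r: [2*(r[0]-1), 4*(r[1]+2)]`, the loop started from `[0.0, 0.0]` slides down to the single minimum at (1, -2).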

This procedure finds for us a local minimum, a minimum to which we can slide down directly from an initial guess. It will not provide a solution to the global optimization problem, finding the minimum that is the lowest of them all.

We can take the above expression one step further and write it down as a differential equation as a function of a progress variable $\tau$. This is not the most efficient way of minimizing the structure under consideration but it will help us understand the process better. We write

$$\frac{dR}{d\tau} = -\nabla U$$

where the (dummy) variable $\tau$ varies from zero to $\infty$. As $\tau \to \infty$ we approach the nearby stationary point. Note that the potential $U(R(\tau))$ is a monotonically non-increasing function of $\tau$. This is easy to show as follows. Multiplying both sides of the equation by $(dR/d\tau)^t$, we have

$$\left( \frac{dR}{d\tau} \right)^t \frac{dR}{d\tau} = -\left( \frac{dR}{d\tau} \right)^t \nabla U = -\sum_i \frac{dq_i}{d\tau} \frac{dU}{dq_i}$$

Assuming that the potential does not have an explicit dependence on $\tau$ (otherwise $\tau$ would be more than a dummy variable, which contradicts our initial assumptions), we can write the final expression in a more compact and illuminating form

$$\left( \frac{dR}{d\tau} \right)^t \frac{dR}{d\tau} = -\frac{dU}{d\tau} \ge 0, \qquad \frac{dU}{d\tau} \le 0$$

Hence, as we progress along the solution of the differential equation, the potential energy decreases in the best case and does not increase in the worst case. There can be different variants of how to choose the norm of the step, $\varepsilon$; we already mentioned two of them. Another important variant is to make the step size a parameter and to optimize it for a given search direction defined by $\nabla U$. One approach would be to perform a search for a minimum along the line defined by $\nabla U$, i.e. we seek an $\varepsilon$ such that

$$\frac{d}{d\varepsilon} U(R - \varepsilon \nabla U) = 0$$

as a function of the single scalar variable $\varepsilon$. This one-dimensional minimization makes sense if the calculation of the gradient can be avoided in the one-dimensional minimization, and if search steps in the one-dimensional minimization can be computed more efficiently than the determination of the direction. For example, in the one-dimensional optimization only calculations of the energy $U(R - \varepsilon \nabla U)$ can be used, together with the approach of interval halving. We guess a given $\varepsilon_0$; if the energy goes up we try half of the previous size, $\varepsilon_0/2$, and if it goes down we double it. We continue halving an interval that contains a minimum until we hit a minimum with the desired properties (halving on the left or halving on the right results in an increasing energy). This scheme assumes that the calculation of the potential energy is a lot more efficient than the calculation of the gradient. This is, however, not the case here, and for the task at hand the line search option of the steepest descent algorithm is not efficient.
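A simplified sketch of an energy-only search along a fixed direction is given below. It keeps only the halve-if-up and double-if-down rule and stops at the first step length where doubling no longer helps; it does not refine the bracket further, so it is a cruder procedure than the full interval halving described above. The function name and the iteration limit are assumptions.

```python
def line_search(energy, r, direction, eps0=1.0, max_iter=60):
    """Pick a step length along `direction` using energy values only.

    direction is typically minus the gradient at r.
    Returns (eps, energy at r + eps * direction).
    """
    def energy_at(eps):
        return energy([ri + eps * di for ri, di in zip(r, direction)])

    e_start = energy_at(0.0)
    eps = eps0
    e_eps = energy_at(eps)
    iterations = 0
    # Energy did not go down: halve the step until it does.
    while e_eps >= e_start and iterations < max_iter:
        eps *= 0.5
        e_eps = energy_at(eps)
        iterations += 1
    # Energy goes down: keep doubling while that still improves it.
    while iterations < max_iter:
        e_next = energy_at(2.0 * eps)
        if e_next >= e_eps:
            break
        eps, e_eps = 2.0 * eps, e_next
        iterations += 1
    return eps, e_eps
```

Note that no gradient is evaluated inside the search; the gradient is needed only once, to define the direction.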

We note that compared to other optimization algorithms (such as conjugate gradient, which we shall not discuss, and the Newton-Raphson algorithm, which we will) the steepest descent algorithm is considerably slower. However, it is a lot more stable than (for example) the Newton Raphson approach. It is common practice in molecular simulations to start with a crude minimizer like the steepest descent algorithm described above (if the initial structure is pretty bad), and then to refine the coordinates using something like the Newton Raphson algorithm, which is on our agenda.

A few words about computing potential derivatives, which is of prime importance for minimization algorithms in high dimensions. Effective minimizers in high dimensions without derivatives are unheard of.

Overall, the potential derivatives are dominated by calculations of distances. It is therefore useful to consider the derivative of a distance between two particles as a function of the Cartesian coordinates.


$$r_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$$

$$\frac{dr_{ij}}{dw_i} = \frac{w_i - w_j}{\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}} = \frac{w_i - w_j}{r_{ij}}, \qquad w = x, y, z$$

It is therefore straightforward to compute all kinds of derivatives with extensive use of the chain rule. For example, the Lennard Jones term gives

$$\frac{d}{dw_k} \sum_{i>j} \left( \frac{A}{r_{ij}^{12}} - \frac{B}{r_{ij}^{6}} \right) = \sum_{j \ne k} \left( -\frac{12A}{r_{kj}^{13}} + \frac{6B}{r_{kj}^{7}} \right) \frac{dr_{kj}}{dw_k} = \sum_{j \ne k} \left( -\frac{12A}{r_{kj}^{14}} + \frac{6B}{r_{kj}^{8}} \right) (w_k - w_j)$$
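The Lennard Jones energy and gradient can be sketched directly from the expression with powers 14 and 8, working with the squared distance only. In this illustration every particle in the interleaved coordinate list is treated as a Lennard Jones site (in the water model these would be the oxygens); the function name is hypothetical and the default parameters are the values quoted in these notes.

```python
A = 629400.0   # kcal A^12 / mol
B = 625.5      # kcal A^6 / mol


def lj_energy_and_gradient(coords, a=A, b=B):
    """Lennard Jones energy and gradient for sites stored as x1,y1,z1,x2,..."""
    n = len(coords) // 3
    energy = 0.0
    grad = [0.0] * len(coords)
    for i in range(n):
        for j in range(i):
            d = [coords[3 * i + c] - coords[3 * j + c] for c in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2   # no square root needed
            inv_r6 = 1.0 / r2 ** 3
            inv_r12 = inv_r6 * inv_r6
            energy += a * inv_r12 - b * inv_r6
            # (-12A / r^14 + 6B / r^8), to be multiplied by (w_i - w_j)
            factor = (-12.0 * a * inv_r12 + 6.0 * b * inv_r6) / r2
            for c in range(3):
                grad[3 * i + c] += factor * d[c]
                grad[3 * j + c] -= factor * d[c]
    return energy, grad
```

Each pair is visited once and contributes equal and opposite terms to the gradient entries of its two particles.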

Note that the expression above depends only on even powers of the distance (14 and 8), which is good news, meaning that no square roots are required. Square roots are the curse of simulations and are much more expensive to compute compared to add, multiply, etc. Unfortunately, electrostatic interactions are more expensive to compute. The potential and its derivatives require square root calculations. Here is the expression for a set of independent atoms (for convenience we forget about molecules here)

$$\frac{d}{dw_k} \sum_{i>j} \frac{c_i c_j}{r_{ij}} = -\sum_{i \ne k} \frac{c_i c_k}{r_{ik}^2} \frac{w_k - w_i}{r_{ik}} = -\sum_{i \ne k} \frac{c_i c_k\, (w_k - w_i)}{r_{ik}^3}$$

Here is a slightly tricky question. To compute the energy (which is a single scalar) for N atoms we need to calculate about $N^2/2$ terms (assuming all unique distances contribute to the energy), and then add them up. How many terms do we need to compute for the gradient of the potential?

Calculations of potential gradients are a major source of errors in programming molecular modeling code. It is EXTREMELY useful to check the analytical derivatives against numerical derivatives computed by finite difference; the two should agree to at least a few digits. For example

$$\frac{dU}{dq_k} \cong \frac{U(q_1, ..., q_k + \Delta/2, ..., q_N) - U(q_1, ..., q_k - \Delta/2, ..., q_N)}{\Delta}$$

For the water system the step $\Delta$ can be $10^{-6}$ (using double precision) and the expected agreement is at least 3-4 digits. A larger discrepancy suggests an error.
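The finite difference check can be sketched as below. The function name is hypothetical; `energy` is any function that takes the coordinate list and returns a scalar, and the default step follows the value suggested in the text.

```python
def numerical_gradient(energy, q, delta=1e-6):
    """Central finite-difference estimate of dU/dq_k for every k."""
    grad = []
    for k in range(len(q)):
        plus = list(q)
        minus = list(q)
        plus[k] += delta / 2.0     # q_k + delta/2
        minus[k] -= delta / 2.0    # q_k - delta/2
        grad.append((energy(plus) - energy(minus)) / delta)
    return grad
```

In practice one compares each entry with the analytical gradient and inspects the relative difference; an entry that disagrees in the first few digits usually points to a sign or index error in the analytical formula.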

This concludes (in principle) our discussion of the steepest descent minimization algorithm. From the above discussion it is quite clear that minimization in the neighborhood of a stationary point of the potential (like a minimum) is difficult. Near the minimum the gradient of the potential is close to zero, is subject to numerical error, and supports only extremely small steps, which makes numerical convergence harder. In that sense the algorithm we discuss next is complementary to the steepest descent approach. It works well in the neighborhood of a minimum. It does not work so well if the starting structure is very far from a minimum, since the large step taken by this algorithm (Newton Raphson) relies on the correctness of the quadratic expansion of the potential energy surface near a minimum.

We start by considering a linear expansion of the potential derivatives at the current position $R$ in the neighborhood of the desired minimum $R_m$. At the minimum, the gradient is (of course) zero. We have

$$0 = [\nabla U(R_m)]_i \cong [\nabla U(R)]_i + \sum_j \frac{d^2 U}{dq_i\, dq_j} (q_{m,j} - q_j)$$

The entity that we wish to determine from the above equation is $R_m$, the position of the minimum. The square bracket with a subscript, $[...]_j$, denotes the j-th vector element. The last term is a multiplication of a matrix by a vector. From now onwards we shall denote the second derivative matrix by $U''$. We are attempting to find the minimum by expanding the force linearly in the neighborhood of the current point. This is one step up in expansion compared to the steepest descent minimization; however, it is not sufficient in general. In the simplest version of the Newton Raphson (NR) approach very large steps are allowed. Large steps can clearly lead to problems if the linear expansion is not valid. Nevertheless, the expansion is expected to be valid if we are close to a minimum, since any function in the neighborhood of a minimum can be expanded (accurately) up to a second order term

$$U(R) \cong U(R_m) + \frac{1}{2} (R - R_m)^t\, U''\, (R - R_m)$$

Note that we did not write down the first order derivatives (gradient) since they are zero at the minimum. This is one clear advantage of NR with respect to the steepest descent minimizer (SDM). SDM relies on the first derivatives only, derivatives that vanish in the neighborhood of a minimum. It is therefore difficult for SDM to make progress in these circumstances, while NR can do it in one single shot, as we see below. We write again the equation for the gradient in matrix form

$$0 = \nabla U(R_m) \cong \nabla U(R) + U''(R)\, (R_m - R)$$

which we can formally solve (for $R_m$) as

$$R_m = R - \left( U''(R) \right)^{-1} \nabla U(R)$$

The matrix $(U'')^{-1}$ is the inverse of the matrix $U''$, namely $(U'')^{-1} U'' = I$, where $I$ is the identity matrix. In principle there is nothing in the above equation that determines the norm of the step $R_m - R$ that we should take. If the system is close to quadratic, the second derivative matrix is roughly a constant and a reasonably large step toward the minimum can be taken without significantly violating the validity of the above equation. The ability to take a large step in a quadratic-like system is a clear advantage of NR compared to SDM. However, for systems that are not quadratic the matrix is not a constant and only small steps (artificially enforced) should be used. In our water system NR should be used with care and only sufficiently close to a minimum. There are a few technical points that should specifically concern us in the water dimer optimization problem. We will be concerned with the following:
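To see the one-shot behavior on a quadratic function, here is a two-dimensional toy sketch of a single Newton Raphson step. The function name is hypothetical, and the 2x2 linear system is solved by Cramer's rule only to keep the example self-contained; it is not how one would treat the 18 dimensional water dimer.

```python
def newton_step_2d(grad, hess, r):
    """One Newton Raphson step R_m = R - (U'')^{-1} grad U for two variables.

    grad(r) returns [g0, g1]; hess(r) returns [[a, b], [c, d]].
    """
    g = grad(r)
    (a, b), (c, d) = hess(r)
    det = a * d - b * c
    # Solve U'' * step = grad U by Cramer's rule.
    step_x = (d * g[0] - b * g[1]) / det
    step_y = (a * g[1] - c * g[0]) / det
    return [r[0] - step_x, r[1] - step_y]
```

For $U = x^2 + xy + 2y^2 - 3x$ the gradient is $(2x + y - 3,\ x + 4y)$ and the second derivative matrix is constant, so a single step from any starting point lands on the minimum at $(12/7, -3/7)$.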

1. Does the second derivative matrix $U''$ have an inverse, and what can we do if it does not?
2. How do we find the inverse (or solve the above linear equations)?

The bad news is that the second derivative matrix $U''$ for molecular systems (such as the water dimer) does not have, in general, an inverse. The problem is the six degrees of freedom that we mentioned earlier and that do not affect the potential energy: three overall translations and three overall rotations have zero eigenvalues, making the matrix singular. This is easy to see as follows.

Let the eigenvectors of $U''$ be $e_i$ and the eigenvalues $\lambda_i$. It is possible to write the matrix $U''$ as the following sum

$$U'' = \sum_i \lambda_i\, e_i e_i^t$$

where the outer product

$$e_i e_i^t = \begin{pmatrix} e_{i1} \\ e_{i2} \\ \vdots \\ e_{iN} \end{pmatrix} \begin{pmatrix} e_{i1} & e_{i2} & \cdots & e_{iN} \end{pmatrix} = \begin{pmatrix} e_{i1} e_{i1} & e_{i1} e_{i2} & \cdots & e_{i1} e_{iN} \\ e_{i2} e_{i1} & e_{i2} e_{i2} & \cdots & e_{i2} e_{iN} \\ \vdots & \vdots & & \vdots \\ e_{iN} e_{i1} & e_{iN} e_{i2} & \cdots & e_{iN} e_{iN} \end{pmatrix}$$

generates a symmetric N x N matrix. Since the vectors $e_i$ are orthogonal to each other ($e_i^t e_j = \delta_{ij}$), it is trivial to write down the inverse in this case

$$(U'')^{-1} = \sum_i \frac{e_i e_i^t}{\lambda_i}$$

However, this expression is true only if all $\lambda_i$ are different from zero. The eigenvectors represent directions of motion and the eigenvalues are associated with the cost in energy for moving in a direction determined by the corresponding eigenvector. However, moving along the direction of global translation (or global rotation) does not change the energy; therefore the corresponding eigenvalues must be equal to zero, and the inverse is impossible to get.

One way of getting around this problem is by shifting the eigenvalues of the offending eigenvectors. If we know in advance the six offending eigenvectors we can raise their eigenvalues to very high values (instead of zero) by adding to the matrix outer products of these vectors multiplied by a very high value (see below). The contribution of eigenvectors with very high eigenvalues will diminish when we compute the inverse, since the inverse is obtained by dividing by the corresponding eigenvalues. We do not have to find all the eigenvectors and the eigenvalues as written above; it is sufficient to affect the few specific eigenvectors and define a new second derivative matrix to obtain a well behaved inverse. The question that remains (of course) is how to find these six eigenvectors. The most straightforward (and inefficient) way to do it is to actually compute the eigenvectors and eigenvalues by a matrix diagonalization procedure. Luckily we do not have to do that, since the translation and rotation eigenvectors are known from the Eckart conditions. We have for translation

$$\sum_i m_i\, (r_i - r_i^0) = 0, \qquad r_i = (x_i, y_i, z_i)$$

and for rotation

$$\sum_i m_i\, r_i^0 \times (r_i - r_i^0) = 0$$

(Note: mention the possibility of using Lagrange multipliers here.)

The last multiplication is a vector product, and the difference $r_i - r_i^0$ is assumed to be small. The coordinate vectors $r_i^0$ are reference vectors used to define the coordinate system and are constants.

For the record, we write the vector product explicitly

$$r_i^0 \times r_i = \begin{vmatrix} e_x & e_y & e_z \\ x_i^0 & y_i^0 & z_i^0 \\ x_i & y_i & z_i \end{vmatrix} = e_x (y_i^0 z_i - z_i^0 y_i) - e_y (x_i^0 z_i - z_i^0 x_i) + e_z (x_i^0 y_i - y_i^0 x_i)$$

The two equations define six constraints, since they are vector equations and each of the vectors has three components x, y, z. To obtain the eigenvectors associated with these constraints we need to compute the gradients of the above constraints. We have three vectors for translations that we denote by $e_{tx}, e_{ty}, e_{tz}$, and three vectors for the rotations, $e_{rx}, e_{ry}, e_{rz}$. Here are the translation vectors

$$e_{tx} = (m_1, 0, 0, m_2, 0, 0, ..., m_N, 0, 0)$$
$$e_{ty} = (0, m_1, 0, 0, m_2, 0, ..., 0, m_N, 0)$$
$$e_{tz} = (0, 0, m_1, 0, 0, m_2, ..., 0, 0, m_N)$$

And here are the rotations

$$e_{rx} = (0, -m_1 z_1^0, m_1 y_1^0, 0, -m_2 z_2^0, m_2 y_2^0, ..., 0, -m_N z_N^0, m_N y_N^0)$$
$$e_{ry} = (m_1 z_1^0, 0, -m_1 x_1^0, m_2 z_2^0, 0, -m_2 x_2^0, ..., m_N z_N^0, 0, -m_N x_N^0)$$
$$e_{rz} = (-m_1 y_1^0, m_1 x_1^0, 0, -m_2 y_2^0, m_2 x_2^0, 0, ..., -m_N y_N^0, m_N x_N^0, 0)$$
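A one-dimensional toy version of the eigenvalue shift may help. The sketch below builds the normalized translation vector from the masses and adds its outer product, multiplied by a large number, to the second derivative matrix. The function name and the size of the shift are assumptions, and the example assumes equal masses so that the vector built from the masses coincides with the zero-eigenvalue direction of the matrix.

```python
import math


def shift_translation(hessian, masses, shift=1.0e6):
    """Add shift * e e^t to the matrix, with e the normalized translation vector.

    One-dimensional toy: one coordinate per particle, e proportional to
    (m_1, m_2, ..., m_N).
    """
    n = len(masses)
    norm = math.sqrt(sum(m * m for m in masses))
    e = [m / norm for m in masses]
    return [[hessian[i][j] + shift * e[i] * e[j] for j in range(n)]
            for i in range(n)]
```

For two particles joined by a unit spring the second derivative matrix is `[[1, -1], [-1, 1]]`, whose determinant is zero because a common translation costs no energy. After the shift the determinant is about $2 \times 10^6$, so the matrix can be inverted, and the translation direction contributes almost nothing to the inverse.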



close to a minimum such that all the non-zero eigenvalues are positive. In the case thatthe quadratic expansion is accurate (and in sharp contrast to SDM) the minimization willconverge in one step.

What will happen to our solution if the eigenvalues are negative?

So far we made an important step forward establishing that the inverse of the adjustedsecond derivative matrix is likely to exist. There remains the problem of how todetermine the inverse efficiently.

In fact we do not need to compute an inverse explicitly, since all we need is to solve a linear equation of the type $Ax = b$, where $A$ and $b$ are a known matrix and vector, and $x$ is the unknown vector that we seek. We start with a subset of problems that is easy to understand and to solve (triangular problems) and then we work our way up to the full Gaussian elimination.

Triangular problems

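Example

$$\begin{pmatrix} a_{11} & 0 & 0 \\ a_{21} & a_{22} & 0 \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$

which is rather easy to solve. We can immediately write $x_1 = b_1 / a_{11}$. Using the (now) known value of $x_1$ we can write for $x_2$: $x_2 = (b_2 - a_{21} x_1) / a_{22}$. Similarly we can write for $x_3$: $x_3 = (b_3 - a_{31} x_1 - a_{32} x_2) / a_{33}$.

For the general case we can write an implicit solution (in terms of the earlier $x_j$, $j = 1, ..., i-1$)

$$x_i = \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j \right) / a_{ii}$$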
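The general formula for the lower triangular case translates into a short loop. This is a minimal sketch with a hypothetical function name; the matrix is a list of rows and the diagonal elements are assumed to be non-zero.

```python
def forward_substitution(a, b):
    """Solve a x = b for a lower triangular matrix a (list of rows)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # Sum over the already known x_j, j < i.
        known = sum(a[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - known) / a[i][i]
    return x
```

For example, `forward_substitution([[2, 0, 0], [1, 3, 0], [4, 5, 6]], [2, 7, 32])` walks down the rows and recovers the solution (1, 2, 3).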

Note that a similar procedure applies to the upper triangular matrix

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$

To solve a general linear problem we search for a way of transforming the matrix to a triangular form (which we already know how to solve). Formally, we seek the so-called LU decomposition, in which the general matrix $A$ is decomposed into a lower triangular matrix $L$ and an upper triangular matrix $U$ ($A = LU$). Note that if such a decomposition is known, we can solve the linear problem in two steps.

Step 1.

$$Ax = b$$
$$LUx = b$$
$$L(Ux) = b$$
$$Ly = b \quad \text{(find } y \text{ using the lower triangular matrix } L \text{)}$$

Step 2.

$$Ux = y \quad \text{(find } x \text{ using the upper triangular matrix } U \text{)}$$

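A way of implementing the above idea in practice is using Gaussian elimination. Gaussian elimination is an action that leads to the LU decomposition discussed above, even if the analogy is not obvious. In this course we will not prove the equivalence.

Gaussian elimination

Consider the following system of linear equations

$$\begin{pmatrix} \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times \end{pmatrix} \begin{pmatrix} \times \\ \times \\ \times \\ \times \\ \times \end{pmatrix} = \begin{pmatrix} \times \\ \times \\ \times \\ \times \\ \times \end{pmatrix}$$

where $\times$ denotes any number different from zero. We can eliminate the unknown $x_1$ from rows 2 to n (the general matrix is of size n x n). We multiply the first row by $a_{i1}/a_{11}$ and subtract the result from row $i$. By repeating the process $n-1$ times we obtain the following (adjusted) set of linear equations that has the same solution

$$\begin{pmatrix} \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \end{pmatrix} \begin{pmatrix} \times \\ \times \\ \times \\ \times \\ \times \end{pmatrix} = \begin{pmatrix} \times \\ \times \\ \times \\ \times \\ \times \end{pmatrix}$$

We can work on the newly obtained matrix in a similar way to eliminate $x_2$ from rows 3 to n. This we do by multiplying the second row by $a_{i2}/a_{22}$ and subtracting the results from rows 3 to n. The (yet another) new matrix and linear equations will be of the form

$$\begin{pmatrix} \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times \\ 0 & 0 & \times & \times & \times \\ 0 & 0 & \times & \times & \times \\ 0 & 0 & \times & \times & \times \end{pmatrix} \begin{pmatrix} \times \\ \times \\ \times \\ \times \\ \times \end{pmatrix} = \begin{pmatrix} \times \\ \times \\ \times \\ \times \\ \times \end{pmatrix}$$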



It should be obvious how to proceed with the elimination and to create (an upper)triangular matrix that we know by now how to solve.
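The elimination followed by the solution of the resulting upper triangular problem can be sketched as below. The function name is hypothetical, and the sketch assumes that every diagonal element met during the elimination is non-zero (no row exchanges are performed, as in the description above).

```python
def gaussian_elimination(a, b):
    """Solve a x = b by elimination to upper triangular form, then back substitution."""
    n = len(b)
    a = [row[:] for row in a]   # work on copies, leave the input untouched
    b = b[:]
    for col in range(n - 1):
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - known) / a[i][i]
    return x
```

For example, with the matrix `[[2, 1, 1], [4, 3, 3], [8, 7, 9]]` and right-hand side `[7, 19, 49]` the multipliers are 2, 4 and then 3, and the solution is (1, 2, 3).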



Random numbers

Clearly a number that is produced on the computer in a deterministic way cannot be trulyrandom. So a valid question is what do we mean by a random number and how can wetest it?

We consider pseudo-random numbers that share some properties with random numbersbut obviously are reproducible on the computers and therefore are not truly random.

A useful definition of true random numbers is lack of correlations. If we consider the product of two random numbers $r_1$ and $r_2$, $r_1 \cdot r_2$, and we average over possible values of $r_1$ and $r_2$, we should have $\langle r_1 r_2 \rangle = \langle r_1 \rangle \langle r_2 \rangle$.

So a test for a random number generator would be

$$\langle r_1 \cdots r_N \rangle = \prod_{i=1}^{N} \langle r_i \rangle \ ?$$

Essentially all the existing random number generators eventually fail this kind of test. The common generators are cyclic in nature. There is a large integer $L$ such that $r_{i+L} = r_i$; hence the number of random numbers that can be generated is finite.

Widely used random number generators are based on the following simple (and fast) operations:

$$I_{k+1} = \alpha I_k + \beta \pmod{m}$$

The integers $I_k$ are between zero and m-1. Dividing by m provides a floating point number between 0 and 1. If all is well, the resulting sequence is uniformly distributed on the interval [0,1].

Example: $\alpha = 2{,}147{,}437{,}301$, $\beta = 453{,}816{,}981$, $m = 2^{32}$
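The generator can be sketched in a few lines. The function name and the argument names `a` (multiplier), `c` (increment) and `m` (modulus) are hypothetical; the default values are the three numbers of the example above, read as multiplier 2,147,437,301, increment 453,816,981 and modulus $2^{32}$.

```python
def lcg(seed, count, a=2147437301, c=453816981, m=2 ** 32):
    """Return `count` pseudo-random floats in [0, 1) from I <- (a*I + c) mod m."""
    values = []
    state = seed
    for _ in range(count):
        state = (a * state + c) % m   # integer between 0 and m-1
        values.append(state / m)      # scale to the unit interval
    return values
```

The same seed always reproduces the same sequence, which is exactly why such numbers are called pseudo-random; it is also what makes simulations that use them repeatable.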

Using random numbers suggests a procedure to estimate

To improve the quality and the randomness of numbers generated by the above procedure it is useful to keep a long vector of random numbers and to shuffle them (randomly). In MATLAB, rand(n,m) provides an n x m matrix of random numbers.

The above procedure provides random numbers generated from a uniform distribution.Can we generate random numbers from other probability distribution (e.g. normal)?


A general procedure for doing it is based on the probability function. Let $p(x)\,dx$ be the probability of finding a point between $x$ and $x + dx$. Suppose that we want to generate a series of points $x$ and then compute a function of these points, $y(x)$. What will be the distribution of the y-s? It will be connected to the probability function of the x-s:

$$|p(x)\,dx| = |p(y)\,dy|$$

$$p(y) = p(x) \left| \frac{dx}{dy} \right|$$

Example: suppose $y(x) = -\log_e(x)$, with $x$ uniformly distributed between 0 and 1, so that $p(x) = 1$. Then

$$p(y)\,dy = \left| \frac{dx}{dy} \right| dy = e^{-y}\,dy$$
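The logarithm example can be tried out numerically with the sketch below, which uses Python's standard `random` module as the source of uniform numbers. The function name is hypothetical. The code takes the logarithm of 1 - u rather than u, which is equally uniform and avoids log(0), since `random.random()` can return exactly 0 but never 1.

```python
import math
import random


def sample_exponential(n, rng=random.random):
    """Draw n samples with density exp(-y), y >= 0, via y = -log(x)."""
    return [-math.log(1.0 - rng()) for _ in range(n)]
```

The sample mean of a long run should be close to 1, the mean of the density $e^{-y}$.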

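Another example: Gaussian

We want

$$p(y)\,dy = \frac{1}{\sqrt{2\pi}} \exp(-y^2/2)\,dy$$

Select

$$y_1 = \sqrt{-2 \log x_1}\, \cos(2\pi x_2)$$
$$y_2 = \sqrt{-2 \log x_1}\, \sin(2\pi x_2)$$

The inverse relations are

$$x_1 = \exp\left( -(y_1^2 + y_2^2)/2 \right)$$
$$x_2 = \frac{1}{2\pi} \arctan\left( \frac{y_2}{y_1} \right)$$

and the Jacobian of the transformation is

$$\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = -\left[ \frac{1}{\sqrt{2\pi}} \exp(-y_1^2/2) \right] \left[ \frac{1}{\sqrt{2\pi}} \exp(-y_2^2/2) \right]$$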

Note that there is one-to-one correspondence between x and y
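The transformation from two uniform numbers to two Gaussian numbers is short enough to write down directly. The function name is hypothetical; `x1` must be strictly greater than zero so that the logarithm is defined.

```python
import math


def box_muller(x1, x2):
    """Map uniform x1 in (0, 1] and x2 in [0, 1) to two standard normal numbers."""
    radius = math.sqrt(-2.0 * math.log(x1))
    angle = 2.0 * math.pi * x2
    return radius * math.cos(angle), radius * math.sin(angle)
```

Applying the inverse relations to the output recovers the input pair, which is a quick way to check the one-to-one correspondence: for `y1, y2 = box_muller(0.3, 0.1)`, the value of $\exp(-(y_1^2 + y_2^2)/2)$ is 0.3 again.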


CS 428: Homework I ([*] = 15 points bonus)
Due Thursday Sept 14 at 2:30PM.

1. Write a Matlab function that computes the interaction energy of N water molecules and the forces that they exert on each other. It is useful to separate the calculation into different functions: one function that computes the internal energy of the water molecules, and one function that computes the Lennard Jones and electrostatic interactions. The input should be the coordinate vector and the number of water molecules, and the output the potential value and the gradient of the potential. Check the analytical gradient by a finite difference formula.

2. Construct five trial configurations for a water dimer (your choice of a minimum energy configuration) and evaluate their energies. Did you hit a minimum?

3. Write a steepest descent algorithm to minimize the energy of the water dimer and minimize it. Report a plot of the relative conformation of the molecules. Prepare a Matlab movie that follows the minimization path.

4. (*) Write a code that computes the second derivative of the energy of the water dimer.

5. (*) Refine the optimal structure obtained from the steepest descent minimization using Newton Raphson minimization. Report the changes in structure/energy.

In your submission include code, input, and output and a brief explanation. Both electronic and hard copies are required (send your electronic copy to [email protected]



HW4

Write a code that computes the structure and the energy of a two dimensional H/P polymer on a two dimensional lattice.

Consider a chain of length 14 (a 14-mer). Enumerate all possible configurations of the chain on the lattice. How many independent (mirror-image symmetry unrelated) conformations did you find?

Consider the following three sequences: 14H, PPHHHHHHHHPPPP (8H,6P) and PPPPHHHHHHPPPP (6H,8P). Find the global energy minimum for each of these sequences. Which of the sequences is more stable?



Statistics

Sampled from Morris H. DeGroot & Mark J. Schervish, Probability and Statistics, 3rd Edition, Addison Wesley


Probability: Continuous distribution and variables

Continuous distributions
Random variables
Probability density function
Uniform, normal and exponential distributions
Expectations and variance
Law of large numbers
Central limit theorem
Probability density functions of more than one variable
Rejection and transformation methods for sampling distributions (section)


Statistics

Estimators: mean, standard deviation
Maximum likelihood
Confidence intervals
$\chi^2$ statistics
Regression
Goodness of fit


Continuous random variables

A random variable X is a real-valued function defined on a sample space S.

X is a continuous random variable if a non-negative function f, defined on the real line, exists such that an integral over the domain A is the probability that X takes a value in domain A. (A is, for example, the interval [a,b].)

$$\Pr(a < X < b) = \int_a^b f(x)\,dx$$


Probability density function

f is called the probability density function (p.d.f.). Note that the units of the p.d.f. below are 1/length; only after multiplication by a length element do we get a probability.

For every p.d.f. we have

$$f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1.$$


Examples of p.d.f.s

- A car is driving in a circle at a constant speed. What is the probability that it will be found in the interval between 1 and 2 radians?
- A computer is generating, with equal probability density, random numbers between 0 and 1. What is the probability of obtaining 0.75?
- A protein folds at a constant rate (the probability that a protein will fold in the time interval [t,t+dt] is a constant times dt). If we have at time zero $N_0$ protein molecules, what is the probability that all protein molecules will fold after time t?


Uniform distribution on an interval

Consider an experiment in which a point X is selected from an interval $S = \{x : a \le x \le b\}$ in such a way that the probability of finding X in a given interval is proportional to the interval length (hence the p.d.f. is a constant). This distribution is called the uniform distribution. We must have for this distribution

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_a^b f(x)\,dx = 1$$


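Uniform distribution (continued)

$$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

[Figure: f(x) versus x, constant at 1/(b-a) between a and b]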


Exponential distribution

$$f(x) = a \exp(-a x), \qquad 0 < x < \infty$$

[Figure: f(x) versus x]


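Normal distribution

$$f(x) = \left(\frac{a}{\pi}\right)^{1/2} \exp\left[-a (x - x_0)^2\right], \qquad -\infty < x < \infty$$

[Figure: f(x) versus x, a bell-shaped curve centered at $x_0$]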


Continuous distribution functions

The distribution function F(x) is defined as $F(x) = \Pr(X \le x)$ for $-\infty < x < \infty$.

F(x) is a monotonic non-decreasing function of x (can you show it?) that can be written in terms of its corresponding p.d.f.

$$F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(x')\,dx' \qquad \text{or} \qquad \frac{dF}{dx} = f(x)$$


Distribution function: Example

$$f(x) = \begin{cases} a \exp(-a x) & \text{for } 0 < x < \infty \\ 0 & \text{otherwise} \end{cases}$$

$$F(x) = \int_0^x f(x')\,dx' = \int_0^x a \exp(-a x')\,dx' = \Big[-\exp(-a x')\Big]_0^x = 1 - \exp(-a x)$$
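The result F(x) = 1 - exp(-ax) can be checked numerically, and inverting it gives a transformation rule for turning a uniform number u into an exponentially distributed one. The sketch below is only an illustration: the function names, the rate a = 2 and the midpoint-rule integrator are assumptions made here, not part of the course material.

```python
import math

def exp_pdf(x, a):
    """p.d.f. of the exponential distribution, f(x) = a exp(-a x) for x >= 0."""
    return a * math.exp(-a * x) if x >= 0.0 else 0.0

def exp_cdf(x, a):
    """Distribution function F(x) = 1 - exp(-a x) for x >= 0."""
    return 1.0 - math.exp(-a * x) if x >= 0.0 else 0.0

def exp_inverse_cdf(u, a):
    """Solve u = F(x) for x; u must lie in [0, 1)."""
    return -math.log(1.0 - u) / a

def midpoint_integral(f, lo, hi, n=100000):
    """Midpoint-rule estimate of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

a = 2.0  # illustrative rate parameter (an assumption)
numeric_F = midpoint_integral(lambda x: exp_pdf(x, a), 0.0, 1.5)
print(numeric_F, exp_cdf(1.5, a))          # the two numbers should agree closely
print(exp_cdf(exp_inverse_cdf(0.3, a), a)) # recovers u = 0.3 up to rounding
```

Feeding uniform random numbers u into exp_inverse_cdf is one form of the transformation method listed in the outline.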


Expectation

For a random variable X with a p.d.f. f(x) the expectation E(X) is defined

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$

The expectation exists if and only if the integral is absolutely convergent

$$\int_{-\infty}^{\infty} |x| f(x)\,dx < \infty$$


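Expectation (example)

$$f(x) = \begin{cases} 2x & \text{for } 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}$$

$$E(X) = \int_0^1 x \cdot 2x\,dx = \left[\frac{2 x^3}{3}\right]_0^1 = \frac{2}{3}$$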


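The Cauchy p.d.f.

$$f(x) = \frac{1}{\pi (1 + x^2)}, \qquad -\infty < x < \infty, \qquad f(x) \ge 0$$

$$F(x) = \int_{-\infty}^{x} \frac{dx'}{\pi (1 + x'^2)} = \frac{1}{\pi}\Big[\arctan(x')\Big]_{-\infty}^{x} = \frac{1}{2} + \frac{1}{\pi}\arctan(x)$$

$$F(\infty) = \int_{-\infty}^{\infty} \frac{dx}{\pi (1 + x^2)} = \frac{1}{2} + \frac{1}{2} = 1$$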


Cauchy distribution: Expectation

Test for existence of expectation

$$\int_{-\infty}^{\infty} |x| f(x)\,dx = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{|x|}{1 + x^2}\,dx = \infty$$

Expectation does not exist for the Cauchy distribution.
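One way to see the divergence is to cut the integral off at a finite limit L, where it has the closed form ln(1 + L^2)/pi, and watch it grow without bound. The function name and the chosen cutoffs below are illustrative assumptions.

```python
import math

def truncated_abs_moment(L):
    """(1/pi) * integral from -L to L of |x| / (1 + x^2) dx = ln(1 + L^2) / pi."""
    return math.log1p(L * L) / math.pi

# The value keeps growing (logarithmically) as the cutoff is pushed out.
for L in (1e1, 1e3, 1e6, 1e9):
    print(L, truncated_abs_moment(L))
```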


Some properties of expectations

Expectation is linear

$$E(aX + bY) = a E(X) + b E(Y)$$

If the random variables X and Y are independent, $f(x,y) = f(x) f(y)$, then

$$E(XY) = E(X)\,E(Y)$$


Expectation of a function

Is essentially the same as the expectation of a variable

$$E(r(X)) = \int r\, g(r)\,dr = \int_{-\infty}^{\infty} r(x) f(x)\,dx$$

where g(r) is the p.d.f. of r = r(X). Of special interest is the expectation value of moments, for example the variance

$$E(X^2) - [E(X)]^2 = \int x^2 f(x)\,dx - \left[\int x f(x)\,dx\right]^2$$

Can you show that the variance is always non-negative?


Functions of several random variables

We consider a p.d.f. $f(x_1,\dots,x_n)$ of several random variables $X_1,\dots,X_n$.

The p.d.f. satisfies (of course)

$$f(x_1,\dots,x_n) \ge 0$$

$$\int_{a_1}^{b_1}\int_{a_2}^{b_2}\cdots\int_{a_n}^{b_n} f(x_1,\dots,x_n)\,dx_1\cdots dx_n = 1$$


Expectation of function of several variables

Similarly to the one variable case, expectations of functions of several variables are computed as

$$E(Y) = E(r(x_1,\dots,x_n)) = \int\cdots\int r(x_1,\dots,x_n)\, f(x_1,\dots,x_n)\,dx_1\cdots dx_n$$


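Example: expectation of more than one variable

$$f(x,y) = \begin{cases} 1 & \text{for } (x,y) \in S \\ 0 & \text{otherwise} \end{cases}$$

S is a square: $0 < x < 1$, $0 < y < 1$.

$$E(X^2 + Y^2) = \int\!\!\int (x^2 + y^2) f(x,y)\,dx\,dy = \int_0^1\!\!\int_0^1 (x^2 + y^2)\,dx\,dy = \frac{2}{3}$$

Markov inequality

X is a random variable such that $\Pr(X \ge 0) = 1$. Then for every $t > 0$

$$\Pr(X \ge t) \le \frac{E(X)}{t}$$

Prove it. Why is the case E(X) > t not interesting?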


Chebyshev Inequality is a special case of the Markov inequality

X is a random variable for which the variance exists. For t > 0

$$\Pr\big(|X - E(X)| \ge t\big) = \Pr\big((X - E(X))^2 \ge t^2\big) \le \frac{\mathrm{var}(X)}{t^2}$$

Substitute $Y = (X - E(X))^2$, so that $E(Y) = \mathrm{var}(X)$, and replace t by $t^2$ in the Markov inequality to obtain this result.


The law of large numbers I

Consider a set of n i.i.d. random variables $X_1,\dots,X_n$. Each of the random variables has mean (expectation value) $\mu$ and variance $\sigma^2$.

The arithmetic average of n samples is defined as $\bar{X}_n = \frac{1}{n}(X_1 + \dots + X_n)$. It defines a new random variable that we call the sample mean.

The expectation value of the sample mean

$$E(\bar{X}_n) = \frac{1}{n}\sum_i E(X_i) = \frac{1}{n}\, n \mu = \mu$$


The Law of Large Numbers II

The variance of $\bar{X}_n$

$$\mathrm{var}(\bar{X}_n) = E(\bar{X}_n^2) - [E(\bar{X}_n)]^2 = \frac{1}{n^2}\sum_{i,j} E(X_i X_j) - \frac{1}{n^2}\Big[\sum_i E(X_i)\Big]^2$$

Since $X_i$ and $X_j$ are independent for $i \ne j$

$$E(X_i X_j) = E(X_i)\,E(X_j)$$

$$\mathrm{var}(\bar{X}_n) = \frac{1}{n^2}\Big[n E(X^2) + n(n-1)[E(X)]^2 - n^2 [E(X)]^2\Big]$$

$$\mathrm{var}(\bar{X}_n) = \frac{E(X^2) - [E(X)]^2}{n} = \frac{\mathrm{var}(X)}{n}$$

Which means that the variance of the sample mean decreases as 1/n, inversely with the number of sampled points.
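The 1/n behavior of var(X)/n can be observed in a small numerical experiment. The sketch below is an illustration only: it assumes uniform(0, 1) variables (variance 1/12), a fixed seed, and 20000 repetitions per value of n, none of which come from the notes.

```python
import random

def variance_of_sample_mean(n, trials, rng):
    """Empirical variance of the mean of n uniform(0, 1) samples."""
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    grand_mean = sum(means) / trials
    return sum((m - grand_mean) ** 2 for m in means) / trials

rng = random.Random(0)   # fixed seed so that the run is repeatable
sigma2 = 1.0 / 12.0      # variance of a single uniform(0, 1) variable
results = {n: variance_of_sample_mean(n, 20000, rng) for n in (1, 4, 16)}
for n, v in results.items():
    # empirical variance of the sample mean next to the prediction var(X)/n
    print(n, v, sigma2 / n)
```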


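Law of Large numbers III

Chebyshev Inequality:

$$\Pr\big(|\bar{X}_n - \mu| \ge \varepsilon\big) \le \frac{\mathrm{var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n \varepsilon^2}$$

$$\Pr\big(|\bar{X}_n - \mu| < \varepsilon\big) \ge 1 - \frac{\sigma^2}{n \varepsilon^2} \qquad \text{for } \varepsilon > 0$$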


Central Limit Theorem

Statement without proof: Given a set of random variables $X_1,\dots,X_n$ with mean $\mu_i$ and variance $\sigma_i^2$ we define a new random variable

$$Y = \frac{\sum_{i=1,\dots,n} X_i - \sum_{i=1,\dots,n} \mu_i}{\Big(\sum_{i=1,\dots,n} \sigma_i^2\Big)^{1/2}}$$

For very large n, the distribution of $\sum_{i=1,\dots,n} X_i$ is normal with mean $\sum_{i=1,\dots,n} \mu_i$ and variance $\sum_{i=1,\dots,n} \sigma_i^2$.


Statistical Inference

Data are generated from an unknown probability distribution, and statements about the unknown distribution are warranted.

- Determine parameters (e.g. the rate parameter for the exponential distribution, the mean and the standard deviation for the normal distribution)
- Prediction of new experiments


Estimation of parameters

Notation: $f(x|\theta)$ is the probability density of sampling x given (conditioned on) parameters $\theta$.

For a set of n independent and identically distributed samples the probability density is:

$$f(x_1,\dots,x_n|\theta) = \prod_{i=1,\dots,n} f(x_i|\theta) = f_n(\mathbf{x}|\theta)$$

However, what we want to determine now are the parameters $\theta$. For example, assuming the distribution is normal, we seek the mean $\mu$ and the variance $\sigma^2$

$$f(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$$


Bayesian arguments

What we want is the function $f(\theta|\mathbf{x})$: given a set of observations $\mathbf{x}$, what is the probability that the set of parameters is $\theta$?

Bayesian statistics: Think of the parameters like other random variables with probability $\xi(\theta)$.

The joint probability $f(\mathbf{x},\theta) = f(\mathbf{x}|\theta)\,\xi(\theta)$ is also

$$f(\mathbf{x},\theta) = f(\theta|\mathbf{x})\, g(\mathbf{x})$$


The likelihood function

We can formally write

$$f(\theta|\mathbf{x}) = \frac{f(\mathbf{x}|\theta)\,\xi(\theta)}{g(\mathbf{x})}$$

which is the probability of having a particular set of parameters for the p.d.f. provided a set of observations (what we wanted). Note that our prime interest here is in the parameter set $\theta$, and the samples $\mathbf{x}$ are given. Since $g(\mathbf{x})$ is independent of $\theta$ we can write the likelihood function

$$f(\theta|\mathbf{x}) \propto f(\mathbf{x}|\theta)\,\xi(\theta)$$


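Example: Likelihood function I

Consider the exponential distribution

$$f(x|\theta) = \begin{cases} \theta \exp(-\theta x) & \text{for } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

$$f_n(\mathbf{x}|\theta) = \begin{cases} \theta^n \exp\Big(-\theta \sum_{i=1,\dots,n} x_i\Big) & \text{for } x_i \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

And assume the p.d.f. of the parameter is a Gaussian with a mean $\theta_0$ and variance of 1.

$$\xi(\theta) = \frac{1}{\sqrt{2\pi}}\exp\left[-\frac{(\theta - \theta_0)^2}{2}\right]$$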


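Example: Likelihood function II

$$f(\theta|\mathbf{x}) \propto \theta^n \exp\Big(-\theta \sum_{i=1,\dots,n} x_i\Big)\,\frac{1}{(2\pi)^{1/2}}\exp\left[-\frac{(\theta - \theta_0)^2}{2}\right]$$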


Maximum Likelihood

We look for a maximum of the function $L(\theta) = \log\big(f_n(\mathbf{x}|\theta)\big)$ as a function of the parameters $\theta$.

As a concrete example we consider the normal distribution

$$L(\mu,\sigma^2) = \log f_n(\mathbf{x}|\mu,\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1,\dots,n}(x_i - \mu)^2$$

To find the most likely set of parameters we determine the maximum of $L(\theta)$.


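Maximum of $L(\theta)$ for normal distribution

$$\frac{dL}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1,\dots,n}(x_i - \mu) = \frac{1}{\sigma^2}\Big[\sum_{i=1,\dots,n} x_i - n\mu\Big] = 0 \quad\Rightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1,\dots,n} x_i$$

$$\frac{dL}{d\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1,\dots,n}(x_i - \mu)^2 = 0 \quad\Rightarrow\quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1,\dots,n}(x_i - \hat{\mu})^2$$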
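The two maximum likelihood formulas for the normal distribution are easy to test: the log-likelihood evaluated at the estimated mean and variance should be larger than at any nearby parameter values. The sketch below uses made-up data and hypothetical function names chosen for this illustration.

```python
import math

def normal_log_likelihood(xs, mu, sigma2):
    """L(mu, sigma^2) = log f_n(x | mu, sigma^2) for i.i.d. normal samples."""
    n = len(xs)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - sum((x - mu) ** 2 for x in xs) / (2.0 * sigma2))

def normal_mle(xs):
    """Maximum likelihood estimates: sample mean and the 1/n variance."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

xs = [4.9, 5.3, 5.1, 4.7, 5.6, 5.0]   # made-up data for illustration
mu_hat, sigma2_hat = normal_mle(xs)
best = normal_log_likelihood(xs, mu_hat, sigma2_hat)
print(mu_hat, sigma2_hat, best)
# Moving away from the estimates in any direction lowers the log-likelihood.
print(best > normal_log_likelihood(xs, mu_hat + 0.05, sigma2_hat))
print(best > normal_log_likelihood(xs, mu_hat, 1.1 * sigma2_hat))
```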


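Determine a most likely parameter for the uniform distribution

$$f(x|\theta) = \begin{cases} \dfrac{1}{\theta} & \text{for } 0 \le x \le \theta \\ 0 & \text{otherwise} \end{cases}$$

$$f_n(\mathbf{x}|\theta) = \begin{cases} \dfrac{1}{\theta^n} & \text{for } 0 \le x_i \le \theta,\ i = 1,\dots,n \\ 0 & \text{otherwise} \end{cases}$$

It is clear that $\theta$ must be at least as large as all the $x_i$ and at the same time maximize the monotonically decreasing function $1/\theta^n$, hence

$$\hat{\theta} = \max[x_1,\dots,x_n]$$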


Potential problems in maximum likelihood procedure

- The value of $\theta$ is underestimated (note that $\theta$ should be larger than all x, not only the ones we have sampled so far).
- No guarantee that a solution exists: for the distribution below $\theta$ must be larger than any x but at the same time equal to the maximal x. This is not possible and hence there is no solution.
- The solution is not necessarily unique.

$$f(x|\theta) = \begin{cases} \dfrac{1}{\theta} & \text{for } 0 < x < \theta \\ 0 & \text{otherwise} \end{cases}$$


The $\chi^2$ distribution with n degrees of freedom

$$f(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2 - 1} \exp(-x/2), \qquad x > 0$$

$$E(x) = n, \qquad \mathrm{var}(x) = 2n$$

$$\Gamma(n) = \int_0^{\infty} t^{n-1} e^{-t}\,dt, \qquad n > 0$$

There is a useful relation between the $\chi^2$ and the normal distributions.


Theorem connecting $\chi^2$ and normal distributions

If the random variables $X_1,\dots,X_n$ are i.i.d. and if each of these variables has a standard normal distribution, then the sum of the squares

$$Y = X_1^2 + \dots + X_n^2$$

has a $\chi^2$ distribution with n degrees of freedom.

For a single variable ($Y = X^2$), let $\Phi$ and $\phi$ denote the distribution function and the p.d.f. of the standard normal distribution. The distribution function of Y is

$$F(y) = \Pr(Y \le y) = \Pr(X^2 \le y) = \Pr(-y^{1/2} \le X \le y^{1/2}) = \Phi(y^{1/2}) - \Phi(-y^{1/2})$$

The p.d.f. is obtained by differentiating both sides, $f(y) = F'(y)$ and $\Phi'(x) = \phi(x)$. Note $\phi(x) = (2\pi)^{-1/2}\exp(-x^2/2)$. We have

$$f(y) = \phi(y^{1/2})\,\tfrac{1}{2}\, y^{-1/2} + \phi(-y^{1/2})\,\tfrac{1}{2}\, y^{-1/2} = (2\pi)^{-1/2}\, y^{-1/2} \exp(-y/2)$$

which is the $\chi^2$ distribution with one degree of freedom.


Normal distribution: Parameters

Let $X_1,\dots,X_n$ be a random sample from a normal distribution having mean $\mu$ and variance $\sigma^2$. Then the sample mean (hat denotes M.L.E.)

$$\hat{\mu} = \bar{X}_n = \frac{1}{n}\sum_{i=1,\dots,n} X_i$$

and the sample variance

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1,\dots,n} (X_i - \bar{X}_n)^2$$

are independent random variables.

$\hat{\mu}$ has a normal distribution with a mean $\mu$ and variance $\sigma^2/n$.

$n\hat{\sigma}^2/\sigma^2$ has a chi-square distribution with n-1 degrees of freedom. Why n-1? (next slide)


Parameters of the normal distribution: Note 1

Let $x_1,\dots,x_n$ be a vector of random numbers of length n sampled from the normal distribution. Let $y_1,\dots,y_n$ be another vector of n random numbers, related to the previous vector by a linear transformation $\mathbf{A}$ ($\mathbf{A}\mathbf{A}^t = \mathbf{I}$)

$$\mathbf{y} = \mathbf{A}\mathbf{x}$$

Consider now the calculation of the variance (next slide)


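Variance is not changing upon linear transformations

Consider the expression

$$\sum_{i=1,\dots,n} (Y_i - \bar{Y}_n)^2 = (\mathbf{A}\mathbf{X} - \mathbf{A}\bar{\mathbf{X}}_n)^t (\mathbf{A}\mathbf{X} - \mathbf{A}\bar{\mathbf{X}}_n)$$

$$= (\mathbf{X} - \bar{\mathbf{X}}_n)^t \mathbf{A}^t \mathbf{A} (\mathbf{X} - \bar{\mathbf{X}}_n) = (\mathbf{X} - \bar{\mathbf{X}}_n)^t (\mathbf{X} - \bar{\mathbf{X}}_n) = \sum_{i=1,\dots,n} (X_i - \bar{X}_n)^2$$

Here $\bar{\mathbf{X}}_n$ is the vector whose elements are all equal to the sample mean, and $\bar{Y}_n$ stands for the corresponding element of its transform $\mathbf{A}\bar{\mathbf{X}}_n$. The analysis is based on the unitarity of $\mathbf{A}$. Hence, a linear transformation of this kind does not change the variance of the distribution. This makes it possible to exploit the difference between $\mu$ and $\bar{X}_n$.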


The n-1 (versus n) factor

Since $\mathbf{A}$ is arbitrary (as long as it is unitary), we can choose one of the transformation vectors $\mathbf{a}$ to be $(1,\dots,1)/n^{1/2}$.

The scalar product

$$\mathbf{a}^t \mathbf{X} - \mathbf{a}^t \bar{\mathbf{X}}_n = 0$$

is identically zero (remember how we compute the mean?)

Hence, since we computed the average from the same sample from which we computed the variance, the variance lost one degree of freedom.


The n-1 factor II

Note that the n-1 makes sense. Consider only a single sample point, which is of course very poor and leaves a high degree of uncertainty regarding the values of the parameters. If we use n then the estimated variance becomes zero, while if we use n-1 we obtain the indeterminate ratio 0/0, which is more appropriate to the problem at hand, for which we have no information to determine the variance.
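The difference between dividing by n and by n-1 can be seen with Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n-1. The data below are made up for illustration.

```python
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up sample, mean 5
var_n = statistics.pvariance(xs)          # divides by n (the M.L.E. form)
var_n_minus_1 = statistics.variance(xs)   # divides by n - 1
print(var_n, var_n_minus_1)

# With a single sample point the n form reports zero variance,
# while the n - 1 form refuses to give an answer.
single = [3.0]
print(statistics.pvariance(single))
try:
    statistics.variance(single)
except statistics.StatisticsError as err:
    print("n - 1 form is undefined for one point:", err)
```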


The t distribution (in preparation for confidence intervals)

Consider two random variables Y and Z, such that Y has a chi-2 distribution with n degrees of freedom and Z has a standard normal distribution. The variable X is defined by

$$X = \frac{Z}{(Y/n)^{1/2}}$$

Then the distribution of X is the t distribution with n degrees of freedom.


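The t distribution

The function is tabulated and can be written in terms of the $\Gamma$ function

$$t_n(x) = \frac{\Gamma\big(\frac{n+1}{2}\big)}{(n\pi)^{1/2}\,\Gamma\big(\frac{n}{2}\big)}\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} \qquad \text{for } -\infty < x < \infty$$

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} \exp(-x)\,dx$$

The t distribution approaches the normal distribution as $n \to \infty$. It has the same mean but longer tails.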


Confidence Interval

Confidence intervals provide an alternative to the use of an estimator in place of the actual value of an unknown parameter. We can find an interval (A,B) that we think has a high probability of containing the desired parameter. The length of the interval gives us an idea of how well we can estimate the parameter value.


Confidence interval: Example

The sample distribution is normal with mean $\mu_S$ and standard deviation $\sigma_S$. We expect to find a sample S in the intervals

$$\mu_S \pm \sigma_S; \qquad \mu_S \pm 2\sigma_S; \qquad \mu_S \pm 3\sigma_S$$

about 68.27%, 95.45% and 99.73% of the time respectively.


Confidence interval for means

If the statistic S is the sample mean $\bar{X}$ then 95% and 99% confidence limits for estimation of the population mean $\mu$ are given by $\bar{X} \pm 1.96\,\sigma_{\bar{X}}$ and $\bar{X} \pm 2.58\,\sigma_{\bar{X}}$ respectively.

For large samples ($n \ge 30$) we can write (depending on the level of confidence we are interested in)

$$\bar{X} \pm z_c \frac{\sigma}{\sqrt{n}}$$

For small samples we need the t distribution.
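The large-sample recipe can be written in a few lines. The sketch below assumes that the standard deviation sigma is known, and uses `statistics.NormalDist().inv_cdf` from the Python standard library to obtain the critical value z_c; the function name and the example numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sigma, n, confidence):
    """Large-sample interval: sample_mean +/- z_c * sigma / sqrt(n)."""
    # z_c leaves (1 - confidence) / 2 of the probability in each tail
    z_c = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z_c * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

# Illustrative numbers: sample mean 10, sigma 2, n = 100 samples
lo95, hi95 = mean_confidence_interval(10.0, 2.0, 100, 0.95)
lo99, hi99 = mean_confidence_interval(10.0, 2.0, 100, 0.99)
print(lo95, hi95)   # half width close to 1.96 * 0.2
print(lo99, hi99)   # half width close to 2.58 * 0.2
```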


Confidence interval for the mean of the normal distribution

Let $X_1,\dots,X_n$ form a random sample from a normal distribution with unknown mean and unknown variance. Let $t_{n-1}(x)$ denote the p.d.f. of the t distribution with n-1 degrees of freedom, and let c be a constant such that

$$\int_{-c}^{c} t_{n-1}(x)\,dx = \gamma$$

For every value of n, the value of c can be found from the table of the t distribution to fit the confidence (probability) $\gamma$.


Confidence interval for means: small sample (n

Water

A very simple molecule that consists of 3 atoms: one oxygen and two hydrogen atoms. Some remarkable properties. Very high melting and vaporization points for a liquid with so light a molecular mass. For example, CCl4 melts at 250K, while water melts at 273K. Anomalous density behavior at freezing (expansion on freezing: ice is lighter than liquid water, ice 0.92 g/mL, liquid water 1.0 g/mL). Very strong electrical forces (special orientation forces?). High dielectric constant.

Explain macroscopic behavior with a simple microscopic model? Explain also microscopic data, since macroscopic observations are too few.

Structures
Correlation functions, spatial & time.
Spatial correlation functions (first and second peak) as determinants of interaction strengths.
Time correlation measures (for example, dielectric response).

A water molecule is neutral but it carries a large dipole moment. The oxygen is highly negative and the hydrogens are positive. The dipole moment of the molecule (it is not linear) is 1.855 Debye units (Debye = 3.33564×10^-30 C m). An electron charge and a distance of one angstrom correspond to 4.803 Debye.

The geometry of a single water molecule is determined by the following parameters (using a model potential, TIP3P; the position of an atom as light as hydrogen in quantum mechanics is subject to considerable uncertainty): r(OH) = 0.9572 Å, a(HOH) = 104.52 degrees (the angle consistent with the H-H distance of 1.5139 Å used below). Since the hydrogen atoms are symmetric, the dipole moment is an experimental indication that the molecule is not linear. The individual molecules are set to be rigid (no violations of the above internal coordinates are allowed). We shall use harmonic restraints to fix the geometry (at least until we learn how to handle holonomic constraints). For the internal water potential, we write

$$U_{internal} = k\,(r_{o h_1} - 0.9572)^2 + k\,(r_{o h_2} - 0.9572)^2 + k'\,(r_{h_1 h_2} - 1.5139)^2$$

where k and k' are constants chosen to minimize the changes in the distances compared to the ideal values, and are chosen empirically to be 600 kcal/(mol Å^2).

The potential between water molecules using the TIP3P model is

$$U_{between\ molecules} = \sum_{i>j}\left[\frac{A}{r_{ij}(oo)^{12}} - \frac{B}{r_{ij}(oo)^{6}}\right] + \sum_{i>j}\sum_{k,l} K\,\frac{c_{ik}\,c_{jl}}{r_{ik,jl}}$$

The indices i,j are for water molecules, the indices k,l are for the atoms of a single molecule. Note the newly inserted constant K that is used for unit consistency. In our case it is 332.0716. The first sum is over the oxygen atoms only and includes repulsion between the different molecules. This type of potential is also called Lennard Jones. The hydrogen atoms (deprived of electrons) are so small that their influence on the hard-core shape of the molecule is ignored. The hard core is set to be spherical with the oxygen atom at the center. Hydrogen atoms (and oxygen atoms too) come into play in the second term of electrostatic interactions.

The potential parameters are as follows

$$A = 629{,}400\ \text{kcal Å}^{12}/\text{mol}$$
$$B = 625.5\ \text{kcal Å}^{6}/\text{mol}$$
$$c_o = -0.834$$
$$c_H = 0.417$$

To begin with we will be interested in the water dimer. The interaction between the two molecules can be written more explicitly (below) as

$$U_{between\ molecules} = \frac{A}{r_{o_1 o_2}^{12}} - \frac{B}{r_{o_1 o_2}^{6}} + K\left[\frac{c_o c_o}{r_{o_1 o_2}} + \frac{c_o c_h}{r_{o_1 h_{21}}} + \frac{c_o c_h}{r_{o_1 h_{22}}} + \frac{c_o c_h}{r_{o_2 h_{11}}} + \frac{c_o c_h}{r_{o_2 h_{12}}}\right.$$

$$\left. + \frac{c_h c_h}{r_{h_{11} h_{21}}} + \frac{c_h c_h}{r_{h_{11} h_{22}}} + \frac{c_h c_h}{r_{h_{12} h_{21}}} + \frac{c_h c_h}{r_{h_{12} h_{22}}}\right]$$

For the first monomer we use $o_1$, $h_{11}$ & $h_{12}$ to denote the corresponding three atoms. For the second monomer we use instead $o_2$, $h_{21}$ & $h_{22}$. Of course the modeling of the interaction between molecules must come together with the internal potential keeping the geometry fixed. Note that we do not compute electrostatic and Lennard Jones interactions of atoms within the same molecule.
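The dimer energy described above can be put together in a few lines. The following is a minimal sketch, not the course's reference implementation: it uses the parameter values quoted in these notes, takes both restraint constants k and k' as 600, and assumes a molecule is stored as a list of three (x, y, z) tuples in the order oxygen, hydrogen, hydrogen. The helper `ideal_water` is a hypothetical convenience for building a molecule with the ideal internal distances.

```python
import math

# Parameters quoted in the notes (kcal/mol, angstrom, electron charge units)
A = 629400.0
B = 625.5
K = 332.0716
C_O = -0.834
C_H = 0.417
K_BOND = 600.0   # used here for both k and k'
R_OH = 0.9572
R_HH = 1.5139

def dist(p, q):
    """Euclidean distance between two points given as (x, y, z) tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def internal_energy(o, h1, h2):
    """Harmonic restraints that keep one molecule close to its ideal geometry."""
    return (K_BOND * (dist(o, h1) - R_OH) ** 2
            + K_BOND * (dist(o, h2) - R_OH) ** 2
            + K_BOND * (dist(h1, h2) - R_HH) ** 2)

def between_energy(mol1, mol2):
    """Lennard Jones between the oxygens plus all nine charge-charge terms."""
    r_oo = dist(mol1[0], mol2[0])
    energy = A / r_oo ** 12 - B / r_oo ** 6
    charges = (C_O, C_H, C_H)
    for c1, p1 in zip(charges, mol1):
        for c2, p2 in zip(charges, mol2):
            energy += K * c1 * c2 / dist(p1, p2)
    return energy

def dimer_energy(mol1, mol2):
    """Total energy: two internal terms plus the interaction between molecules."""
    return internal_energy(*mol1) + internal_energy(*mol2) + between_energy(mol1, mol2)

def ideal_water(shift_x=0.0):
    """Hypothetical helper: a molecule with ideal O-H and H-H distances."""
    half = 0.5 * R_HH
    y = math.sqrt(R_OH ** 2 - half ** 2)
    return [(shift_x, 0.0, 0.0), (shift_x + half, y, 0.0), (shift_x - half, y, 0.0)]

# A trial configuration: second molecule displaced by 3 angstrom along x
print(dimer_energy(ideal_water(), ideal_water(3.0)))
```

Trial configurations for the dimer can be scored by calling `dimer_energy` on different placements of the second molecule.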

The water dimer has an optimal geometry (a minimum energy configuration) that we now wish to determine. This is a question of how to arrange negatively charged atoms close to positively charged atoms and keep similar charges away from each other. Together with the fixed geometry of the individual molecules this is a frustrated system (you cannot make everybody happy; a classic example is that of three spins), which also has implications for the docking problem and drug design (determining the orientation of two interacting molecules) and the range of complementing interactions and shape matching.

What is the dimensionality of the problem? We have three atoms in each of the water molecules, making a total of eighteen degrees of freedom. The actual number relevant for the docking is much smaller. Of the eighteen, six are translations and rotations of the whole system (not changing the relative orientation) and another six are removed by the internal constraints on the structure of the molecules. Hence, only six degrees of freedom remain when we attempt to match the two (rigid) water molecules together.

Working directly with these six degrees of freedom is possible but requires considerable work for implementation and makes it necessary to write the energy in considerably less convenient coordinates (the most convenient are just the Cartesian coordinates of the individual atoms). So rather than being clever at the very first step, we will use the simplest approach. In the simplest approach we attempt to optimize the energy of the water dimer as a function of all degrees of freedom. This requires more work from the computer, but less work from us. Minimizing the energy (and finding the best structure) in this large 18 dimensional space is something we cannot do by intuition and we must design an algorithm to do it.

The algorithm of steepest descent is one way of doing it, and this is the first approach that we are going to study. Consider a guess coordinate vector for the water dimer that we denote by R (capitals denote arrays or vectors, lower case denotes elements of a vector). It is a vector of length 18 and includes all the x,y,z coordinates of the six atoms. What is the best way of organizing the coordinates? In some programs the vector of coordinates is decomposed into three Cartesian vectors

$$R = (X, Y, Z) = (x_1,\dots,x_6,\ y_1,\dots,y_6,\ z_1,\dots,z_6)$$

This set up is however less efficient from the viewpoint of memory architecture. When we compute distances between atoms (the main computation required for the energy calculations) we require the (x,y,z) coordinates of the two atoms. By aggregating all the Cartesian components together (e.g. all the x-s coming first) we may create cache misses. Finding the Cartesian components of one atom makes it necessary to walk quite far along the array: every time we need one of the Cartesian components we need to step through n numbers to pick the next Cartesian component. This is not a serious problem for 6 atoms, but for hundreds of thousands or millions of atoms (as in some simulations) it might be. A more efficient structure that we shall adopt is therefore:

$$R = (x_1, y_1, z_1,\dots,x_6, y_6, z_6)$$

where the coordinates that belong to a single particle are kept together, making it less likely to have cache misses.
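The two layouts differ only in how an (atom, axis) pair is mapped to a position in the coordinate array. The short sketch below, with hypothetical helper names and zero-based atom numbers, shows where the three components of one atom end up in each case for the six atoms (18 coordinates) of the water dimer.

```python
def index_interleaved(atom, axis):
    """Position of coordinate axis (0=x, 1=y, 2=z) of atom in (x1, y1, z1, x2, ...)."""
    return 3 * atom + axis

def index_blocked(atom, axis, n_atoms):
    """Position of the same coordinate in (x1, ..., xn, y1, ..., yn, z1, ..., zn)."""
    return axis * n_atoms + atom

n_atoms = 6   # two water molecules
atom = 4      # zero-based atom number, chosen for illustration
print([index_interleaved(atom, ax) for ax in range(3)])        # [12, 13, 14]
print([index_blocked(atom, ax, n_atoms) for ax in range(3)])   # [4, 10, 16]
```

In the interleaved layout the three numbers are adjacent in memory; in the blocked layout they are n_atoms positions apart, which is the source of the cache misses mentioned above.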

Starting from an initial configuration $R_0$ we want to find a lower energy configuration. The argument is that low energy configurations are more likely to be relevant. This is true in general but is not always sufficient. We shall come back to this point when discussing entropy, temperature, and how to simulate thermal fluctuations.

For the moment assume that we are going to make a small step, a displacement $\delta R = (R - R_0)$ of norm

$$|\delta R|_2 = \Big[\sum_i (q_i - q_i^0)^2\Big]^{1/2}$$

We have used the so-called norm 2 distance measure, which is very common in computational biophysics. We have also used another variable $q_i$, which is any of the Cartesian coordinates of the system. For a system with N atoms, there are 3N coordinates $q_i$. It is useful to know that a more general formulation would be

$$|\delta R|_n = \Big[\sum_i |q_i - q_i^0|^n\Big]^{1/n}$$

These different distances emphasize alternative aspects of the system. For example, when $n \to \infty$ the norm approaches the largest displacement along a given Cartesian direction in the system. Another widely used distance is for n = 1, which is the Manhattan distance.

The displacement $\delta R$ is made in a large space of 18 degrees of freedom, so while we fixed the size of the displacement to be small there is still a lot to be done to find an optimal displacement that will reduce the energy as much as possible (for a given size of displacement). How are we going to choose the displacement (given that we are choosing the norm of the displacement to be small)? Here is a place where the Taylor series can come to the rescue. We expand the potential U(R) in the neighborhood of the current coordinate set $R_0$ to first order in $\delta R$ (higher order terms are considered small and are neglected).

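$$U(R_0 + \delta R) \cong U(R_0) + \sum_i \left.\frac{dU}{dq_i}\right|_{q_i = q_i^0} (q_i - q_i^0) = U(R_0) + \big(\nabla U(R_0)\big)^t\, \delta R$$

The expression $(\nabla U)^t$ means a transposed vector of length 3N, which is also called the gradient of the potential

$$(\nabla U)^t = \left(\frac{dU}{dq_1}, \frac{dU}{dq_2}, \dots, \frac{dU}{dq_{3N}}\right)$$

Similarly we have for $\delta R$

$$\delta R = \begin{pmatrix} q_1 - q_1^0 \\ q_2 - q_2^0 \\ \vdots \\ q_{3N} - q_{3N}^0 \end{pmatrix}$$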

The expression for the potential difference is a scalar (or inner) product between two vectors. Hence we can also write

$$U(R_0 + \delta R) - U(R_0) = (\nabla U)^t\, \delta R = |\nabla U|_2\, |\delta R|_2 \cos(\theta) = |\nabla U|\, |\delta R| \cos(\theta)$$

We have omitted the 2 from the expression of the vector norm $|A|$ on the right hand side. We will always use the norm 2, unless specifically suggested otherwise, and therefore carrying the 2 around is not necessary. Note that the gradient of the potential is something that we compute at the current point, $R_0$, and it is not something that we can change. Similarly we fixed the norm of the displacement $|\delta R|$. So the only variable in town is the direction of the displacement with respect to the gradient of the potential, the angle $\theta$. To minimize the difference in energies (making $U(R_0 + \delta R)$ as low as possible) we should make $\cos(\theta)$ as small as possible. The best we could do is to make $\cos(\theta) = -1$ and choose the step in the opposite direction to the gradient vector. Hence, the steepest descent algorithm for energy minimization is

$$\delta R = -\frac{\nabla U}{|\nabla U|}\, |\delta R|$$

We did not discuss so far how to choose the step length $|\delta R|$, except to argue that it should be small. We expect that if the gradient is very small then we are near a minimum (actually a stationary point where $\nabla U = 0$). In that case only a small step should be used. On the other hand if the gradient is large, a somewhat larger step could be taken since we anticipate a significant change in the function (for the better). Of course the step still cannot be too large since our arguments are based on a linear expansion of the function. Based on the above arguments, it is suggestive to use the following (simpler) expression for the displacement, which is directly proportional to the length of the gradient vector, with $\lambda$ a small positive constant.

$$\delta R = -\lambda \nabla U$$

We can imagine now a numerical minimization process that goes as follows:

0. Init -- $R = R_0$
1. Compute $\nabla U(R)$
2. Compute a step $\delta R = -\lambda \nabla U$ and a new coordinate set $R = R + \delta R$
3. Check for convergence ($|\nabla U|$ smaller than a chosen tolerance). If converged, stop. Otherwise return to 1.
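The loop above translates almost line by line into code. The sketch below is an illustration on a simple two-variable test potential rather than on the water dimer; the function names, the step coefficient 0.04 and the tolerance are assumptions chosen for this example.

```python
import math

def steepest_descent(grad, r0, step, tol=1e-8, max_iter=100000):
    """Fixed-coefficient steepest descent: R <- R - step * grad U(R)."""
    r = list(r0)                                 # 0. init
    for iteration in range(max_iter):
        g = grad(r)                              # 1. compute the gradient
        g_norm = math.sqrt(sum(gi * gi for gi in g))
        if g_norm < tol:                         # 3. convergence check
            return r, iteration
        r = [ri - step * gi for ri, gi in zip(r, g)]   # 2. take the step
    return r, max_iter

# Test potential U(x, y) = (x - 1)^2 + 10 (y + 2)^2 with its minimum at (1, -2)
def grad_u(r):
    x, y = r
    return [2.0 * (x - 1.0), 20.0 * (y + 2.0)]

r_min, n_steps = steepest_descent(grad_u, [0.0, 0.0], step=0.04)
print(r_min, n_steps)
```

If the step coefficient is made too large for the stiff y direction (here above 0.1) the iteration diverges, which reflects the point made above that the step cannot be too large.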

This procedure finds for us a local minimum, a minimum to which we can slide down directly from an initial guess. It will not provide a solution to the global optimization problem, finding the minimum that is the lowest of them all.

We can take the above expression one step further and write it down as a differential equation with the coordinates a function of a progress variable $\tau$. This is not the most efficient way of minimizing the structure under consideration but it will help us understand the process better. We write

$$\frac{dR}{d\tau} = -\nabla U$$

where the (dummy) variable $\tau$ is varying from zero to $\infty$. As $\tau \to \infty$ we approach the nearby stationary point. Note that the potential $U(R(\tau))$ is a monotonically non-increasing function of $\tau$. This is easy to show as follows. Multiplying both sides of the equation by $\left(\frac{dR}{d\tau}\right)^t$, we have

$$\left(\frac{dR}{d\tau}\right)^t \frac{dR}{d\tau} = -\left(\frac{dR}{d\tau}\right)^t \nabla U = -\sum_i \frac{dq_i}{d\tau}\frac{dU}{dq_i}$$

Assuming that the potential does not have an explicit dependence on $\tau$ (an explicit dependence would make $\tau$ more than a dummy variable, which contradicts our initial set-up) we can write the final expression in a more compact and illuminating form

$$\left(\frac{dR}{d\tau}\right)^t \frac{dR}{d\tau} = -\frac{dU}{d\tau} \ge 0, \qquad \frac{dU}{d\tau} \le 0$$

Hence as we progress along the solution of the differential equation the potential energy is decreasing in the best case and is not increasing in the worst case. There can be different variants of how to choose the norm of the step $|\delta R|$; we already mentioned two of them.

Another important variant of the steepest descent minimization is to make the step size $\lambda$ a parameter and to optimize it for a given search direction defined by $\nabla U$. One approach would be to perform a search for a minimum along the line defined by $\nabla U$, i.e. we seek a $\lambda$ such that $\nabla U(R)^t\, \nabla U(R - \lambda \nabla U(R)) = 0$ as a function of the single scalar variable $\lambda$. This one-dimensional minimization makes sense if the calculation of the gradient can be avoided in the one-dimensional minimization, and search steps in the one-dimensional minimization can be computed more efficiently than the determination of the direction. For example, in the one-dimensional optimization only calculations of the energy $U(R - \lambda \nabla U)$ can be used in conjunction with the approach of interval halving. We guess a given $\lambda_0$ and if the energy is going up we try a half of the previous size, $\lambda_0/2$; if it is going down we double it. We continue to evaluate the progress of halving an interval that contains a minimum until we hit a minimum with the desired properties (halving on the left or halving on the right results in an increasing energy). This scheme assumes that the calculation of the potential energy is more efficient than the calculation of the gradient. This is however not the case here and for the task at hand the line search option of the steepest descent algorithm is not efficient.
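A simplified version of the halving and doubling idea can be sketched using energy evaluations only. This is one possible reading of the scheme, not a definitive implementation: the function name, the starting guess and the stopping rule (stop doubling as soon as the energy stops decreasing) are assumptions made for this example.

```python
def line_search(energy, r, g, lam0=1.0, max_iter=60):
    """Pick a step size lam along -g using energy evaluations only."""
    def e_at(lam):
        return energy([ri - lam * gi for ri, gi in zip(r, g)])

    e0 = e_at(0.0)
    lam = lam0
    # Halve the step until the energy is lower than at the starting point.
    for _ in range(max_iter):
        if e_at(lam) < e0:
            break
        lam *= 0.5
    else:
        return 0.0   # no downhill step was found
    # Double the step for as long as doubling keeps lowering the energy.
    for _ in range(max_iter):
        if e_at(2.0 * lam) < e_at(lam):
            lam *= 2.0
        else:
            break
    return lam

# Test potential U(x, y) = (x - 1)^2 + 10 (y + 2)^2 and its gradient at (0, 0)
def energy_u(r):
    x, y = r
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2

r0 = [0.0, 0.0]
g0 = [-2.0, 40.0]
lam = line_search(energy_u, r0, g0)
r1 = [ri - lam * gi for ri, gi in zip(r0, g0)]
print(lam, energy_u(r0), energy_u(r1))
```

Each trial value of the step costs one energy evaluation and no gradient evaluation, which is the situation in which a line search pays off.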

We note that compared to other optimization algorithms (such as conjugate gradient, which we shall not discuss, and the Newton-Raphson algorithm, which we will) the steepest descent algorithm is considerably slower. However, it is a lot more stable than (for example) the Newton Raphson approach. It is a common practice in molecular simulations to start with a crude minimizer like the steepest descent algorithm described above (if the initial structure is pretty bad), and then to refine the coordinates to perfection using something like the Newton Raphson algorithm, which is on our agenda.

A few words about computing potential derivatives, which is of prime importance for minimization algorithms in high dimensions. It is unheard of to have effective minimizers in high dimensions without derivatives, since we must have a sense of direction where to go. That sense of direction is given by the potential gradient.

Overall, the potential derivatives are dominated by calculations of distances. It is therefore useful to consider the derivative of a distance between two particles, as a function of the Cartesian coordinates.


$$r_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$$

$$\frac{dr_{ij}}{dw_i} = \frac{w_i - w_j}{\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}} = \frac{w_i - w_j}{r_{ij}}, \qquad w = x, y, z$$

With the above formulas at hand it is straightforward to do all kinds of derivatives with extensive use of the chain rule. For example the Lennard Jones term

$$\frac{d}{dw_k}\sum_{i>j}\left[\frac{A}{r_{ij}^{12}} - \frac{B}{r_{ij}^{6}}\right] = \sum_{j \ne k}\left[-\frac{12A}{r_{kj}^{13}} + \frac{6B}{r_{kj}^{7}}\right]\frac{dr_{kj}}{dw_k} = \sum_{j \ne k}\left[-\frac{12A}{r_{kj}^{13}} + \frac{6B}{r_{kj}^{7}}\right]\frac{w_k - w_j}{r_{kj}}$$

$$= \sum_{j \ne k}\left[-\frac{12A}{r_{kj}^{14}} + \frac{6B}{r_{kj}^{8}}\right](w_k - w_j)$$

Note that the expression above depends only on even powers of the distance (14 and 8), which is good news, meaning that no square roots are required. Square roots are the curse of simulations and are much more expensive to compute compared to add/multiply etc. Unfortunately, electrostatic interactions (energy and derivatives) yield odd powers, and are more expensive to compute. The potential and its derivatives require square root calculations. And here is the expression for a set of independent atoms (for convenience we forget about molecules here)

$$\frac{d}{dw_k}\sum_{i>j}\frac{c_i c_j}{r_{ij}} = -\sum_{i \ne k}\frac{c_i c_k}{r_{ik}^{2}}\,\frac{w_k - w_i}{r_{ik}} = -\sum_{i \ne k}\frac{c_i c_k\,(w_k - w_i)}{r_{ik}^{3}}$$

Here is a little tricky question. To compute the energy (which is a single scalar) for N atoms we need to calculate about $N^2/2$ terms (assuming all unique distances contribute to the energy), and then add them up. How many terms do we need to compute for the gradient of the potential?

Here is another trick question going back to our data structure. Suppose that I am adding a strong electric field, E, along the Z axis and the corresponding (added) energy takes the form $U_{E\,field} = -\sum_i c_i E z_i$. Is the data structure that we proposed (keeping the Cartesian components of one particle vector together) still ok?

Calculations of potential gradients are a major source of errors in programming molecular modeling code. It is EXTREMELY useful to check the analytical derivatives against numerical derivatives computed by finite difference, where the expected agreement should be at least a few digits. For example

$$\frac{dU}{dq_k} \cong \frac{U(q_1,\dots,q_k + \Delta/2,\dots,q_N) - U(q_1,\dots,q_k - \Delta/2,\dots,q_N)}{\Delta}$$

For the water system the step $\Delta$ can be $10^{-6}$ (using double precision) and the expected accuracy is at least 3-4 digits. A larger discrepancy than that suggests an error.
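The check can be automated. The sketch below compares an analytical Lennard Jones gradient (only even powers of the distance, as derived above) with the central finite difference for three independent atoms. The coordinates are arbitrary illustrative values, the A and B parameters are the ones quoted in these notes, and the function names are assumptions.

```python
A, B = 629400.0, 625.5   # Lennard Jones parameters quoted in these notes

def lj_energy(q):
    """q = (x1, y1, z1, x2, ...); sum of A/r^12 - B/r^6 over unique pairs."""
    n = len(q) // 3
    total = 0.0
    for i in range(n):
        for j in range(i):
            r2 = sum((q[3 * i + a] - q[3 * j + a]) ** 2 for a in range(3))
            total += A / r2 ** 6 - B / r2 ** 3
    return total

def lj_gradient(q):
    """Analytical gradient; uses r^14 and r^8 only, so no square roots."""
    n = len(q) // 3
    grad = [0.0] * len(q)
    for k in range(n):
        for j in range(n):
            if j == k:
                continue
            d = [q[3 * k + a] - q[3 * j + a] for a in range(3)]
            r2 = sum(c * c for c in d)
            factor = -12.0 * A / r2 ** 7 + 6.0 * B / r2 ** 4
            for a in range(3):
                grad[3 * k + a] += factor * d[a]
    return grad

def numerical_gradient(energy, q, delta=1e-6):
    """Central finite difference with a total step delta for each coordinate."""
    grad = []
    for k in range(len(q)):
        plus = list(q)
        plus[k] += 0.5 * delta
        minus = list(q)
        minus[k] -= 0.5 * delta
        grad.append((energy(plus) - energy(minus)) / delta)
    return grad

q = [0.0, 0.0, 0.0, 3.2, 0.1, -0.2, 0.3, 3.4, 0.5]   # three atoms, arbitrary
analytic = lj_gradient(q)
numeric = numerical_gradient(lj_energy, q)
print(max(abs(a - b) for a, b in zip(analytic, numeric)))
```

A large value printed here would point to a programming error in either the energy or the gradient routine.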

This concludes (in principle) our discussion of thesteepest descent minimizationalgorithm. From the above discussion it is quite clear that minimization in theneighborhood of a stationary point of the potential (like a minimum) is difficult . Near the minimum the gradient of the potential is close to zero, subject to potentialnumerical error and support the use of only extremely small step that are harder toconverge numerically. In that sense the algorithm we discuss next is complementary tothe steepest descent approach. It is working well in the neighborhood of a minimum. It isnot working so well if the starting structure is very far from a minimum, since the largestep taken by this algorithm (Newton Raphson) relies on the correctness of thequadratic expansion of the potential energy surface near a minimum .

We start by considering a linear expansion of the potential derivatives at the current position $R$ in the neighborhood of the desired minimum $R_m$. At the minimum, the gradient is (of course) zero. We have

$$ 0 = \left[ \nabla U(R_m) \right]_i \approx \left[ \nabla U(R) \right]_i + \sum_j \frac{d^2 U}{dq_i\, dq_j} \left( q_{m,j} - q_j \right) $$

The entity that we wish to determine from the above equation is $R_m$, the position of the minimum. The square bracket with a subscript, $[\ldots]_j$, denotes the j-th vector element. The last term is a multiplication of a matrix by a vector. From now onwards we shall denote the second derivative matrix by $\underline{\underline{U}}$. We are attempting to find the minimum by expanding the force linearly in the neighborhood of the current point. This is one step up in expansion compared to the steepest descent minimization; however, it is not sufficient in general. In the simplest version of the Newton Raphson (NR) approach very large steps are allowed. Large steps can clearly lead to problems if the linear expansion is not valid. Nevertheless, the expansion is expected to be valid if we are close to a minimum, since any function in the neighborhood of a minimum can be expanded (accurately) up to a second order term

$$ U(R) \approx U(R_m) + \frac{1}{2} (R - R_m)^t\, \underline{\underline{U}}\, (R - R_m) $$

Note that we did not write down the first order derivatives (gradient) since they are zero at the minimum. This is one clear advantage of NR with respect to the steepest descent minimizer (SDM). SDM relies on the first derivatives only, derivatives that vanish in the neighborhood of a minimum. It is therefore difficult for SDM to make progress in these circumstances, while NR can do it in one single shot as we see below. We write again the equation for the gradient in a matrix form

$$ 0 = \nabla U(R_m) \approx \nabla U(R) + \underline{\underline{U}}(R) \left( R_m - R \right) $$



which we can formally solve (for $R_m$) as

$$ R_m = R - \left( \underline{\underline{U}} \right)^{-1} \nabla U(R) $$

The matrix $\left( \underline{\underline{U}} \right)^{-1}$ is the inverse of the matrix $\underline{\underline{U}}$, namely $\left( \underline{\underline{U}} \right)^{-1} \underline{\underline{U}} = I$ where $I$ is the identity matrix. In principle there is nothing in the above equation that determines the norm of the step $R_m - R$ that we should take. If the system is close to quadratic, the second derivative matrix is roughly constant and a reasonably large step toward the minimum can be taken without significantly violating the validity of the above equation. The ability to take a large step in a quadratic-like system is a clear advantage of NR compared to SDM. However, for systems that are not quadratic the matrix is not constant and only small steps (artificially enforced) should be used. In our water system NR should be used with care and only sufficiently close to a minimum. There are a few technical points that specifically should concern us with the water dimer optimization problem. We will be concerned with the following:
1. Does the matrix $\underline{\underline{U}}$ have an inverse, and what can we do if it does not?
2. How to find the inverse (or solve the above linear equations)?
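For an exactly quadratic potential with an invertible second derivative matrix, the formal solution above reaches the minimum in a single step. The sketch below illustrates this on a hypothetical two-dimensional quadratic; the matrix `A`, the minimum `R0` and the function names are assumptions made for the example. It solves the linear equations with `numpy.linalg.solve` rather than forming the inverse explicitly, which is one common way to address the second question.

```python
import numpy as np

def newton_raphson_step(position, gradient, hessian):
    """One NR step: R_m = R - U''^{-1} grad U(R), via a linear solve."""
    return position - np.linalg.solve(hessian, gradient)

# Hypothetical quadratic potential U(R) = 1/2 (R - R0)^t A (R - R0)
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # symmetric, positive definite
R0 = np.array([1.0, -2.0])      # position of the minimum

def grad_u(R):
    """Gradient of the quadratic potential."""
    return A @ (R - R0)

start = np.array([10.0, 7.0])   # far from the minimum
R_min = newton_raphson_step(start, grad_u(start), A)
print("estimated minimum:", R_min)
```

For a potential that is not quadratic the same step only gives an improved estimate, and it would have to be repeated (possibly with a restricted step length).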

The bad news is that the matrix $\underline{\underline{U}}$ for molecular systems (as the water dimer is) does not have, in general, an inverse. The problem is the six degrees of freedom that we mentioned earlier and that do not affect the potential energy: three overall translations and three overall rotations have zero eigenvalues, making the matrix singular. This is easy to see as follows.

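Let the eigenvectors of $\underline{\underline{U}}$ be $e_i$ and the eigenvalues $\lambda_i$. It is possible to write the matrix $\underline{\underline{U}}$ as the following sum

$$ \underline{\underline{U}} = \sum_i \lambda_i\, e_i e_i^t $$

where the outer product

$$ e_i e_i^t = \begin{pmatrix} e_{i1} \\ e_{i2} \\ \vdots \\ e_{iN} \end{pmatrix} \begin{pmatrix} e_{i1} & e_{i2} & \cdots & e_{iN} \end{pmatrix} = \begin{pmatrix} e_{i1}e_{i1} & e_{i1}e_{i2} & \cdots & e_{i1}e_{iN} \\ e_{i2}e_{i1} & e_{i2}e_{i2} & \cdots & e_{i2}e_{iN} \\ \vdots & \vdots & \ddots & \vdots \\ e_{iN}e_{i1} & e_{iN}e_{i2} & \cdots & e_{iN}e_{iN} \end{pmatrix} $$

generates a symmetric NxN matrix. Since the vectors $e_i$ are orthogonal to each other ($e_i^t e_j = \delta_{ij}$), it is trivial to write down the inverse in this case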

$$ \left( \underline{\underline{U}} \right)^{-1} = \sum_i \frac{e_i e_i^t}{\lambda_i} $$

However, this expression is true only if all $\lambda_i$ are different from zero. The eigenvectors represent directions of motion and the eigenvalues are associated with the cost in energy



for moving in a direction determined by the corresponding eigenvector. However, moving along the direction of global translation (or global rotation) does not change the energy; therefore the corresponding eigenvalues must be equal to zero, and the inverse is impossible to get.
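The zero eigenvalues are easy to observe numerically. The sketch below is an illustration under stated assumptions: three atoms connected by hypothetical harmonic springs, $\sum_{i<j} (k/2)(r_{ij} - r_{ij}^0)^2$, with the second derivative matrix evaluated at the minimum, where it reduces to a sum of $k\, n_{ij} n_{ij}^t$ blocks ($n_{ij}$ is the unit vector along the pair). It is not the water dimer potential. The code checks the spectral sum and counts the eigenvalues that vanish.

```python
import numpy as np

def harmonic_pair_hessian(coords, k=2.0):
    """Second derivative matrix of sum_{i<j} (k/2)(r_ij - r0_ij)^2,
    evaluated at the minimum (every r_ij equal to its r0_ij)."""
    n = len(coords)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[i] - coords[j]
            unit = d / np.linalg.norm(d)
            block = k * np.outer(unit, unit)
            si, sj = slice(3 * i, 3 * i + 3), slice(3 * j, 3 * j + 3)
            hess[si, si] += block
            hess[sj, sj] += block
            hess[si, sj] -= block
            hess[sj, si] -= block
    return hess

# Equilateral triangle with unit sides (hypothetical minimum geometry)
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.5, np.sqrt(3.0) / 2.0, 0.0]])
hess = harmonic_pair_hessian(coords)

# eigh returns eigenvalues and orthonormal eigenvectors (as columns)
eigenvalues, eigenvectors = np.linalg.eigh(hess)

# Spectral sum: U = sum_i lambda_i e_i e_i^t
rebuilt = (eigenvectors * eigenvalues) @ eigenvectors.T

n_zero = int(np.sum(np.abs(eigenvalues) < 1.0e-8))
print("number of (numerically) zero eigenvalues:", n_zero)
```

Of the nine degrees of freedom only three (the internal ones) carry a non-zero eigenvalue, so the sum over $e_i e_i^t / \lambda_i$ cannot be formed for this matrix as it stands.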

One way of getting around this problem is by shifting the eigenvalues of the offending eigenvectors. If we know in advance the six offending eigenvectors we can raise their eigenvalues to very high values (instead of zero) by adding to the matrix outer products of these vectors multiplied by a very high value (see below). The contribution of eigenvectors with very high eigenvalues will diminish when we compute the inverse, since the inverse is obtained by dividing by the corresponding eigenvalues. We do not have to find all the eigenvectors and the eigenvalues as is written above; it is sufficient if we affect the few specific eigenvectors and define a new matrix $\underline{\underline{U}}'$ to obtain a well behaved $\left( \underline{\underline{U}}' \right)^{-1}$. The question that remains (of course) is how to find these six eigenvectors. The most straightforward (and inefficient) way to do it is to actually compute the eigenvectors and eigenvalues by a matrix diagonalization procedure. Luckily we do not have to do that, since the translation and rotation eigenvectors are known from the Eckart conditions. We have for translation

$$ \sum_i m_i \left( r_i - r_i^0 \right) = 0, \qquad r_i = (x_i, y_i, z_i) $$

and for rotation

$$ \sum_i m_i\, r_i^0 \times \left( r_i - r_i^0 \right) = 0 $$

(Note: mention the possibility of using Lagrange multipliers here.)

The last multiplication is a vector product, and the difference $r_i - r_i^0$ is assumed to be small. The coordinate vectors $r_i^0$ are reference vectors used to define the coordinate system and are constants.

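For the record, we write the vector product explicitly

$$ r_i^0 \times r_i = \begin{vmatrix} e_x & e_y & e_z \\ x_i^0 & y_i^0 & z_i^0 \\ x_i & y_i & z_i \end{vmatrix} = e_x \left( y_i^0 z_i - z_i^0 y_i \right) - e_y \left( x_i^0 z_i - z_i^0 x_i \right) + e_z \left( x_i^0 y_i - y_i^0 x_i \right) $$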

The two equations define six constraints, since they are vector equations and each of the vectors has 3 components x, y, z. To obtain the eigenvectors associated with these constraints we need to compute the gradients of the above constraints. We have three vectors for translations that we denote by $e_{tx}, e_{ty}, e_{tz}$, and three vectors for the rotations, $e_{rx}, e_{ry}, e_{rz}$.

Here are the translation vectors

$$ e_{tx} = \left( m_1, 0, 0, m_2, 0, 0, \ldots, m_N, 0, 0 \right) $$
$$ e_{ty} = \left( 0, m_1, 0, 0, m_2, 0, \ldots, 0, m_N, 0 \right) $$
$$ e_{tz} = \left( 0, 0, m_1, 0, 0, m_2, \ldots, 0, 0, m_N \right) $$



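And here are the rotations (the signs follow from the gradients of the vector product components written above).

$$ e_{rx} = \left( 0, -m_1 z_1^0, m_1 y_1^0, 0, -m_2 z_2^0, m_2 y_2^0, \ldots, 0, -m_N z_N^0, m_N y_N^0 \right) $$
$$ e_{ry} = \left( m_1 z_1^0, 0, -m_1 x_1^0, m_2 z_2^0, 0, -m_2 x_2^0, \ldots, m_N z_N^0, 0, -m_N x_N^0 \right) $$
$$ e_{rz} = \left( -m_1 y_1^0, m_1 x_1^0, 0, -m_2 y_2^0, m_2 x_2^0, 0, \ldots, -m_N y_N^0, m_N x_N^0, 0 \right) $$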

A vector that is orthogonal to the above six does not include an overall rotation or translation component. So our goal is to work in the reduced space that does not include the above six. Note that the Eckart conditions are linear in the coordinates, so the constraint gradients are constant (independent of the current coordinates). This makes the manipulation of these vectors straightforward, and it needs to be done only once at the beginning of the calculation. One approach is to modify input vectors (project out from them the offending part). This is, however, quite expensive and will need further work for any incoming vector. A much simpler procedure is to modify the matrix, which is what we shall do.

*** IMPORTANT CORRECTION: For the application of the Eckart conditions to the optimization problem we must set all the masses to one. In the calculation of the potential the masses were not used, and they should not be used in the construction of the constraints either. So we have (for minimization)

$$ \sum_i \left( r_i - r_i^0 \right) = 0, \qquad r_i = (x_i, y_i, z_i) $$

and for rotation

$$ \sum_i r_i^0 \times \left( r_i - r_i^0 \right) = 0 $$

*** End of correction

Note that the six vectors so produced are not, in general, orthogonal to each other. They span the complete six-fold space of eigenvectors with eigenvalues that are equal to zero. However, using them within our procedure requires them to be orthonormal. We achieve this by performing the Gram-Schmidt process on the space of the constraint derivatives.

What we do is basically the following. We have a set of N linearly independent vectors. We pick one of them at random (call it, for convenience, $e_1$) and normalize it

$$ e_1' = \frac{e_1}{\left| e_1 \right|} $$

Let $e_2$ be the second vector that we pick from the set. We make it orthogonal to $e_1'$ and normalize it.

$$ e_2' = \frac{e_2 - \left( e_2^t e_1' \right) e_1'}{\left| e_2 - \left( e_2^t e_1' \right) e_1' \right|} $$



The next vector on the agenda is made orthogonal to the previously constructed vectors $e_1'$ and $e_2'$ and is then normalized. The same process is used for the rest of the basis vectors.
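The construction of the six vectors and their orthonormalization can be sketched as follows. This is a minimal illustration assuming unit masses (as required for minimization) and a hypothetical three-atom reference geometry; the function names are invented for the example. Each vector is made orthogonal to all previously accepted vectors, one at a time, and then normalized.

```python
import numpy as np

def eckart_vectors(ref_coords):
    """Gradients of the six Eckart constraints with all masses set to one.
    Rows: e_tx, e_ty, e_tz, e_rx, e_ry, e_rz (each of length 3N)."""
    n = len(ref_coords)
    vectors = np.zeros((6, 3 * n))
    for i, (x0, y0, z0) in enumerate(ref_coords):
        s = slice(3 * i, 3 * i + 3)
        vectors[0, s] = (1.0, 0.0, 0.0)
        vectors[1, s] = (0.0, 1.0, 0.0)
        vectors[2, s] = (0.0, 0.0, 1.0)
        vectors[3, s] = (0.0, -z0, y0)
        vectors[4, s] = (z0, 0.0, -x0)
        vectors[5, s] = (-y0, x0, 0.0)
    return vectors

def gram_schmidt(vectors, tol=1.0e-10):
    """Orthonormalize the rows of `vectors`, in the order given."""
    basis = []
    for v in vectors:
        w = np.array(v, dtype=float)       # private copy
        for b in basis:
            w = w - (b @ w) * b            # remove the component along b
        norm = np.linalg.norm(w)
        if norm < tol:
            raise ValueError("input vectors are linearly dependent")
        basis.append(w / norm)
    return np.array(basis)

# Hypothetical reference geometry: an equilateral triangle
ref = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.5, np.sqrt(3.0) / 2.0, 0.0]])
raw = eckart_vectors(ref)
ortho = gram_schmidt(raw)
print(np.round(ortho @ ortho.T, 12))       # close to the 6x6 identity
```

The loop subtracts projections from the partially orthogonalized vector rather than from the original one (the "modified" variant), which gives the same result in exact arithmetic and tends to behave better numerically.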

With an orthonormal representation of the constraint space, $\left\{ e_i' \right\}_{i=1,\ldots,6}$, we can redefine the second derivative matrix using a shifting procedure to MUCH higher values for all the overall body motions. We have

$$ \underline{\underline{U}}' = \underline{\underline{U}} + \lambda_{up} \sum_{i=1,\ldots,6} e_i' e_i'^{\,t} $$

where $\lambda_{up}$ is a very large number, in accord with the idea that we promoted earlier: once the inverse is computed, $1/\lambda_{up}$ will be significantly lower than anything else (say $\lambda_{up} = 10^8$ for the water dimer).

The modified second derivative matrix is now ready for prime-time NR optimization, since it has a straightforward inverse. (We must be a little careful, though: it is possible that a point on the energy surface will be found for which an eigenvalue of the second derivative matrix is zero even though the corresponding direction is not a global motion.) Here we ignore this possibility, assuming that we are sufficiently close to a minimum such that all the non-zero eigenvalues are positive. In the case that the quadratic expansion is accurate (and in sharp contrast to SDM) the minimization will converge in one step.
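The whole procedure can be tried on a small model. The sketch below is an illustration under stated assumptions, not the water dimer calculation: the potential is taken to be exactly quadratic around a hypothetical triangular reference geometry, the second derivative matrix is the harmonic-spring matrix of that geometry, all masses are one, and `numpy.linalg.qr` is used as a stand-in for the Gram-Schmidt step (it also returns an orthonormal basis of the same six-dimensional space).

```python
import numpy as np

def pair_hessian(ref, k=2.0):
    """Model second derivative matrix: springs between all pairs, at the minimum."""
    n = len(ref)
    hess = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = ref[i] - ref[j]
            unit = d / np.linalg.norm(d)
            block = k * np.outer(unit, unit)
            si, sj = slice(3 * i, 3 * i + 3), slice(3 * j, 3 * j + 3)
            hess[si, si] += block
            hess[sj, sj] += block
            hess[si, sj] -= block
            hess[sj, si] -= block
    return hess

def rigid_body_vectors(ref):
    """Six translation/rotation vectors (unit masses), one per row."""
    vectors = np.zeros((6, 3 * len(ref)))
    for i, (x0, y0, z0) in enumerate(ref):
        s = slice(3 * i, 3 * i + 3)
        vectors[0, s] = (1.0, 0.0, 0.0)
        vectors[1, s] = (0.0, 1.0, 0.0)
        vectors[2, s] = (0.0, 0.0, 1.0)
        vectors[3, s] = (0.0, -z0, y0)
        vectors[4, s] = (z0, 0.0, -x0)
        vectors[5, s] = (-y0, x0, 0.0)
    return vectors

ref = np.array([[0.0, 0.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.5, np.sqrt(3.0) / 2.0, 0.0]])
hess = pair_hessian(ref)

# Orthonormal basis of the rigid-body space (columns of q)
q, _ = np.linalg.qr(rigid_body_vectors(ref).T)

# Shift the six zero eigenvalues up to lambda_up
lambda_up = 1.0e8
shifted = hess + lambda_up * (q @ q.T)

# Distorted starting structure (arbitrary small displacement, hypothetical)
start = ref.ravel() + np.array([0.05, -0.02, 0.01, -0.03, 0.04,
                                0.00, 0.02, 0.01, -0.05])
grad_start = hess @ (start - ref.ravel())      # gradient of the quadratic model

# One NR step with the shifted matrix
new = start - np.linalg.solve(shifted, grad_start)
grad_new = hess @ (new - ref.ravel())

print("gradient norm before:", np.linalg.norm(grad_start))
print("gradient norm after :", np.linalg.norm(grad_new))
```

In this model the gradient has no component along the six rigid-body directions, so the shift leaves the physically relevant part of the step unchanged; only the internal distortion is removed, and the overall translation and rotation of the starting structure are left as they were.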

What will happen to our solution if the eigenvalu
