Principal Component Analysis in MD Simulation
Speaker: ZHOU Chen-Yang
Supervisor: Wu Yun-Dong
Methods to analyze MD trajectory
• Intuition-based coordinates
– RMSD with respect to native state
– Fraction of native contacts
– Radius of gyration
– Other observables
• Advantage
– Easy to understand
– Convenient to do
• Disadvantage
– Inaccurate
– Ineffective for non-native structures, or without a good reference structure
– Depend on previous knowledge
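As a rough illustration of two of the intuition-based coordinates listed above, the sketch below computes a radius of gyration and an RMSD with NumPy. The function names, the equal-mass default, and the toy four-atom structure are assumptions made for this example; the RMSD here also assumes the two frames are already superimposed, which a real analysis would have to do first.

```python
import numpy as np

def radius_of_gyration(coords, masses=None):
    """Radius of gyration of one frame; coords has shape (n_atoms, 3)."""
    coords = np.asarray(coords, dtype=float)
    if masses is None:
        masses = np.ones(len(coords))  # assumption: equal masses
    masses = np.asarray(masses, dtype=float)
    center = np.average(coords, axis=0, weights=masses)
    sq_dist = np.sum((coords - center) ** 2, axis=1)
    return np.sqrt(np.average(sq_dist, weights=masses))

def rmsd(coords, reference):
    """RMSD between two frames that are assumed to be already superimposed."""
    diff = np.asarray(coords, dtype=float) - np.asarray(reference, dtype=float)
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=1)))

# Toy 4-atom "native" structure: corners of a square in the xy-plane.
native = np.array([[1.0, 1.0, 0.0], [-1.0, 1.0, 0.0],
                   [-1.0, -1.0, 0.0], [1.0, -1.0, 0.0]])
frame = native * 1.5  # uniformly expanded copy of the native structure

rg_native = radius_of_gyration(native)
rg_frame = radius_of_gyration(frame)
rmsd_frame = rmsd(frame, native)
print(rg_native, rg_frame, rmsd_frame)
```

Both observables collapse a whole frame into one number, which is why they are easy to read but can hide differences between non-native structures.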
How to measure conformational change?
What we have to do:
• Reduce dimension
– Trajectory is too complicated
– A good projection should be able to separate signal from noise
• Classification/Clustering
– Classify structures into different states
• Algorithms include:
– PCA: Principal Component Analysis
– MDS: Multi-Dimensional Scaling
If we already have an optimal reaction coordinate
Then we have: free energy landscape, transition pathway, transition rate ...
But usually we don't, and it doesn't emerge automatically
dPCA vs RMSD
The figure represents the free energy landscape of Trp-zip2 at 300 K, using the Amber force field 99sb*-ildn, projected onto the 2nd principal component and RMSD.
General description of PCA
• The central idea of PCA is to:
– reduce the dimension
– retain the variation
• An example:
– (x,y) is a randomly generated dataset
  • var(x) = 3.2, var(y) = 2.3
– (x,y) is a mixture of points centered either at (0,0) or at (3,3)
– PCA generates new coordinates (x',y'), and x' captures most of the variation
  • var(x') = 5.5, var(y') = 0.99
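A minimal sketch of an example of this kind, assuming unit-variance Gaussian clusters of equal size (the slide does not state the spread, the sample size, or the random seed, so those are choices made here). For this assumed setup the population variance is 3.25 along each original axis, and 5.5 and 1.0 along the two rotated axes; a finite sample will scatter around those values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture of two clusters centred at (0, 0) and (3, 3).
# Cluster size and unit spread are illustrative assumptions.
n = 500
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(n, 2))
cluster_b = rng.normal(loc=3.0, scale=1.0, size=(n, 2))
data = np.vstack([cluster_a, cluster_b])

# PCA: diagonalise the covariance matrix of the centred data.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # re-sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# New coordinates (x', y') are the projections onto the eigenvectors.
projected = centered @ eigvecs
var_original = data.var(axis=0, ddof=1)
var_rotated = projected.var(axis=0, ddof=1)

print("var(x), var(y)  :", var_original)
print("var(x'), var(y'):", var_rotated)
```

The rotation only redistributes the variance between the axes, so the total is unchanged while the first new axis takes the largest share.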
Key question in understanding PCA
• In practice, the principal components (PCs) are linear combinations of the original coordinates.
• Suppose we have a data set with two columns, X1 and X2. If we generate a new column Z=a1X1+a2X2, what is the variance of Z?
Variance and covariance
Example: Z=X1+X2
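The equations for this example did not survive in the transcript. The standard identities are written out here as a worked step; they are textbook results rather than a copy of the slide.

```latex
\begin{align}
\operatorname{var}(X_1 + X_2)
  &= \operatorname{var}(X_1) + \operatorname{var}(X_2)
     + 2\,\operatorname{cov}(X_1, X_2) \\
\operatorname{var}(a_1 X_1 + a_2 X_2)
  &= a_1^2 \operatorname{var}(X_1) + a_2^2 \operatorname{var}(X_2)
     + 2 a_1 a_2 \operatorname{cov}(X_1, X_2)
\end{align}
```

The covariance term is what makes the variance of Z depend on the direction (a1, a2), not only on the variances of the individual columns.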
Why is it important? Because we are going to project the data set onto a new coordinate Z, and our aim is to choose the coefficients (a1, a2) that maximize the variance of Z.
Z=a1X1+a2X2:
Represented with matrix multiplication:
Covariance matrix: Σ
Coefficients of the original coordinates in the PC: α
var(Z)=var(α'X)=α'Σα
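The identity var(Z) = α'Σα can be checked numerically. The sketch below uses synthetic correlated columns (the data and the seed are assumptions for illustration) and compares the variance of Z computed directly with the quadratic form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated columns X1 and X2 (synthetic data for illustration).
x1 = rng.normal(size=2000)
x2 = 0.6 * x1 + rng.normal(scale=0.5, size=2000)
X = np.column_stack([x1, x2])

alpha = np.array([1.0, 1.0])          # Z = X1 + X2
sigma = np.cov(X, rowvar=False)       # 2x2 covariance matrix (ddof=1)

z = X @ alpha                         # the new column Z
var_direct = z.var(ddof=1)            # variance computed from Z itself
var_quadratic = alpha @ sigma @ alpha # alpha' Sigma alpha

print(var_direct, var_quadratic)
```

The two numbers agree up to floating-point error for any choice of `alpha`, so the search for a good projection can work with Σ alone.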
Next step: change α to search for the maximum of var(Z)
Z=X1+X2:
Maximize var(Z)
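First, we have to normalize α: α'α = 1
Then, maximizing var(Z) is equivalent to maximizing α'Σα - λ(α'α - 1), where λ is a Lagrange multiplier
Differentiating with respect to α and setting the result to zero gives Σα - λα = 0, i.e. Σα = λα
λ is an eigenvalue of Σ and α is the corresponding eigenvector. Since var(Z) = α'Σα = λ, the first PC corresponds to the largest eigenvalue.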
Eigenvalues plotted from largest to smallest
Pick the first several eigenvectors as PCs (strictly, as the coefficients of the PCs). Then project the data onto the PCs; the simplified data can be further analyzed with other techniques such as clustering.
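The procedure just described (diagonalize the covariance matrix, sort the eigenvalues, keep the leading eigenvectors, project) can be sketched in a few lines of NumPy. The function name `pca`, the synthetic six-coordinate "trajectory", and the choice of two components are assumptions for illustration only.

```python
import numpy as np

def pca(data, n_components):
    """Return sorted eigenvalues, the leading eigenvectors, and the projection."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending order for symmetric input
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    eigvals = eigvals[order]
    components = eigvecs[:, order][:, :n_components]
    projected = centered @ components        # data expressed in the leading PCs
    return eigvals, components, projected

rng = np.random.default_rng(2)
# Synthetic data: 1000 frames, 6 coordinates, driven by 2 hidden motions plus noise.
latent = rng.normal(size=(1000, 2)) * np.array([3.0, 2.0])
mixing = rng.normal(size=(2, 6))
frames = latent @ mixing + 0.1 * rng.normal(size=(1000, 6))

eigvals, components, projected = pca(frames, n_components=2)
explained = np.cumsum(eigvals) / eigvals.sum()  # cumulative fraction of variance
print("eigenvalues:", eigvals)
print("cumulative explained variance:", explained)
```

The cumulative fraction is one common way to decide how many PCs to keep; the `projected` array is what would be passed on to a clustering step.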
PCA in application: Cartesian coordinates
• Cartesian coordinates contain all the information
• But often noisy
cPCA: Cartesian PCA, which uses Cartesian coordinates
Mu, Y., Nguyen, P. H., & Stock, G. (2005). Proteins, 58(1), 45–52.
Comparison of cPCA and dPCA in the analysis of an Ala7 MD simulation
Dashed blue line: Cartesian PCA
Solid red line: PCA using dihedral angles
PCA in application: cPCA, dPCA and pPCA
Advantages:
1. reduction of dimensionality
2. constraint within the coordinates
Problems with dihedral angles:
1. dihedral angles are periodic
2. dihedral angles are not linear
In practice, each dihedral angle is transformed to its sin/cos values before PCA; this is called dPCA
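A minimal sketch of the sin/cos transformation, assuming the dihedral angles are already available in radians as an array of shape (frames, angles). The helper names and the synthetic two-basin trajectory are assumptions for this example, not the speaker's code. The short check at the top shows why the transformation helps: two angles on either side of the ±π boundary are far apart as raw numbers but close in sin/cos space.

```python
import numpy as np

def dihedral_features(angles):
    """Map angles (radians), shape (n_frames, n_angles), to [cos, sin] features."""
    angles = np.asarray(angles, dtype=float)
    return np.hstack([np.cos(angles), np.sin(angles)])

def dpca(angles, n_components=2):
    """PCA on the sin/cos-transformed dihedral angles."""
    features = dihedral_features(angles)
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], centered @ eigvecs[:, order[:n_components]]

# Periodicity check: nearly identical conformations across the +/- pi boundary.
a = np.array([[np.pi - 0.01]])
b = np.array([[-np.pi + 0.01]])
raw_gap = abs(a - b).item()
feature_gap = np.linalg.norm(dihedral_features(a) - dihedral_features(b))
print("raw angle gap:", raw_gap, " sin/cos gap:", feature_gap)

rng = np.random.default_rng(3)
# Synthetic trajectory: 500 frames, 4 angles hopping together between two basins.
state = rng.integers(0, 2, size=(500, 1))
angles = np.where(state == 0, -1.0, 2.5) + 0.2 * rng.normal(size=(500, 4))
eigvals, projected = dpca(angles)
```

Each angle contributes two features, so the covariance matrix here is 8x8 for 4 angles.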
Application of dPCA: (Aβ16-22)6
Nguyen, P. H., Li, M. S., Stock, G., Straub, J. E., & Thirumalai, D. (2007). PNAS, 104(1), 111–6.
Free-energy diagram projected onto the first two principal components V1 and V2 of the dPCA for the hexamer.
dPCA in RNA analysis: flexible choice of internal coordinates
Riccardi, L., Nguyen, P. H., & Stock, G. (2009). JPCB, 113(52), 16660–8.
• REMD simulation of a short β-hairpin, Trp-zip2, using:
– ff99sb-ildn
– ff99sb*-ildn
– ff99sb-ildn-nmr
– ff99C, our modified version of ff99sb-ildn
Using dPCA to compare the Trp-zip2 potential energy surface in different force fields
Free energy landscape of Trp-zip2 at 300 K, using the Amber force field 99sb*-ildn, projected onto the 1st and 2nd principal components from dPCA of the turn region. The energy surface is extended because the peptide cannot form a stable hairpin in this force field.
Native-like turn
Helical structure
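The slides do not say how the landscapes were computed. A common approach, sketched here as an assumption, is to histogram the projected frames and convert the bin probabilities P into a free energy with F = -kT ln P. The function name, bin count, units (kcal/mol), and the synthetic two-basin data are all illustrative choices.

```python
import numpy as np

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def free_energy_2d(pc1, pc2, temperature=300.0, bins=30):
    """Free energy surface from a 2D histogram of two projected coordinates."""
    hist, xedges, yedges = np.histogram2d(pc1, pc2, bins=bins)
    prob = hist / hist.sum()
    with np.errstate(divide="ignore"):
        free_energy = -K_B * temperature * np.log(prob)  # +inf where a bin is empty
    # Shift so that the most populated bin sits at zero.
    free_energy -= free_energy[np.isfinite(free_energy)].min()
    return free_energy, xedges, yedges

rng = np.random.default_rng(4)
# Synthetic projection: a deep basin near PC1 = -2 and a shallower one near PC1 = +2.
pc1 = np.concatenate([rng.normal(-2.0, 0.5, 4000), rng.normal(2.0, 0.5, 1000)])
pc2 = rng.normal(0.0, 0.5, 5000)

free_energy, xedges, yedges = free_energy_2d(pc1, pc2)
i_min, j_min = np.unravel_index(np.argmin(free_energy), free_energy.shape)
pc1_at_minimum = 0.5 * (xedges[i_min] + xedges[i_min + 1])
print("global minimum near PC1 =", pc1_at_minimum)
```

Empty bins are left at infinity rather than filled in, so unsampled regions stay visibly unsampled in a plot.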
The figure represents the free energy landscape of Trp-zip2 at 300 K, using the Amber force field 99sb-ildn, projected onto the 1st and 2nd principal components of 99sb*-ildn, using dPCA of the turn region.
Native-like turn
Helical structure
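Projecting one simulation onto the principal components of another means reusing both the mean and the eigenvectors of the reference data set, so that all landscapes share the same axes. The sketch below illustrates this with synthetic feature matrices; in the setting described here the features would be the sin/cos values of the turn-region dihedrals, and the function names are hypothetical.

```python
import numpy as np

def fit_reference_pca(reference, n_components=2):
    """Mean and leading eigenvectors of the reference data set."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]

def project(features, mean, components):
    """Project any data set onto the reference PCs, using the reference mean."""
    return (features - mean) @ components

rng = np.random.default_rng(5)
# Synthetic stand-ins for two simulations described by the same 6 features.
scale_ref = np.array([3.0, 2.0, 0.5, 0.5, 0.5, 0.5])
scale_other = np.array([1.0, 1.0, 0.5, 0.5, 0.5, 0.5])
reference = rng.normal(size=(800, 6)) * scale_ref
other = rng.normal(size=(600, 6)) * scale_other + 1.0  # shifted, narrower ensemble

mean, components = fit_reference_pca(reference)
ref_proj = project(reference, mean, components)
other_proj = project(other, mean, components)
print("reference centre:", ref_proj.mean(axis=0))
print("other centre    :", other_proj.mean(axis=0))
```

Because the second data set is centred with the reference mean rather than its own, a shift of its ensemble shows up as a displaced cloud in the shared PC plane.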
The figure represents the free energy landscape of Trp-zip2 at 300 K, using the Amber force field 99sb-ildn-nmr, projected onto the 1st and 2nd principal components of 99sb*-ildn, using dPCA of the turn region. 99sb-ildn-nmr cannot fold the Trp-zip2 hairpin.
Native-like turn
Helical structure
The figure represents the free energy landscape of Trp-zip2 at 300 K, using force field 99C, projected onto the 1st and 2nd principal components of 99sb*-ildn, using dPCA of the turn region. In our force field, Trp-zip2 forms a stable beta-turn, so it rarely samples other conformations.
Native-like turn
Summary
• PCA is a linear transformation of old coordinates to capture maximum variance
• Instead of Cartesian coordinates, dihedral angles could be a better choice for describing conformational change
• General coordinates or a subset of coordinates (for a region of interest) can be used for PCA
• The result of PCA can be used for further analysis such as clustering and transition rate calculation.
Thank you!