conformational space
Conformational Space
Conformation of a molecule: specification of the relative positions of all atoms in 3D space
Typical parameterizations:
- List of coordinates of atom centers
- List of torsional angles (e.g., the φ-ψ-χ angles for a protein)
Conformational space: the space of all conformations
[Figure: a conformation described by parameters q1, q2, ..., qN]
Relation to Robotics/Graphics
[Figure: linkage with parameters q0, q1, ..., qn; label: Configuration space]
Need for a Metric
Simulation and sampling techniques can produce millions of conformations
- Which conformations are similar?
- Which ones are close to the folded one?
- Do some conformations form small clusters (e.g., key intermediates while folding)?
Metric in Conformational Space
A metric over conformational space C is a function:
$d: (c, c') \in C \times C \mapsto d(c, c') \in \mathbb{R}^{+} \cup \{0\}$
such that:
- $d(c, c') = 0 \iff c = c'$ (non-degeneracy)
- $d(c, c') = d(c', c)$ (symmetry)
- $d(c, c') + d(c', c'') \ge d(c, c'')$ (triangle inequality)
But not all metrics are “good”
Euclidean metric:
$d(c, c') = \sum_{i=1}^{n}\left(|\varphi_i - \varphi_i'|^2 + |\psi_i - \psi_i'|^2\right)$
Metric in Conformational Space
A “good” metric should measure how well the atoms in two conformations can be aligned
Usual metrics: cRMSD, dRMSD
RMSD
Given two sets of n points in $\mathbb{R}^3$, $A = \{a_1, \dots, a_n\}$ and $B = \{b_1, \dots, b_n\}$
The RMSD between A and B is:
$\mathrm{RMSD}(A, B) = \left[\frac{1}{n}\sum_{i=1}^{n}\|a_i - b_i\|^2\right]^{1/2}$
where $\|a_i - b_i\|$ denotes the Euclidean distance between $a_i$ and $b_i$ in $\mathbb{R}^3$
RMSD(A, B) = 0 iff $a_i = b_i$ for all i
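The RMSD definition above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the two point sets are given as arrays of shape (n, 3) with matching row order; the function name `rmsd` is hypothetical.

```python
import numpy as np

def rmsd(A, B):
    """RMSD between two (n, 3) arrays of paired points, with no alignment step."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape != B.shape:
        raise ValueError("A and B must have the same shape")
    # Squared Euclidean distance for each pair (a_i, b_i), averaged, then square-rooted.
    return float(np.sqrt(np.mean(np.sum((A - B) ** 2, axis=1))))
```

Note that this value changes if one point set is merely rotated or translated, which is what motivates the minimization over rigid-body transforms in cRMSD.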
cRMSD
- Molecule M with n atoms $a_1, \dots, a_n$
- Two conformations c and c’ of M
- $a_i(c)$ is the position of $a_i$ when M is at c
cRMSD(c, c’) is the minimized RMSD between the two sets of atom centers:
$\min_T \left[\frac{1}{n}\sum_{i=1}^{n}\|a_i(c) - T(a_i(c'))\|^2\right]^{1/2}$
where the minimization is over all possible rigid-body transforms T
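The slides do not say how the minimization over T is carried out. One standard choice, assumed here, is the Kabsch algorithm: superimpose the centroids, then obtain the optimal rotation from the SVD of a 3 × 3 covariance matrix. The sketch below takes two (n, 3) arrays with corresponding rows; the name `crmsd` is hypothetical.

```python
import numpy as np

def crmsd(P, Q):
    """Minimum RMSD between (n, 3) arrays P and Q over rigid-body transforms of Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # The optimal translation superimposes the two centroids.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # Kabsch step: SVD of the 3x3 covariance matrix between the two point sets.
    H = Q0.T @ P0
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (det = +1) so that reflections are excluded.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    Q_aligned = Q0 @ R.T
    return float(np.sqrt(np.mean(np.sum((P0 - Q_aligned) ** 2, axis=1))))
```

Because reflections are excluded, a conformation and its mirror image generally have a nonzero cRMSD.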
cRMSD
- cRMSD satisfies the triangle inequality
- cRMSD takes linear time to compute
- Often, cRMSD is restricted to a subset of atoms, e.g., the Cα atoms on a protein’s backbone
Representation Restricted to Cα Atoms
Protein 1tph
- The positions of the amino-acid residue centers (Cα atoms) mainly determine the structure of a protein.
- In structural comparison, people usually work only on the backbone of Cα atoms and neglect the other atoms.
Possible project: Design a method for efficiently finding nearest neighbors in a sampled conformation space of a protein, using the cRMSD metric.
dRMSD
- Molecule M with n atoms $a_1, \dots, a_n$
- Two conformations c and c’ of M
- $\{d_{ij}(c)\}$: n × n symmetric intra-molecular distance matrix in M at c
dRMSD(c, c’) is:
$\left[\frac{2}{n(n-1)}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left(d_{ij}(c) - d_{ij}(c')\right)^2\right]^{1/2}$
$\{d_{ij}\}$ is usually restricted to a subset of atoms, e.g., the Cα atoms on a protein’s backbone
Intra-Molecular Distance Matrix
Distances between Cα pairs of a protein with 142 residues. Darker squares represent shorter distances.
dRMSD
- Advantage: no aligning transform is needed
- Drawback: takes quadratic time to compute
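A minimal NumPy sketch of dRMSD, taken as the root of the mean squared difference over the n(n-1)/2 atom pairs with i < j. It assumes each conformation is an (n, 3) array; the helper names are hypothetical. Building the two n × n distance matrices is where the quadratic cost comes from.

```python
import numpy as np

def distance_matrix(X):
    """n x n matrix of Euclidean distances between the rows of an (n, 3) array."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))

def drmsd(X, Y):
    """dRMSD between two conformations, averaged over the pairs i < j."""
    DX, DY = distance_matrix(X), distance_matrix(Y)
    iu = np.triu_indices(len(DX), k=1)  # index pairs with i < j
    return float(np.sqrt(np.mean((DX[iu] - DY[iu]) ** 2)))
```

No superposition is computed: the intra-molecular distances are already unchanged by rotations and translations.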
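Is dRMSD a metric?
$\mathrm{dRMSD}(c, c') = \left[\frac{2}{n(n-1)}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\left(d_{ij}(c) - d_{ij}(c')\right)^2\right]^{1/2}$
is a metric in the n(n-1)/2-dimensional space, where a conformation c is represented by $\{d_{ij}(c)\}$
But, in this representation, the same point represents both a conformation and its mirror image, so dRMSD(c, c’) = 0 does not imply that c and c’ are the same conformation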
k-Nearest-Neighbors Problem
Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c (w.r.t. cRMSD, dRMSD, other metric)
Can be done in time $O(N(\log k + L))$ where:
- N = size of S
- L = time to compare two conformations
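One way to reach this bound, sketched here as an assumption rather than as the method used in the slides, is a single scan of S that keeps the k best candidates in a bounded heap. Python's `heapq.nsmallest` does this; `dist` stands for any of the metrics above, and the name `k_nearest` is hypothetical.

```python
import heapq

def k_nearest(S, query, k, dist):
    """Indices of the k conformations in S closest to `query`, nearest first."""
    # Each conformation costs one distance evaluation (L) plus O(log k) heap work.
    return heapq.nsmallest(k, range(len(S)), key=lambda i: dist(S[i], query))
```

For example, `k_nearest(S, c, 100, drmsd)` would return 100 indices into S, given some function `drmsd` that compares two conformations.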
k-Nearest-Neighbors Problem
The total time needed to compute the k nearest neighbors of every conformation in S is $O(N^2(\log k + L))$
Much too long for large datasets, where N ranges from tens of thousands to millions!
Can be improved by:
1. Reducing L
2. Using a more efficient algorithm (e.g., kd-tree)
kd-Tree
In a d-dimensional space, where d > 2, range searching for a point takes $O(d\,n^{1-1/d})$ time
k-Nearest-Neighbors Problem
Idea: simplify the protein’s description
- cRMSD: $O(n)$ time
- dRMSD: $O(n^2)$ time
Assume that each conformation is described by the coordinates of the n Cα atoms
This representation is highly redundant:
- Proximity along the chain entails spatial proximity
- Atoms can’t bunch up, hence atoms that are far apart along the chain are on average spatially distant
m-Averaged Approximation
- Cut the backbone into fragments of m Cα atoms
- Replace each fragment by the centroid of the m Cα atoms
- Simplified cRMSD and dRMSD
- 3n coordinates → 3n/m coordinates
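The m-averaging step can be sketched as follows, assuming the Cα trace is an (n, 3) NumPy array ordered along the chain. The slides do not say what happens when n is not a multiple of m; here the shorter trailing fragment is simply averaged on its own, which is an assumption. The name `m_average` is hypothetical.

```python
import numpy as np

def m_average(X, m):
    """Replace each run of m consecutive atoms by its centroid."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Assumption: a trailing fragment with fewer than m atoms is averaged as is.
    return np.array([X[i:i + m].mean(axis=0) for i in range(0, n, m)])
```

The simplified cRMSD or dRMSD is then the ordinary measure applied to `m_average(X, m)` and `m_average(Y, m)` instead of the full traces.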
Evaluation: Test Sets [Lotan and Schwarzer, 2003]
- 8 diverse proteins (54-76 residues)
- Decoy sets of N = 10,000 conformations from the Park-Levitt set [Park et al, 1997]
Correlation between the m-averaged and the exact measure:
m     cRMSD        dRMSD
3     0.99         0.96-0.98
4     0.98-0.99    0.94-0.97
6     0.92-0.99    0.78-0.93
9     0.81-0.98    0.65-0.96
12    0.54-0.92    0.52-0.69
Higher correlation for random sets (→ greater savings)
Running Times
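Further Reduction for dRMSD
1) Stack m-averaged distance matrices as vectors of a matrix A
A is an r × N matrix whose ith column $a_i$ is the vector of elements of the distance matrix of the ith conformation (i = 1 to N), with
$r = \frac{1}{2}\,\frac{n}{m}\left(\frac{n}{m} - 1\right)$
$\mathrm{dRMSD}_m^2(c_i, c_j) = \frac{1}{r}\|a_i - a_j\|^2$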
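The stacking step might look like the sketch below, which assumes each (already m-averaged) conformation is a NumPy array of centroid coordinates with one row per fragment. The function names are hypothetical. With this layout, the dRMSD between conformations i and j is the Euclidean distance between columns i and j of A, scaled by 1/√r.

```python
import numpy as np

def distance_vector(X):
    """Entries d_ij with i < j of the intra-molecular distance matrix of X."""
    X = np.asarray(X, dtype=float)
    D = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    return D[np.triu_indices(len(X), k=1)]

def stack_conformations(confs):
    """r x N matrix A whose i-th column is the distance vector of conformation i."""
    return np.column_stack([distance_vector(X) for X in confs])

def drmsd_from_columns(A, i, j):
    """dRMSD between conformations i and j, read off the columns of A."""
    r = A.shape[0]
    return float(np.sqrt(np.sum((A[:, i] - A[:, j]) ** 2) / r))
```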
SVD Decomposition
2) Compute the SVD $A = U D V^T$
- A (r × N): column $a_j$ is the vector of elements of the distance matrix of the jth conformation (j = 1 to N)
- U (r × r): orthonormal (rotation) matrix
- D (r × r): diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge s_r \ge 0$
- $V^T$ (r × N): matrix with orthonormal rows $v_1^T, \dots, v_r^T$; any two distinct $v_j$ and $v_k$ are orthogonal unit N × 1 vectors
[Figure: representation of A in the rotated coordinate system (X, Y) of the r-dimensional space]
$\mathrm{dRMSD}_m^2(c_i, c_j) = \frac{1}{r}\|a_i - a_j\|^2$, with $r = \frac{1}{2}\,\frac{n}{m}\left(\frac{n}{m} - 1\right)$, does not depend on the coordinate system!
Since $\|s_1 v_1\| \ge \|s_2 v_2\| \ge \dots$, keep only the p principal components $s_1 v_1^T, \dots, s_p v_p^T$ and set $s_{p+1}, \dots, s_r$ to 0
Further Reduction for dRMSD
1) Stack m-averaged distance matrices as vectors of a matrix A
2) Compute the SVD $A = U D V^T$
3) Project onto p principal components
Correlation between dRMSD and $\mathrm{dRMSD}_4^{PC}$
$\mathrm{dRMSD}_4^{PC}$ is reduced to summing up 12 to 20 terms (instead of ~80 to 200, since the proteins have 54 to 76 amino acids)
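Steps 2 and 3 can be sketched with `numpy.linalg.svd`. The code assumes A is the r × N matrix of stacked distance vectors and, following the slides as written, takes the SVD of A directly without mean-centering. Both function names are hypothetical.

```python
import numpy as np

def pc_coordinates(A, p):
    """Coordinates (p x N) of the columns of A on the first p left singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # U[:, :p].T @ A equals diag(s[:p]) @ Vt[:p]; the projection form is used for clarity.
    return U[:, :p].T @ A

def drmsd_pc(B, i, j, r):
    """Approximate dRMSD from reduced coordinates B, normalized by the original r."""
    return float(np.sqrt(np.sum((B[:, i] - B[:, j]) ** 2) / r))
```

With p = r the reduced distance equals the original one, since U only rotates the coordinate system. With p < r it can only underestimate it, and each comparison sums p terms instead of r.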
Complexity of SVD
- SVD of an r × N matrix, where N > r, takes $O(r^2 N)$ time
- Here $r \sim (n/m)^2$
- So, time complexity is $O((n/m)^4 N)$, i.e., $O(n^4 N)$ for fixed m
- Would be too costly without m-averaging
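As a rough worked example of the saving, take n = 76 residues (the largest protein size quoted for the test sets) and compare m = 4 with no averaging (m = 1). The number of centroid pairs is counted exactly as r = (1/2)(n/m)(n/m - 1) rather than with the estimate r ~ (n/m)^2. The $O(r^2 N)$ bound hides constants, so the ratio is only indicative.

```latex
\begin{align*}
m = 4:&\quad r = \tfrac{1}{2}\cdot 19 \cdot 18 = 171 \\
m = 1:&\quad r = \tfrac{1}{2}\cdot 76 \cdot 75 = 2850 \\
\frac{r_{m=1}^{2}\,N}{r_{m=4}^{2}\,N} &= \left(\frac{2850}{171}\right)^{2} = \left(\frac{50}{3}\right)^{2} \approx 278
\end{align*}
```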
Evaluation for 1CTF Decoy Sets
[Lotan and Schwarzer, 2003]
- N = 100,000, k = 100, 4-averaging, 16 PCs
- 70% correct, with furthest NN off by 20%
- Brute-force: 84 h
- Brute-force + m-averaging: 4.8 h
- Brute-force + m-averaging + PC: 41 min
- kd-tree + m-averaging + PC: 19 min
- Speedup greater than x200
- 6k approximate NNs contain all true k NNs
- Use m-averaging and PC reduction as fast filters
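The "fast filter" idea can be sketched as a two-stage search: shortlist 6k candidates with the cheap approximate distance, then re-rank only those with the exact one. The factor 6 comes from the observation above for this particular evaluation and may not carry over to other datasets. The function and parameter names are hypothetical.

```python
import heapq

def filtered_k_nearest(S, query, k, cheap_dist, exact_dist, factor=6):
    """k nearest by exact_dist, searched only among the factor*k best by cheap_dist."""
    # Stage 1: cheap filter over all of S (e.g., m-averaged, PC-reduced dRMSD).
    candidates = heapq.nsmallest(
        factor * k, range(len(S)), key=lambda i: cheap_dist(S[i], query)
    )
    # Stage 2: the exact measure is evaluated on the shortlist only.
    return heapq.nsmallest(k, candidates, key=lambda i: exact_dist(S[i], query))
```

The expensive measure is then evaluated 6k times per query instead of N times.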