v24 hybrid-methods for macromolecular complexes

40
24. Lecture WS 2008/09 Bioinformatics III 1 V24 Hybrid-methods for macromolecular complexes Structural Bioinformatics (a) Integration of structures of various protein components into one large complex. What to do if density is too small or too large? Sali et al. Nature 422, 216 (2003)

Upload: jackie

Post on 06-Jan-2016

35 views

Category:

Documents


1 download

DESCRIPTION

V24 Hybrid-methods for macromolecular complexes. Structural Bioinformatics (a) Integration of structures of various protein components into one large complex. What to do if density is too small or too large?. Sali et al. Nature 422, 216 (2003). Correlation-based fitting. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 1

V24 Hybrid-methods for macromolecular complexes

Structural Bioinformatics

(a) Integration of

structures of various

protein components into

one large complex.

What to do if density is

too small or too large?

Sali et al. Nature 422, 216 (2003)

Page 2: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 2

Correlation-based fitting

Wriggers, Chacon, Structure 9, 779 (2001)

Correlation-mapping can also be used to position small fragments into large

templates.

It can also be adapted to accomodate molecular flexibility during fitting.

Page 3: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 3

Aim: Accelerated Correlation-Based Fitting with FFT

Wriggers, Chacon, Structure 9, 779 (2001)

The initial data sets are a low-resolution map (target) and an atomic structure (probe), corresponding to direct space densities em(r) and atomic(r), respectively (blue box). The probe molecule is subject to a rotation matrix R (red box) that can be constructed from the three Euler angles. After lowering the resolution of the atomic structure (by direct space convolution with a Gaussian g) to that of the target map, the rotated probe molecule corresponds to the simulated density calc(r). An optional filter e (e.g., a Laplacian) can be applied to both em (r) and calc(r) before the structure factors are computed (f denotes the FFT and the asterisk denotes the complex conjugate).

The definition of a direct space convolution of a density function b(r) with a kernel a(r) is given in the green box. The definition of the direct space correlation C as a function of a translational displacement T is given in the orange box. By virtue of the Fourier correlation theorem, C can be computed for all T from the inverse Fourier transform of the previously calculated structure factors.

Page 4: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 4

Matching densities

Orienting the two lattices can be done with respect to 6 degrees of freedom, 3 for

translation along x, y, and z, and 3 for rotation around the angles , , and .

Among all these possibilities, one wishes to identify the relative orientation x, y, z, , ,

that minimizes the sum of least squares

Here, R,, is a three-dimensional rotation matrix and Tx,y,z is a translation operator that

translates molecule B to the position x, y, z. Minimizing the sum of squared errors is

equivalent to maximizing the linear cross-correlation of A and B,

for a given translation vector (x,y,z) and rotation (, , ).

2,,,,),,,,,( BAzyxR zyx RT

N

l

N

m

N

nnmlzyxnmlzyx baC

1 1 1,,,,,,,,,, RT

Intuitively, we want to compute the

overlap of the two densities after placing

the two lattices on top of each other.

But what means 'on top of each other' in

mathematical terms?

Page 5: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 5

Situs package: Automated low-resolution fitting

Chacon, Wriggers J Mol Biol 317, 375 (2002)

The data sets need to be compared at comparable resolution project atomic structure B on the cubic lattice of the EM data A

by tri-linear interpolation, and convolute each lattice points bl,m,n with a

Gaussian function g.

N

l

N

m

N

nznymxlnmlzyx bgaC

1 1 1,,,,,,

The complexity of computing this correlation

for all translations in direct space is O(N6):

O(N3) for every value of x,y,z.

The total complexity of this algorithm

is therefore O(N6) number of rotations

Page 6: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 6

Laplacian filter for edge enhancement In the absence of hard boundaries, the contour of a low-resolution object is

contained in the 3D edge information instead of a 2D surface.

A simple and computationally cheap filter for 3D edge enhancement is the

Laplacian filter:2

2

2

2

2

22

z

f

y

f

x

ff

that approximates the Laplace operator of the second derivative.

Applied to the density gradient on a grid, the Laplacian filtered density

can be quickly computed by a finite difference scheme:

ijkijkijkkijkijjkijki

ijkijkijkijk

ijkkijijkkij

ijkjkiijkjkiijk

aaaaaaa

aaaa

aaaa

aaaaa

6111111

11

11

112

where aijk and 2aijk represent the density and the Laplacian filtered density at

grid point (i,j,k). The expression compares the values at grid points +1 and -1

along all three directions to the value of the grid point ijk.

Page 7: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 7

Schematic view of a Laplacian filter

ai-1jk, aijk, and ai+1jk are the density values at three neighboring grid points in one

direction. The grey lines denote the difference between the central point and

the values to the left and to the right. These are finite difference approximations

of the first derivative left and right of the grid point ijk.

The dotted line and dotted arrow illustrate how the two first derivatives are

combined to obtain an approximation of the second derivative at grid point ijk

by finite difference as ai+1jk + ai-1jk -2 aijk.

Page 8: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 8

Effect of Laplacian filter

111111,,,,2 6 lmnlmnnlmnlmmnlmnlnmlnml aaaaaaaa

Chacon, Wriggers J Mol Biol 317, 375 (2002)

Include „surface“ information in the volume docking procedure.

Laplacian filter:

Effect of Laplacian filter:Left: cross-section of 15Å simulated density of RecA hexameric structure.Right: same density after application of Laplacian filter.

Secondary derivatives are maximal here because signal increases in various directions.

Page 9: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 9

Efficient evaluation of correlation by FFT

N

l

N

m

N

nznymxlnmlzyx bgaC

1 1 1,,

2,,

2,,

Chacon, Wriggers J Mol Biol 317, 375 (2002)

Geometric match between two molecules A and B can be measured by the

Laplacian cross-correlation:

6D rigid-body search has complexity N6.

Common problem in protein-ligand and in protein-protein docking.

Efficient solution (Katchalski-Kazir algorithm):

use FFT because FFT has complexity N3logN3

nmlnmlzyx bgFFTaFFTFFTC ,,2*

,,21

,,

Page 10: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 10

Situs package: success case

Chacon, Wriggers J Mol Biol 317, 375 (2002)

Fitting of tubulin components to an

experimental 20Å resolution map

of microtubuli.

Without any a priori consideration

about the relative orientation of

and tubulins, the atomic

structure of the -tubulin dimer

could be reconstructed to within

2Å of the known dimer X-ray

structure (labeled by Nogales et

al.).

Page 11: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 11

Core-weighted fitting + Grid-threading Monte-Carlo

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

Idea: define „core“ region of a structure as the part whose density distribution is

unlikely to be altered by the presence of adjacent components.

„Surface“ region: is accessible/may interact with other components.

Use again Laplacian filter defined by a finite difference approximation to define the

boundary of the surface:

where aijk and 2aijk represent the density and the Laplacian filtered density at grid

point (i,j,k).

ijkijkijkkijkijjkijkiijk aaaaaaaa 61111112

Page 12: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 12

Core-weighted fitting I core index

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

Define core index, which describes the depth of a grid point located within this

core:

where fijk is the core index of grid point (i,j,k),

ac is a cutoff density

min[fi1jk, fij1k ,fijk1] represents the minimum core index of the neighboring grid

points around grid point (i,j,k).

otherwise1min

0minand00

0minand0

1,1,1

1,1,12

1,1,1

ijkkijjki

ijkkijjkiijk

ijkkijjkicijk

ijk

fff

fffa

fffaa

f

Page 13: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 13

Core-weighted fitting I core index

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

The core index is zero for grid points outside the core and increases progressively

for grid points located deeper in the core.

A grid point outside the core region must neighbor at least one grid point that is

also outside the core.

A grid point within the core cannot neighbor a grid point outside the core unless it

satisfies the condition 2aijk 0 and aijk > ac.

Use this iterative procedure for calculating the core incex:

(a) initialize core index so that all core indices are 1 except the grid points at the

boundary

(b) loop over all grid points

(c) repeat (b) until all grid points satisfy equation on p.31.

otherwise1

or1oror1oror10 zyxijk

nkknjjniif

Page 14: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 14

Core indices for 2 proteins and their complex

Grid points labelled by value of core

index.

Regions of protein density are colored

grey.

For both proteins, the core index is 0

outside the domains, 1 at the outer

edge and becomes larger inside the

proteins.

Bold numbers indicate the core indices

of proteins A and B that change upon

formation of the AB complex.

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

Page 15: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 15

Core-weighted correlation function

nm

nmnmmn aa

aaaaDC

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

The match in density between two maps is again described by a density

correlation function (DC):

m and n refer to the two maps being compared,

x y zn

i

n

j

n

kzyx

kjiannn

a ,,1

and 22 aaa

represent the average and fluctuation of the density fluctuation.

Alternatively, one can use the Laplacian correlation (LC)

nm

nmnmmn aa

aaaaLC

22

2222

Page 16: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 16

Core-weighted fitting I core index - properties

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

We expect the following features when we consider the match between the map

of an individual component and the map of a multicomponent assembly:

1. If the core region of an individual component matches the core region of the

complex, the distribution property of this core region should not change

appreciably for the correct fit.

2. If the surfaces match, the distribution property of this surface region should

not change appreciably for the correct fit.

3. If the surface (low core index) of an individual component matches the core

(high core index) of the complex, the distribution property of the surface

region should change significantly for the correct fit.

4. If the core (high core index) of an individual component matches the surface

(low core index) of the complex, it cannot be a correct fit.

A correlation function works fine for scenarios 1, 2, and 4 to distinguish the correct

fit from wrong fits.

Page 17: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 17

Core-weighted fitting I core index - algorithm

cbff

fw

an

am

am

mn

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

one needs to minimize the contribution from scenario 3 in the correlation

function calculation. Can be achieved by „down-weighting“ such matches.

Use

where wmn is the core-weighting function for the individual component m to the

complex n. a, b, c are suitable parameters.

core-weighted correlation function

where represents a core-weighted average of property X:

nwmw

wnwmwnmmn XX

XXXXXCW

kjimn

kjimn

w kjiw

kjiXkjiw

X

,,

,,

,,

,,,,

wX

and 22www XXX

Page 18: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 18

Core-weighted fitting I core index - algorithm

nwmw

wnwmwnmmn aa

aaaaCWDC

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

If we choose densities for the calculation, we obtain the core-weighted density

correlation (CWDC)

and if we choose to apply the Laplacian filter, we obtain the core-weighted

Laplacian correlation (CWLC)

nwmw

wnwmwnmmn aa

aaaaCWLC

22

2222

The core-weighted correlation functions are designed to down-weight the regions

overlapping with other components, while emphasizing the regions with no

overlap.

Page 19: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 19

Grid-threading Monte-Carlo

Shown on the right is a grid-threading

Monte Carlo search in 2D. It is a

combination of a grid search and a

Monte Carlo sampling.

The conformational space is divided

into a 3×3 grid. From each of the 9 grid

points, short MC searches (shown as

purple curves) are performed to locate a

nearby local maximum.

The global maximum is identified from

among these local maxima. Only

conformations along the 9 Monte Carlo

paths are searched.

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

Page 20: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 20

Algorithm

(1) For a protein component, divide 6D search

space to provide initial conformational states

covering the whole space:

nx ny nz for translational sampling

n n n for rotational sampling

(2) Perform MC search starting from each grid

point over NMC steps. At each ‚move‘ the

component is translated along a random vector

(xr, yr, zr) and then rotated around x,y,z axes

for random angles (r,r,r).

A ‚trial move‘ is accepted if

and rejected otherwise.

T is a reduced temperature.

Wu et al. J Struct Biol 141, 63 (2003)

T

CC oldnewexp

Page 21: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 21

Algorithm

(3) Nonoverlapping local maxima are stored in

sorted, linked list. Step (2) is repeated until

all grid points are searched

(4) Identify global maximum from linked list and

assign to component.

(5) Repeat steps (1) to (4) until all components

have been fitted into the density map.

Wu et al. J Struct Biol 141, 63 (2003)

Page 22: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 22

Test of Core-weighting method

(a) The X-ray structure of TCR

variable domain (PDB code:

1A7N) and a 15 Å map generated

from the structure using pdblur

from Situs.

(b) The -chain (red) at the

maximum density correlation

position. The -chain is at its X-

ray position for reference.

Wu et al, J Struct Biol 141, 63 (2003)

Observation: DC identifies wrong global

maximum for this 15 Å map.

Other methods are more stable at lower

resolutions (see table).

Page 23: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 23

Performance of systematic sampling

The maximum core-weighted density

correlations between the map of TCR

-chain and the map of the TCR

complex identified from grid searches of

the 6D conformational space (n6 grid

points). 15 Å resolution maps.

Black dashed line: correlation value for

the X-ray coordinates.

An exponential increase in grid

sampling size is required to improve the

correlation values.

grid searches are computationally

inefficient.

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

Page 24: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 24

Performance of grid search and Monte Carlo

The core-weighted density correlation

function as before during Monte Carlo

searches starting from each of the 26

grid points.

The Monte Carlo searches were

performed with max=15 Å, max=30°,

and T=0.01. Each line represents one

Monte Carlo search procedure.

The ability to converge to the correct fit

and the speed of convergence depend

significantly on the starting position.

Wu et al. J Struct Biol 141, 63 (2003)

Useful strategy: identify best local fit

by short MC search. Select global fit

among these candidates.

This is the basis of the grid-threading

MC search.

Page 25: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 25

Performance of different correlation functions

The rms deviations of the best fits

from the X-ray structure using

different correlation functions.

RMSD > 20 Å indicates that search

converged to a far maximum.

MC with DC alone does not converge to the correct fit. This is due to the

fact that map resolutions were 15 Å or worse where DC does not work.

Laplacian correlation works until 15 Å,

Core-weighted density correlation until 20 Å

and core-weighted Laplacian correlation even at 30 Å.

Wu et al. J Struct Biol 141, 63 (2003)

Page 26: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 26

Success case

(a) Surface representation of the experimental map (at 14 Å resolution) of the

icosahedral complex formed from 60 copies of the E2 catalytic domain of the

pyruvate dehydrogenase.

(b) The X-ray structure of the same complex (PDB code: 1B5S).

Wu, Milne, .., Subramaniam, Brooks, J Struct Biol 141, 63 (2003)

Page 27: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 27

Success case continued

Comparison of the location of the E2 catalytic domain obtained using a GTMC search

(green) with that of the corresponding domain from the X-ray structure (red). The

experimental EM map is shown in blue.

(a) The best fit obtained, RMS=2.13 Å;

(b). The worst fit obtained, RMS=6.52 Å. The grid-threading Monte Carlo search was

conducted with a 46 grid, Nmc=5000, max=30 Å, max=30°, and T=0.01.

The core-weighted Laplacian correlation function was used. The average RMSD

of the C backbone (averaged over all 60 copies) between the X-ray structure

and the fitted coordinates is 3.73 Å.

Wu et al. J Struct Biol 141, 63 (2003)

Page 28: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 28

SOM: surface overlap maximization

Ceulemans, Russell J. Mol. Biol. 338, 783 (2004)

I preprocessing: all voxels with density < cut-off are set to ‚false‘

all remaining voxels to ‚true‘ ‚template volume‘

‚target volume‘ (atomic structure in PDB format):

placed in a 3D grid with voxel size equal to that of the above

density map.

For grid voxel i, i [1,3N]for all atoms in voxel i

sum #electrons

end

store estimate of electron density in voxel i

end

smoothen model to the resolution of the density map.

Page 29: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 29

SOM (II) fast fitting round

Ceulemans, Russell J. Mol. Biol. 338, 783 (2004)

Score goodness-of-fit by surface overlap: fraction of surface voxels of the

transformed target that are superimposed on template surface.

Determine all combinations of translations and rotations (around origin) that

project at least one surface voxel of the target onto the template surface.

Effort? target surface voxel a and template surface voxel b

find set of transformations that superimpose a onto b.

Each such transformation can be decomposed into the unique translation of a

to b and a rotation about b.

Expectation: rotations need to be searched exhaustively.

Page 30: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 30

SOM (II) fast fitting round

Ceulemans, Russell J. Mol. Biol. 338, 783 (2004)

Interestingly, many rotations about b need not to be explored.

If a really is the counterpart of b, the optimal transformation will superimposed the plane

tangent to the target surface in a onto the plane tanget to the template surface in b.

only 1 rotational degree of freedom, around vb, has to be searched

In practice, the vector va, is estimated: a and its 26 spatial neighbors are interpreted as

vectors. Subtract all neighbors of a that score ‚true‘ in the volume matrix, from a.

Page 31: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 31

SOM (II) fast fitting round

Ceulemans, Russell J. Mol. Biol. 338, 783 (2004)

Page 32: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 32

SOM (II) fast fitting round

Ceulemans, Russell J. Mol. Biol. 338, 783 (2004)

Page 33: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 33

SOM (II) fast fitting round

Ceulemans, Russell J. Mol. Biol. 338, 783 (2004)

Page 34: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 34

Mod-EM

Topf, ..., Sali J. Mol. Biol. 357, 1655 (2006)

Task: Comparative (homology) modelling is imprecise at sequence identity levels

of 10 % x 30 %, the so-called „twilight zone“.

Idea: use different homology models, combine with experimental EM density.

Select model with best combined fitness function.

csZwZwF21

Zs : (statistical potential score – mean ) / standard deviation The statistical potential score of a model is the sum of the solvent accessibility terms for all C atoms and

distance-dependent terms for all pairs of C and C atoms. The solvent-accessibility term for a C atom

depends on its residue type and the number of other C atoms within 10Å; the non-bonded terms depend

on the atom and residue types spanning the distance, the distance itself, and the number of residues

separating the distance-spanning atoms in the sequence. These potential terms reflect the statistical

preferences observed in 760 non-redundant proteins of known structure.

The density-fitting Z c-score is the maximized cross-correlation coefficient between the cryoEM density map and the probe (model) density calculated with Mod-EM. The normalization relies on the mean and standard deviation obtained from a population of ca. 7500 alignments constructed in 25 iterations of the Moulder program with the original fitness function that depends only on the statistical potential. When the fit is good, the density-fitting Z-score is positive; it usually ranges from -10 to 10. Five protocols of Moulder-EM were tested, corresponding to different weights ([w1,w2]) of [1,0], [1,1], [1,2], [1,8], and [0,1] for the statistical potential Z-score and the density-fitting Z-score in the fitness function, respectively.

Page 35: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 35

Mod-EM

Topf, ..., Sali J. Mol. Biol. 357, 1655 (2006)

Page 36: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 36

Mod-EM

Topf, ..., Sali J. Mol. Biol. 357, 1655 (2006)

Page 37: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 37

Mod-EM

Topf, ..., Sali J. Mol. Biol. 357, 1655 (2006)

Page 38: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 38

Mod-EM

Topf, ..., Sali J. Mol. Biol. 357, 1655 (2006)

Page 39: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 39

Mod-EM

Topf, ..., Sali J. Mol. Biol. 357, 1655 (2006)

Page 40: V24 Hybrid-methods for macromolecular complexes

24. Lecture WS 2008/09

Bioinformatics III 40

Summary

Fitting objects into densities has become a standard area of structural

bioinformatics.

Main technique: compute the correlation of two densities.

This can be efficiently done after Fourier transformation of the densities.

Laplace filtering of the densities enhances the contrast.

SOM: attempts matching of surface details

(fast speed due to reduction of search space).

Mod-EM: employs structure fitting as tool to support homology modelling in the

twilight zone.